
The evolution of skill discovery in virtual assistants

Dear Readers,
Last Friday, I was enjoying a glass of wine at my friends’ apartment when I caught a glimpse of a newly purchased Google Home that sat in the corner. My curiosity for all embodied AIs took over and I asked them what they use their new tech for. “Besides playing music,” they answered, “We ask it for the time, the weather, and…to turn the lights on and off.”
Like many consumers, my friends have encountered the skill discovery problem in voice-based smart technology. While a smart speaker can do more than report the weather, turn on the lights, and order food—4,200 things for Google Assistant alone—it cannot effectively communicate the countless ways in which it can assist us. Since skill discovery is a crucial element in making virtual assistants more effective and humanlike, we will explore the evolution of skill discovery, covering its challenges, current progress, and how it will continue to develop in the future.

Skill discovery in virtual assistants (Illustration: Luniapilot)
Skill Discovery: 2 challenges
Benedict Evans (Andreessen Horowitz) calls skill discovery in smart speakers a fundamental UX puzzle: Alexa's audio-only interface is convenient until, for example, you expect it to recite its 80,000 skills to a user one by one.
There are 2 factors that make skill discovery especially challenging:
  • Availability: virtual assistants’ skillsets are rapidly expanding. Voicebot reports that since 2018, Google Assistant’s capabilities increased by 2.5 times to 4,253 actions and Alexa’s increased by 2.2 times to almost 80,000.
  • Affordances: users are unsure about what their virtual assistants are capable of, which leads to misaligned expectations, and few users turn to the internet to learn the full breadth of their assistant’s skills.
8 ways for virtual assistants to help people discover all they can do
In this article, Ryen W. White (Microsoft Research) lists 8 improvements to virtual assistants so that users can more easily discover new skills for their daily needs:
  • Be proactive: instead of reactive, user-initiated interaction, enable virtual assistants to proactively engage users with their skills.
  • Timing is everything: present users with skill suggestions during the moment-of-need to ensure that these capabilities are more likely to be remembered in the future.
  • Use contextual and personal signals: leverage a combination of the user’s contextual and personal signals, including long-term habits and patterns.
  • Examine additional signals: contextual and personal information that is not yet accessible, such as human activities observable only with vision, represents an untapped opportunity for personalized recommendations.
  • Consider privacy and utility: offer the right help at the right moment, and use recommendation explanations to show which permissioned data informed the suggestion.
  • Permit multiple recommendations: suggest multiple skills when the recommendation model’s confidence falls below the threshold at which a single definitive skill would be suggested.
  • Leverage companion devices: allow access to various screens through WiFi or Bluetooth connectivity, such as smartphones, tablets, or desktop PCs, to enrich context and assist in providing more-relevant skill suggestions.
  • Support continuous learning: suggest new skills based on previous patterns of activity.
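The "permit multiple recommendations" idea above can be sketched in a few lines: when the recommender's top score clears a confidence threshold, offer one definitive skill; otherwise, offer a short list of candidates. This is a hypothetical illustration of the pattern, not a real assistant API — the skill names, scores, and threshold values are all made up for the example.

```python
# Threshold-based skill suggestion: a minimal sketch of the
# "permit multiple recommendations" idea. All names and scores
# below are illustrative assumptions, not a real assistant API.
from typing import List, Tuple


def suggest_skills(scored: List[Tuple[str, float]],
                   definitive_threshold: float = 0.8,
                   max_suggestions: int = 3) -> List[str]:
    """Return one skill if the model is confident, else up to a few."""
    ranked = sorted(scored, key=lambda s: s[1], reverse=True)
    if not ranked:
        return []
    top_skill, top_score = ranked[0]
    if top_score >= definitive_threshold:
        # Confident enough for a single, definitive recommendation.
        return [top_skill]
    # Low confidence: surface several candidates and let the user pick.
    return [name for name, _ in ranked[:max_suggestions]]


# Example: no candidate clears the threshold, so several are offered.
scores = [("set_timer", 0.55), ("play_podcast", 0.30), ("weather", 0.10)]
print(suggest_skills(scores))  # ['set_timer', 'play_podcast', 'weather']
```

In a deployed system the scores would come from a model conditioned on the contextual and personal signals described above; the threshold trades off decisiveness against the risk of confidently suggesting the wrong skill.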
Alexa Conversations claims to utilize machine learning to predict a user's true goal from the dialogue and proactively enable the conversation flow across skills. (Credit: Alexa Blog)
Progress on improving skill discovery
Most recently, Amazon introduced Alexa Conversations, a deep learning approach that allows developers to improve skill discovery with less effort, fewer lines of code, and less training data. While it is still in “preview”, Alexa Conversations has already generated considerable excitement among developers who build skills for the smart speaker.
Essentially, Alexa Conversations aims to establish a more natural and fluid interaction between Alexa and its users within a single skill. In future releases, the software is expected to bring multiple skills into a single conversation. It also claims to be able to handle ambiguous references, such as “Are there any Italian restaurants nearby?” (near where?), as well as context preservation when transitioning from one skill to another, such as remembering the location of a certain movie theater when suggesting nearby restaurants.
At Amazon’s re:MARS AI and ML conference in June, Rohit Prasad, VP and head scientist at Alexa, mentioned that Alexa Conversations’ machine learning capabilities can help it predict a customer’s true intention and goal from the direction of the dialogue, thus proactively enabling flow across multiple skills during conversation. If these promises are met, the command-query interaction with Alexa will surely begin to feel more like a natural human interaction.
The Future: Seeing and embodied virtual assistants
The progress made by the Alexa team is surely exciting, but conversational AI is not the only area with room for improvement. At Embodied AI we endorse the integration of conversational AI and video understanding into an anthropomorphically embodied assistant brought about through the addition of a camera and screen to the existing speaker interface. 
As our understanding of both natural language processing and computer vision continues to advance, there is little reason to limit virtual assistants to audio. The recent release of Amazon Echo Show 5, Facebook Portal and Google’s Nest Hub Max, all of which come with a camera and a screen, already foreshadow the industry’s movement towards virtual assistants that one day can see and be seen. One could reasonably speculate that the big tech companies are working on visually-enabled and embodied virtual assistants to replace their smart speakers in the near future. It’s a natural extension of their existing product lines.
Benefits of virtual assistants with a camera, screen, and anthropomorphic embodiment include:
  • Multimodal I/O: instead of being restricted to audio, virtual assistants equipped with both speech I/O and video I/O are empowered with greater intelligence and a more engaging graphical user interface.
  • Improved skill discovery experience: leveraging computer vision captures contextual and personal signals currently untapped by audio-only devices, allowing the transition from user-initiated interaction to proactive assistance.
  • Companion instead of a servant: with digital, human-like bodies, virtual assistants will no longer be perceived as servants, but rather as companions. While this does not directly improve skill discovery, it enriches the overall virtual assistant experience.
Roland Memisevic, TwentyBN's CEO, believes that computer vision, by unlocking context awareness for virtual assistants, will shift the assistant paradigm from query-response to memory-infused companionship. (Credit: LDV Capital)
Roland Memisevic, TwentyBN’s CEO, envisions a future where our conversations with virtual assistants will, unlike with the current smart speakers, not feel like phone calls:
Embodied avatars will not necessarily need wake words but can consistently be there, see, and listen, especially when they are edge-powered and free of privacy concerns. Using computer vision to unlock context awareness for virtual assistants, we will shift the assistant paradigm from query-response to memory-infused companionship. Asking our future companions about what skills they have will feel as ludicrous as asking your best friend if they breathe oxygen.
“Hey Google, let’s wrap this up!”
Perhaps on another Friday evening, sometime in the near future, I will revisit my friends’ rooftop flat and discover a new virtual assistant, one that is not only well-versed in conversation, but also equipped with eyes for understanding context and identifying needs not captured by words. Perhaps it might even have a digitized human body, becoming a virtual friend, who shares and adds to the lively atmosphere of a mid-summer’s night in Berlin.
As I reach to take a sip from my glass of wine and discover that it is empty, I will hear my friends’ Google Assistant call from the corner, “I think we’re ready for another bottle of the white wine!”
AI Avatars
  • Rivalry: A name clash between Xiaomi’s Mimoji mobile avatar feature and Apple’s Memoji (VentureBeat)
  • Tips: When your brand should use virtual influencers…and when they shouldn’t (PR Week)
  • Disturbing: A new software tool creates realistic nude images of women (The Verge)
  • 🤔 Will robots really replace 47% of the jobs? Hint: it’s more nuanced than that. (The Economist)
  • Official: In a letter to Senator Chris Coons (D-Delaware), Amazon said that it keeps transcripts and voice recordings from Alexa conversations indefinitely, and only removes them if they’re manually deleted by users (CNET)
  • Trippy: Scientists have designed a robot that resembles a jellyfish and moves like one, too (Wired)
  • Cool: Researchers in Japan are using deep learning to teach tree branches to walk…what?!?! (IEEE Spectrum)
Thank you for reading!
Subscribe here to Embodied AI! Make sure to forward Embodied AI to your friends and colleagues and tweet about us. Cheers!
Written by Nahua, edited by David, Will, Moritz, and Isaac. Illustrated by Anny.
Embodied AI - The AI Avatar Newsletter

Embodied AI is the definitive virtual beings newsletter. Sign up for the monthly digest of the latest news, technology, and trends behind AI avatars, virtual beings, and digital humans.

Written with love by Twenty Billion Neurons, an AI startup based in Berlin and Toronto.
