In the last issue of Embodied AI, we argued in favor of transforming audio-based virtual assistants, such as Alexa, into AI-powered avatars for ease of skill discovery and more humanlike interactivity. In short, start by equipping Alexa and Siri with eyes on a screen.
We were therefore delighted to find that both Boris Katz, a principal researcher at MIT who helped invent virtual assistants, and Rohit Prasad, head scientist of Alexa, share similar views on the current limitations of virtual assistants, namely the lack of common sense and situational awareness, and on the important role of eyes for virtual assistants.
“Incredible progress…incredibly stupid”
That is quite harsh, but it is how Katz describes Alexa, Siri, and other virtual assistants in his interview with Technology Review’s Will Knight, where he voices a conflicted feeling of pride and embarrassment. On the one hand, Katz is proud of the progress and adoption of virtual assistants. On the other hand, he thinks these programs are “incredibly stupid”.
To be fair, Alexa and its peers are not stupid: they are a feat of software engineering with tremendous potential for improvement. But Katz’s candid opinions yield three important takeaways. First, Katz doubts that training models on huge amounts of data will solve language understanding. Second, language understanding should not be isolated from other modalities such as visual, tactile, and other sensory inputs. Third, common sense and intuitive physics are essential for virtual assistants.
Alexa Needs Eyes
While Alexa can quickly access an encyclopedia-like knowledge base to respond to simple commands, that hack can only go so far. Prasad’s opinion is that “[the] only way to make smart assistants really smart is to give it eyes and let it explore the world.”
Recent news suggests that Amazon has already created versions of Alexa with a camera and is betting on home robotics for “mobile Alexa”. This is really exciting news. However, the adjacent possible, our favorite framework, suggests that robotics will need many more years before it adds concrete value for users.