View profile

A trip inside a virtual being’s brain (with video demos)

A trip inside a virtual being’s brain (with video demos)

A trip inside an AI avatar's brain (Illustrated by Luniapilot)
A trip inside an AI avatar's brain (Illustrated by Luniapilot)
The photo below taken by VentureBeat’s Dean Takahashi at the first-ever Virtual Beings Summit two weeks ago says it all about this week’s issue of EmbodiedAI: “AI makes a virtual being possible”. Over the past few weeks, several of our readers reached out to me interested in learning more about the intelligence behind AI avatars, such as Millie.
“AI makes a virtual being possible” (Credit: Dean Takahashi @Virtual Beings Summit)
“AI makes a virtual being possible” (Credit: Dean Takahashi @Virtual Beings Summit)
This week, I am excited to take you on a trip deep inside the brain of TwentyBN’s AI avatar, Millie in collaboration with my colleague Guillaume Berger, TwentyBN’s principal AI researcher.
3 video demos are embedded below👇
The Brain of an AI Avatar
Millie’s brain is powered by a deep learning model trained on over 3 million proprietary video data collected by TwentyBN. Since our motto is to teach machines to perceive the world like humans, Millie’s brain has two major components, a vision model and an audio model, resembling how our brains process the sights and sounds we experience everyday.
Vision Model: Understanding human behavior
20BN Human Behavior Understanding Demo on Vimeo
20BN Human Behavior Understanding Demo on Vimeo
Our vision model is an extension of classical image-based object detectors to videos. Similar to typical object detectors such as YOLOv3, our vision model simultaneously localizes and classifies: it detects objects by putting bounding boxes around them and recognizes the object type of each bounding box (e.g. face, body, etc.). 
Unlike image-based detectors, our model also understands the action being performed by each object. This means that it not only detects and distinguishes a human face from a human body but also determines, for instance, if a face is “smiling” or “talking” or if a body is “jumping” or “dancing”. This is made possible by training our model on video data rather than static images.
Compared to image detectors, which only process two-dimensional images, our model can process the time dimension. On the architecture side, we achieve this by using 3D-convolutional layers instead of 2D-convolutional layers, which enables the neural network to extract the relevant motion features that are needed to recognize the action performed by each object.
This end-to-end approach provides us with a very effective tool for analyzing and understanding human actions. In total, our model is trained to detect over 1,100 types of human activities and hand gestures. Another approach that we experimented with consists of breaking localization and classification into two independent sub-tasks: first, we extract bounding boxes and crop around each object using an object detector; then, we classify the action performed in each crop using a second action recognition model.
Our experience tells us that the end-to-end approach has several benefits over this two-step approach:
  • Simplicity: only one deep learning model to maintain.
  • Multi-tasking / Transfer learning: solving both tasks at once should make the model better on each individual task.
  • Omniscience: the recognition of actions is not restricted to one individual in the scene; the model is aware of all the actions taking place at the same time.
  • Constant complexity: it does not matter how many people are in the scene. The computation required for multi-person recognition and action classification is as efficient as for one person.
  • Speed: recognizing and classifying human actions simultaneously saves inference time. This makes it possible to run the vision model in real-time on edge-devices like the iPad Pro.
Audio Model: Speech-to-Intent
20BN Speech-To-Intent Demo on Vimeo
20BN Speech-To-Intent Demo on Vimeo
Previously relying on cloud solutions, we now have our own at-the-edge audio model for the avatar to understand not only what people do but also what they say. Unlike a typical yet complicated pipeline (see graph below), we are inspired by how humans process speech and went for the most direct and end-to-end approach: speech-to-intent. In this approach, one does not need to listen and transcribe every single word to understand the meaning of another person. A variety of speech-to-intent approaches have been explored (such as this). Similar to the Jasper model, the architecture of our model is fully convolutional.
Roland, TwentyBN’s CEO, favors the speech-to-intent approach:
“We are an intent-capturing company so going straight to human intent is how we roll at TwentyBN. By developing in-house audio capabilities for our AI avatar, we are simply transferring our success in video understanding to speech recognition. This is possible thanks to the world’s largest crowd-acting platform we’ve built, which allows us to rapidly source intent-based data at scale.”
3 benefits of the speech-to-intent approach include:
  • Robustness: in our experiments with different approaches to speech recognition, speech-to-intent ends up being more robust against noises, accents, and variations in people’s pronunciation of words.
  • Light-weight: the neural network can easily run on small embedded devices in real time, such as a tablet and a smartphone.
  • Reactiveness: running the audio model on premise reduces latency drastically as compared to using cloud solutions, significantly improving the interaction and overall user experience with AI avatars like Millie.
20BN Millie Dialogue Demo on Vimeo (extremely low latency)
20BN Millie Dialogue Demo on Vimeo (extremely low latency)
SenseNet: an Audio-Visual Model
As we speak, TwentyBN’s researchers and engineers are working on a new deep learning model called SenseNet, which fuses our vision and audio models into a single audio-visual model for intent classification.
SenseNet is crucial in bringing virtual beings from highly restricted environments into “the wild” (think of a seeing Alexa in a crowded supermarket). AI, just like humans, will need to rely on visual cues, that reveal the origin of a spoken utterance or the context of a scene. Merging vision with audio is a necessary step to achieve high intent classification accuracy in a messy environment.
We hope you have enjoyed this trip inside the brain of a virtual being. Let us know if you have any comments or questions and make sure you share this article with your friends!
  • Announced at the first Virtual Beings Summit, virtual beings grants will help push AI avatar creation forward (VentureBeat)  
  • Your very own guide to the world of virtual beings and their impact - slight cost! (TechCrunch)
  • OpenAI is chasing Artificial General Intelligence with the help of Microsoft and $1 Billion (NY Times
  • There’s a new deepfake detector to help spot false videos (Mashable
  • An Israeli AI startup is using computer vision to perform automatic vehicle inspections (The Engineer)
  • Google’s Pixel 4 will have gesture recognition capabilities - see ya later touchscreen! (Wired
  • Alexa can now talk to you at your preferred speed (CNET)
  • AI is helping Toronto’s homeless get 24 hour support (Verdict)
  • Learn how Alexa is powering new voice-controlled video games (MIT Tech Review)
  • Alexa will soon be able to answer more complex questions (TechCrunch)
Thank you for reading!
Subscribe here to Embodied AI! Make sure to forward Embodied AI to your friends and colleagues and tweet about us. Cheers!
Written by Guillaume and Nahua. Edited by Roland, David, Antoine, and Moritz. Illustrated by Anny.
Did you enjoy this issue?
Embodied AI - The AI Avatar Newsletter

Embodied AI is the definitive virtual beings newsletter. Sign up for the monthly digest of the latest news, technology, and trends behind AI avatars, virtual beings, and digital humans.

Written with love by Twenty Billion Neurons, an AI startup based in Berlin and Toronto.

In order to unsubscribe, click here.
If you were forwarded this newsletter and you like it, you can subscribe here.
Powered by Revue