Our vision model is an extension of classical image-based object detectors to videos. Similar to typical object detectors such as
YOLOv3, our vision model simultaneously localizes and classifies: it detects objects by putting bounding boxes around them and recognizes the object type of each bounding box (e.g. face, body).
Unlike image-based detectors, our model also understands the action being performed by each object. This means that it not only detects a human face and distinguishes it from a human body, but also determines, for instance, whether a face is “smiling” or “talking”, or whether a body is “jumping” or “dancing”. This is made possible by training our model on video data rather than static images.
Whereas image detectors only process the two spatial dimensions of a single frame, our model also processes the time dimension. On the architecture side, we achieve this by using 3D-convolutional layers instead of 2D-convolutional layers, which enables the network to extract the motion features needed to recognize the action performed by each object.
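As a rough illustration (a minimal sketch, not our actual backbone, with arbitrary layer sizes), the structural change is that the convolution kernel gains a time axis, so the extracted features depend on motion across frames as well as appearance within a frame:

```python
import torch
import torch.nn as nn

# A 2D convolution sees a single frame (channels, height, width);
# a 3D convolution sees a short clip (channels, time, height, width).
frame = torch.randn(1, 3, 112, 112)      # one RGB image
clip = torch.randn(1, 3, 16, 112, 112)   # 16 consecutive RGB frames

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)

spatial_features = conv2d(frame)         # (1, 64, 112, 112): appearance only
spatiotemporal_features = conv3d(clip)   # (1, 64, 16, 112, 112): appearance + motion
```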
This end-to-end approach provides us with a very effective tool for analyzing and understanding human actions. In total, our model is trained to detect over 1,100 types of human activities and hand gestures. Another approach that we experimented with consists of breaking localization and classification into two independent sub-tasks: first, we extract bounding boxes and crop around each object using an object detector; then, we classify the action performed in each crop using a second action recognition model.
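To make the contrast concrete, here is a hypothetical sketch of a single-pass design in which three prediction heads share one spatiotemporal feature map; the module names, toy backbone, 7×7 grid, and number of object types are illustrative assumptions, with only the ~1,100 action classes taken from the description above:

```python
import torch
import torch.nn as nn

class EndToEndDetector(nn.Module):
    """Hypothetical sketch, not the production model: one network predicts
    bounding boxes, object types, and action classes in a single forward pass."""

    def __init__(self, num_object_types: int = 2, num_actions: int = 1100):
        super().__init__()
        # Stand-in for a 3D-convolutional backbone; a real one would be much deeper.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((1, 7, 7)),  # collapse time, keep a coarse 7x7 spatial grid
        )
        # Three heads share the same spatiotemporal features (one prediction per grid cell).
        self.box_head = nn.Conv3d(64, 4, kernel_size=1)                    # box coordinates
        self.object_head = nn.Conv3d(64, num_object_types, kernel_size=1)  # face, body, ...
        self.action_head = nn.Conv3d(64, num_actions, kernel_size=1)       # smiling, jumping, ...

    def forward(self, clip):  # clip: (batch, 3, time, height, width)
        features = self.backbone(clip)
        return {
            "boxes": self.box_head(features),
            "objects": self.object_head(features),
            "actions": self.action_head(features),
        }

model = EndToEndDetector()
outputs = model(torch.randn(1, 3, 16, 112, 112))  # one pass covers everyone in the clip
```

Because all three predictions come out of the same forward pass, the inference cost does not depend on how many people appear in the clip; in the two-step variant, the action recognition network instead runs once per detected crop.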
Our experience tells us that the end-to-end approach has several benefits over this two-step approach:
- Simplicity: only one deep learning model to maintain.
- Multi-tasking / transfer learning: solving both tasks at once should make the model better on each individual task.
- Omniscience: the recognition of actions is not restricted to one individual in the scene; the model is aware of all the actions taking place at the same time.
- Constant complexity: it does not matter how many people are in the scene; multi-person recognition and action classification requires the same computation as for a single person.
- Speed: recognizing and classifying human actions simultaneously saves inference time, which makes it possible to run the vision model in real time on edge devices like the iPad Pro.