What is Video Annotation in Machine Learning?

Filipa P

25.05.2021

Powering computer vision for better machine learning models  

In the fast-moving world of sports, multi-million-dollar deals to secure top athletes are made every day, and teams invest heavily in recruiting future talent. Scouting that talent used to rely mostly on agents' intuition backed by basic statistics, but AI is changing that fast. Computer vision models now give teams information they have never had before, analyzing hours of video footage to break down past performance and even predict future performance. Teams progressive enough to adopt the technology are gaining a significant competitive edge.

To give machines the power to analyze and make sense of videos, machine learning teams train their computer vision models on annotated data. Video annotation is a type of data annotation that labels the frames of a video with predefined categories, which enables AI models to identify objects, track them across frames and even predict their behavior.

The potential applications of computer vision are numerous and go far beyond sports. Computer vision also teaches self-driving cars to stay in their lane (and on the road) and to recognize essential road objects such as traffic lights and pedestrians.

In manufacturing, computer vision is improving efficiency by detecting defects, counting stock, and tracking inventory in warehouses. In healthcare, it is helping clinicians detect and diagnose conditions such as cancer earlier than before, potentially improving quality of life and even saving lives.

But to be effective and achieve these astonishing results, computer vision needs annotations – and lots of them. Image and video annotations identify objects and other parts of the image or video frame, helping deep learning algorithms understand what they are seeing.   

Video annotation comes with its own challenges, the main one being volume: a video contains far more data than a single image. That volume is also an advantage, because properly annotated video can supply large amounts of training data for computer vision models.
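The following back-of-the-envelope arithmetic, written as a short Python sketch, puts that volume in perspective. The frame rate of 30 frames per second and the number of objects per frame are illustrative assumptions, and the function names are hypothetical. Real footage and projects vary widely.

```python
def frames_to_annotate(duration_s: float, fps: float) -> int:
    """Number of frames in a clip of the given length and frame rate."""
    return int(round(duration_s * fps))


def annotation_count(duration_s: float, fps: float, objects_per_frame: int) -> int:
    """Total labels needed if every object is annotated in every frame."""
    return frames_to_annotate(duration_s, fps) * objects_per_frame


# Assumed example: a 10-minute match clip at 30 fps with 22 players in view
print(frames_to_annotate(600, 30))    # 18000
print(annotation_count(600, 30, 22))  # 396000
```

A single ten-minute clip already runs into hundreds of thousands of labels when every visible object is annotated in every frame.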

In the following, we’ll take a closer look at the types of video annotation, applications in the real world, and why annotating video with a crowd is an effective way to train your computer vision model. 

Types of video annotation  

Depending on the final application, there are various ways that video data can be annotated. They include:  

2D & 3D Cuboid Annotations:
These annotations draw a cuboid, a box with depth, around an object, either on the flat 2D frame or in 3D space. This captures the object's size and orientation as well as its position in the image or video frame.
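A 3D cuboid is often stored as a center point, three dimensions and a rotation angle. The Python sketch below assumes that parameterization, with rotation only about the vertical axis (yaw), and recovers the eight corners an annotation tool would draw. `cuboid_corners` is a hypothetical helper and is not part of any specific tool's API.

```python
import math


def cuboid_corners(cx, cy, cz, length, width, height, yaw):
    """Return the 8 corners of a 3D cuboid.

    Assumes the cuboid is described by its center (cx, cy, cz), its
    dimensions, and a rotation `yaw` (in radians) about the vertical z-axis.
    """
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx in (-length / 2, length / 2):
        for dy in (-width / 2, width / 2):
            for dz in (-height / 2, height / 2):
                # Rotate the offset in the ground (x-y) plane, then translate
                x = cx + dx * cos_y - dy * sin_y
                y = cy + dx * sin_y + dy * cos_y
                corners.append((x, y, cz + dz))
    return corners


corners = cuboid_corners(0, 0, 1, length=4, width=2, height=2, yaw=0)
print(len(corners))  # 8
print(corners[0])    # (-2.0, -1.0, 0.0)
```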

Polygon Lines: 
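This type of video annotation traces an object's outline with a polygon of connected points, so the annotation covers only the pixels that belong to that object. It suits irregularly shaped objects that a rectangular box would fit poorly.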
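The minimal Python sketch below shows how a polygon determines which pixels belong to an object. It rasterizes a polygon into a binary mask using the even-odd (ray casting) rule, testing the center of each pixel. The helper names are hypothetical, and real annotation tools may treat pixels on the boundary differently.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd (ray casting) test: is point (x, y) inside the polygon?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_to_mask(polygon, width, height):
    """Rasterize a polygon into a binary mask by testing each pixel center."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]


triangle = [(0, 0), (4, 0), (0, 4)]
for row in polygon_to_mask(triangle, 4, 4):
    print(row)
# [1, 1, 1, 0]
# [1, 1, 0, 0]
# [1, 0, 0, 0]
# [0, 0, 0, 0]
```

Pixels whose centers fall exactly on the slanted edge count as outside under this convention.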

Bounding Boxes: 
Used for both images and video, bounding boxes are rectangles drawn tightly around the edges of each object in a frame.
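Bounding boxes are commonly stored as two corner coordinates. Intersection over union (IoU) is a standard way to compare two boxes, for example a model's prediction against a human label, or two annotators' labels for the same object. The minimal Python sketch below assumes `(x_min, y_min, x_max, y_max)` pixel coordinates. The `Box` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self) -> float:
        # Clamp at zero so an empty (inverted) box has no area
        return max(0.0, self.x_max - self.x_min) * max(0.0, self.y_max - self.y_min)


def iou(a: Box, b: Box) -> float:
    """Intersection over union: 1.0 for identical boxes, 0.0 for disjoint ones."""
    inter = Box(max(a.x_min, b.x_min), max(a.y_min, b.y_min),
                min(a.x_max, b.x_max), min(a.y_max, b.y_max)).area()
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


# Two annotators' boxes around the same player
print(round(iou(Box(10, 10, 50, 90), Box(12, 14, 52, 92)), 2))  # 0.84
```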

Semantic Segmentations and Annotations: 
Done at the pixel level, semantic segmentation is a precise form of annotation in which every pixel in an image or video frame is assigned to a class.
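In practice, a semantic segmentation of one frame can be stored as a label map that holds one class ID per pixel. The Python sketch below assumes a tiny 4 by 5 frame and a made-up set of classes. It counts the pixels assigned to each class and measures how often two segmentations of the same frame agree pixel by pixel.

```python
from collections import Counter

# Hypothetical class IDs for this sketch
CLASSES = {0: "background", 1: "road", 2: "car"}

# A 4x5 label map: every pixel holds exactly one class ID
label_map = [
    [0, 0, 0, 0, 0],
    [1, 1, 2, 2, 1],
    [1, 1, 2, 2, 1],
    [1, 1, 1, 1, 1],
]


def class_pixel_counts(mask):
    """Count how many pixels were assigned to each class."""
    counts = Counter(pixel for row in mask for pixel in row)
    return {CLASSES[c]: n for c, n in sorted(counts.items())}


def pixel_agreement(mask_a, mask_b):
    """Fraction of pixels on which two segmentations assign the same class."""
    pairs = [(a, b) for ra, rb in zip(mask_a, mask_b) for a, b in zip(ra, rb)]
    return sum(a == b for a, b in pairs) / len(pairs)


print(class_pixel_counts(label_map))  # {'background': 5, 'road': 11, 'car': 4}

second = [row[:] for row in label_map]
second[0][0] = 1  # a second annotator labels one background pixel as road
print(pixel_agreement(label_map, second))  # 0.95
```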

Landmark annotations: 
Most commonly used for facial recognition, landmark annotations mark specific points of interest in an image or video frame, such as the corners of the eyes or mouth, so they can be tracked.

Keypoint tracking:
A technique that predicts and tracks the location of a person or object by following a set of keypoints, such as a person's joints, from frame to frame. Together, these points capture the subject's pose and orientation.
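The simplified Python sketch below stores keypoint annotations for one person as named (x, y) points per frame. The keypoint names and coordinates are hypothetical. The sketch measures how far each point moved between two frames and makes a naive constant-velocity guess of where each point will be in the next frame. Production pose-tracking models are far more sophisticated.

```python
import math

# Hypothetical keypoint annotations for one person in two consecutive frames
frame_t0 = {"head": (100.0, 50.0), "left_wrist": (80.0, 120.0), "right_wrist": (130.0, 118.0)}
frame_t1 = {"head": (103.0, 54.0), "left_wrist": (86.0, 128.0), "right_wrist": (131.0, 118.0)}


def displacement(prev, curr):
    """Distance in pixels each keypoint moved between two frames."""
    return {name: math.dist(prev[name], curr[name]) for name in prev if name in curr}


def predict_next(prev, curr):
    """Constant-velocity guess of where each keypoint will be in the next frame."""
    return {
        name: (2 * curr[name][0] - prev[name][0], 2 * curr[name][1] - prev[name][1])
        for name in prev if name in curr
    }


print(displacement(frame_t0, frame_t1)["head"])  # 5.0
print(predict_next(frame_t0, frame_t1)["head"])  # (106.0, 58.0)
```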

Object Detection, tracking and identification: 
This annotation type makes it possible to detect objects, follow them across frames and identify them. For example, it can determine whether an item on a production line is correctly positioned or defective, as in quality control for food packaging.
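Tracking links detections of the same object across frames by giving them a shared ID. The Python sketch below is a deliberately simple baseline: a greedy tracker that matches each box to the box from the previous frame with the highest intersection over union (IoU) above a threshold. The box format, the 0.5 threshold and the function names are assumptions made for illustration.

```python
from itertools import count


def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def track(frames, threshold=0.5):
    """Greedy IoU tracker: give each box the ID of the best-overlapping,
    not-yet-matched box from the previous frame, or a new ID if no
    overlap reaches the threshold."""
    new_id = count(1)
    previous = {}  # track ID -> box in the previous frame
    results = []
    for boxes in frames:
        current = {}
        unmatched = dict(previous)
        for box in boxes:
            best_id, best_score = None, threshold
            for tid, prev_box in unmatched.items():
                score = iou(box, prev_box)
                if score >= best_score:
                    best_id, best_score = tid, score
            if best_id is None:
                best_id = next(new_id)
            else:
                del unmatched[best_id]
            current[best_id] = box
        results.append(current)
        previous = current
    return results


frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (51, 50, 61, 60)],
    [(2, 2, 12, 12), (100, 100, 110, 110)],
]
print([sorted(f) for f in track(frames)])  # [[1, 2], [1, 2], [1, 3]]
```

In the third frame, the new object gets a fresh ID because nothing in the previous frame overlaps it, and track 2 ends when its object disappears. A baseline this simple cannot recover an object after occlusion.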

In the Real World: Examples of Video Annotation and Use Cases  

Transportation:  
Beyond self-driving cars, video annotation supports computer vision applications across the transportation industry. From identifying traffic conditions to creating smarter public transportation systems, video annotation identifies cars and other objects on the road and shows how they interact.

Manufacturing: 
Within manufacturing, video annotation helps computer vision models with quality control tasks. AI can identify defects along a manufacturing line, leading to significant cost savings compared with manual inspection. Computer vision can also carry out fast safety checks, verifying that workers are wearing the right protective equipment and helping to identify malfunctioning machines before they become a safety hazard.

Sports industry:
Success in sports goes beyond who wins and who loses: the secret is knowing why. Teams and clubs across sports are using computer vision to produce more advanced analytics, analyzing past performances to predict future outcomes.

Video annotation helps train these computer vision models by identifying individual elements in a video, from the ball to each player on the field. Other applications in sports include broadcasting, analyzing crowd engagement and improving safety in high-speed sports such as NASCAR racing.

Security:
The main uses of computer vision in security revolve around facial recognition. When used carefully, facial recognition can verify identity in everyday situations, from unlocking a smartphone to authorizing financial transactions.

How to annotate video   

While there are plenty of tools that organizations can use to annotate video, tool-based annotation on its own is difficult to scale. Crowdsourcing is an effective way to obtain the large number of annotations needed to train a computer vision model, especially for video, given the vast amount of data it holds. In crowdsourcing, annotation jobs are divided into thousands of micro-tasks, which are completed by thousands of contributors.

Crowd video annotation works in a similar way to other crowdsourced data collection. Qualified members of the crowd are selected and invited to complete tasks within a collection job. The client chooses the type of video annotation needed from the list above. Crowd members then receive job instructions and complete tasks until enough data has been collected. Finally, the annotations are checked for quality.
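The hypothetical Python sketch below makes the micro-task idea concrete. It cuts a long video job into fixed-length frame ranges and assigns each range to more than one contributor, so their answers can later be compared. The chunk size, the round-robin assignment and the redundancy of two are assumptions made for illustration. The sketch does not describe DefinedCrowd's actual platform.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class MicroTask:
    """One unit of crowd work: annotate a range of frames in one video."""
    video_id: str
    start_frame: int  # inclusive
    end_frame: int    # exclusive
    contributor: str


def split_into_microtasks(video_id, total_frames, frames_per_task, contributors, redundancy=2):
    """Cut a video into fixed-size frame ranges and give each range to
    `redundancy` different contributors, assigned round-robin."""
    if redundancy > len(contributors):
        raise ValueError("redundancy cannot exceed the number of contributors")
    pool = cycle(contributors)
    tasks = []
    for start in range(0, total_frames, frames_per_task):
        end = min(start + frames_per_task, total_frames)
        # Consecutive picks from the cycle are distinct people
        for _ in range(redundancy):
            tasks.append(MicroTask(video_id, start, end, next(pool)))
    return tasks


tasks = split_into_microtasks("clip-001", 18000, 300, ["ana", "ben", "chen"])
print(len(tasks))  # 120
print(tasks[0])    # MicroTask(video_id='clip-001', start_frame=0, end_frame=300, contributor='ana')
```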

DefinedCrowd Quality

At DefinedCrowd, we use a series of metrics at the job level and the crowd level to guarantee quality data collection. With quality standards such as gold standard datasets, inter-annotator agreement, human-in-the-loop processes and qualification tests, we ensure that each crowd contributor is highly qualified to complete the job, and that each job produces the quality video annotation outputs needed.
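The minimal Python sketch below illustrates two of these metrics. It scores a contributor against a gold standard set, and it measures inter-annotator agreement with Cohen's kappa. Cohen's kappa is one widely used statistic that corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The item IDs and labels are made up. The sketch does not describe DefinedCrowd's own metric definitions.

```python
from collections import Counter


def gold_accuracy(answers, gold):
    """Share of gold-standard items that a contributor labeled correctly."""
    scored = [item for item in gold if item in answers]
    if not scored:
        return 0.0
    return sum(answers[item] == gold[item] for item in scored) / len(scored)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        # Both annotators used one identical class throughout: perfect agreement
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical quality-control labels from two annotators on ten frames
annotator_a = ["defect", "ok", "ok", "defect", "ok", "ok", "ok", "ok", "defect", "ok"]
annotator_b = ["defect", "ok", "defect", "defect", "ok", "ok", "ok", "ok", "ok", "ok"]
print(round(cohens_kappa(annotator_a, annotator_b), 2))  # 0.52

gold = {"f1": "defect", "f2": "ok", "f3": "ok"}
answers = {"f1": "defect", "f2": "defect", "f3": "ok", "f4": "ok"}
print(round(gold_accuracy(answers, gold), 2))  # 0.67
```

The two annotators agree on 8 of 10 frames, but because most frames are "ok", much of that agreement would happen by chance. For that reason, kappa is noticeably lower than the raw 80%.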

A Future for Computer Vision 

Computer vision is quickly making its mark across industries in new and unexpected ways, and we will likely come to rely on it at many moments throughout our day. To get there, however, we first need to train machines to see the world through human eyes.
