Machine Learning Essentials: What is Data Annotation?

Crystal Gilliam

20.04.2021

Data annotation helps machines make sense of text, video, image or audio data.  

One of the stand-out characteristics of Artificial Intelligence (AI) and machine learning technology is its ability to learn, for better or for worse, with each task it performs. It’s this continually evolving process that sets AI apart from static, code-dependent software.

And it’s precisely this ability that makes high-quality annotated data a crucial element in training representative, successful and bias-free AI models.  

Data annotation or data labeling is the process of labeling individual elements of training data (whether text, video, or images) to help machines understand what exactly is in that data. This annotated data is then applied during model training. 

Data annotation also plays a part in the larger quality control process of data collection, as well-annotated datasets become ground truth datasets: data that is held up as a gold standard and used to measure the quality of other datasets. 
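
To make the idea of a ground truth dataset concrete, here is a minimal sketch of scoring one annotator's labels against a gold standard. The function name `accuracy_against_gold` and the sample labels are made up for illustration; real quality checks usually involve more than a single accuracy figure.

```python
def accuracy_against_gold(labels, gold):
    """Return the fraction of items whose label matches the gold-standard label."""
    if len(labels) != len(gold):
        raise ValueError("labels and gold must have the same length")
    if not gold:
        raise ValueError("gold set is empty")
    matches = sum(1 for a, g in zip(labels, gold) if a == g)
    return matches / len(gold)


# Hypothetical data: gold labels and one annotator's labels for five items.
gold_labels = ["positive", "negative", "neutral", "positive", "negative"]
annotator_labels = ["positive", "negative", "positive", "positive", "negative"]

score = accuracy_against_gold(annotator_labels, gold_labels)
print(f"Agreement with gold standard: {score:.0%}")
```

Here the annotator disagrees with the gold standard on one item out of five, so the score is 0.8.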

Teaching Through Data 

The purpose of annotating data is to tell machine learning algorithms exactly what we want them to know. As James Whittaker, Evangelist at DefinedCrowd, points out in his manifesto, teaching a machine to learn through annotation can be likened to teaching a toddler shapes and colors using flashcards: the annotations are the flashcards, and the annotators are the teachers. Give both machines and toddlers the basics, and they’ll take it from there, applying the knowledge in new and sometimes unexpected ways.

Of course, this is a simplified version of how AI learns. Machine learning algorithms need large volumes of correctly annotated data to learn how to perform a task, which can prove a challenge in practice. Companies must have the resources and time to collect and label data for their specific use case, sometimes in obscure languages or unique and highly technical domains.

The following is a closer look at the different types of data annotation, how annotated data is used and why humans will continue to be an indispensable part of the data annotation process in the future.  

The Importance of Data Annotation 

Data annotation is especially important when considering the amount of unstructured data that exists in the digital world, in the form of text, images, video and audio. By most estimates, unstructured data accounts for anywhere from 80-90% of the data out there.  

Currently, most models are trained via structured or supervised learning, which relies on well-annotated data from humans to create training examples. It is this lack of structured training data that DefinedCrowd aims to address.

Types of Data Annotation 

Because data comes in many different forms, there are several different types of data annotation, covering text, image and video-based datasets. Here is a breakdown of each of these three types of data annotation.

The Written Word: Text Annotation  

There is an incredible amount of information within any given text dataset. Text annotation is used to segment the data in a way that helps machines recognize individual elements within it. Types of text annotation include:  

Named Entity Tagging: Single and Multiple Entities:

Named Entity Tagging or Named Entity Recognition helps identify individual entities within blocks of text, such as “person”, “sport” or “country”.

This type of data annotation creates entity definitions, so that eventually a machine learning algorithm will always recognize that “Saint Louis” is a city, “Saint Patrick” is a historical figure, and “Saint Lucia” is a tropical island in the Caribbean.  
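
As a rough illustration, entity annotations are often stored as character offsets plus a label. The sketch below assumes a hypothetical `EntitySpan` record and a small hand-written lookup table (`ENTITY_LABELS`); the lookup only stands in for a human annotator marking the spans, and the label names are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitySpan:
    start: int   # index of the first character of the entity
    end: int     # index one past the last character
    label: str   # entity type chosen by the annotator


def tag_entities(text, entity_labels):
    """Mark every occurrence of each known phrase with its label."""
    spans = []
    for phrase, label in entity_labels.items():
        pos = text.find(phrase)
        while pos != -1:
            spans.append(EntitySpan(pos, pos + len(phrase), label))
            pos = text.find(phrase, pos + len(phrase))
    return sorted(spans, key=lambda s: s.start)


# Hypothetical entity definitions, standing in for an annotator's decisions.
ENTITY_LABELS = {
    "Saint Louis": "CITY",
    "Saint Patrick": "PERSON",
    "Saint Lucia": "ISLAND",
}

text = "Saint Patrick never visited Saint Louis or Saint Lucia."
spans = tag_entities(text, ENTITY_LABELS)
for span in spans:
    print(text[span.start:span.end], "->", span.label)
```

Storing offsets rather than the matched strings keeps each annotation tied to an exact position in the text, which matters when the same phrase appears more than once.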

Sentiment Tagging:  

Humans use language in unique and varying ways to express thoughts, and sentences or phrases can’t always be taken at face value. It’s necessary to read between the lines or consider context to understand the sentiment behind a phrase, which is why sentiment tagging is crucial to allowing machines to decide whether a selected text is positive, negative or neutral.

In many cases, the sentiment of a sentence is clear: for example, “Super helpful experience with the customer support team!” is clearly positive. However, when the intent is less straightforward, or when sarcasm or other ambiguous speech is used, it becomes more difficult to discern the true meaning, as in “Great reviews for this place, but I can’t say I agree!”. This is where human annotation adds real value.
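
One common way to combine several annotators' sentiment labels is a majority vote, with ties set aside for review. The sketch below is an assumption about how that might look, not a description of any particular vendor's workflow; the function name `aggregate_sentiment` and the `"needs_review"` marker are hypothetical.

```python
from collections import Counter


def aggregate_sentiment(votes):
    """Return (label, agreement) for a list of annotator votes.

    agreement is the share of annotators who chose the most common label.
    If two labels tie for first place, the label is "needs_review".
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    agreement = top_count / len(votes)
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return "needs_review", agreement
    return top_label, agreement


# Hypothetical votes for a clear sentence and for an ambiguous one.
clear_votes = ["positive", "positive", "positive"]
ambiguous_votes = ["positive", "negative", "negative", "positive"]

clear_result = aggregate_sentiment(clear_votes)
ambiguous_result = aggregate_sentiment(ambiguous_votes)
print(clear_result)
print(ambiguous_result)
```

Low agreement is a useful signal in its own right: it points to the sarcastic or ambiguous sentences that need a closer look.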

Semantic Annotation:  

The intent or meaning of words can vary greatly depending on the context and within specific domains. Domain-specific jargon used in a technical conversation within the finance industry is very different from slang used between two friends on social media. Semantic annotation gives that extra context that machines need to truly understand the intent behind the text.  

More than Meets the Eye: Image Annotation

Image annotation helps machines understand what elements are present within an image. This can be done by using Image Bounding Boxes, in which elements of an image are labeled with basic bounding boxes, or through more advanced object tagging. 
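
A bounding box annotation can be sketched as a label plus the coordinates of two corners. The example below also computes intersection over union (IoU), a widely used overlap measure that can show how closely two annotators' boxes agree. The `BoundingBox` class and the coordinates are hypothetical, and IoU is offered here as one common option rather than a required step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def area(self):
        width = max(0.0, self.x_max - self.x_min)
        height = max(0.0, self.y_max - self.y_min)
        return width * height


def iou(a, b):
    """Intersection over union of two boxes: 0.0 means no overlap, 1.0 means identical."""
    inter_w = min(a.x_max, b.x_max) - max(a.x_min, b.x_min)
    inter_h = min(a.y_max, b.y_max) - max(a.y_min, b.y_min)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    intersection = inter_w * inter_h
    union = a.area() + b.area() - intersection
    return intersection / union


# Hypothetical boxes drawn by two annotators around the same object.
box_a = BoundingBox("car", 0, 0, 10, 10)
box_b = BoundingBox("car", 5, 0, 15, 10)

overlap = iou(box_a, box_b)
print(f"IoU between the two annotations: {overlap:.2f}")
```

The two boxes share half of their width, so the intersection is 50 square units, the union is 150, and the IoU is one third.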

Annotations in images can range from simple classifications (labeling the gender of people in an image, for example) to more complex details (for example, labeling if weather in a scene is rainy or sunny). 

Image classification is another approach, in which whole images are annotated according to single or multi-level categories. For example, images of mountains would be classified into one “Mountain” category.

Movement Detected: Video Annotation

Video annotation works in similar ways to image annotation: using bounding boxes and other annotation methods, single elements within frames of a video are identified, classified, or even tracked across multiple frames. Examples include tagging all the humans in a CCTV video as “Customer” or helping autonomous vehicles recognize objects along the road.
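
Tracking an element across frames usually means giving it the same identifier in every frame where it appears. The following sketch assumes a simple, hypothetical record layout of frame index, track ID, label and box coordinates, and groups the records by track so that each object's path through the video can be read off.

```python
from collections import defaultdict

# Hypothetical per-frame annotations:
# (frame_index, track_id, label, (x_min, y_min, x_max, y_max))
annotations = [
    (0, 1, "Customer", (10, 20, 50, 120)),
    (0, 2, "Customer", (200, 30, 240, 130)),
    (1, 1, "Customer", (14, 20, 54, 120)),
    (2, 1, "Customer", (18, 21, 58, 121)),
]


def group_by_track(records):
    """Collect annotations per track ID, ordered by frame index."""
    tracks = defaultdict(list)
    for frame, track_id, label, box in sorted(records):
        tracks[track_id].append((frame, label, box))
    return dict(tracks)


tracks = group_by_track(annotations)
frames_for_track_1 = [frame for frame, _, _ in tracks[1]]
print("Track 1 appears in frames:", frames_for_track_1)
```

In this made-up data, the person with track ID 1 is visible in three consecutive frames, while track ID 2 appears only in the first.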

Important Notes on Data Annotation 

Human vs. Machine  

While some data annotation can now be automated, the human-in-the-loop paradigm for data annotation is still the default, and humans play an integral role in ensuring that data is annotated properly. Humans can provide context and a deeper understanding of intent, adding overall value to the annotations.
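
One way a human-in-the-loop setup can work is to accept automated labels only when the model is confident and send everything else to a person. This is a minimal sketch under that assumption; the function `route_predictions`, the 0.9 threshold and the sample predictions are all hypothetical.

```python
def route_predictions(predictions, threshold=0.9):
    """Split model predictions into auto-accepted labels and items for human review.

    predictions is a list of (item_id, label, confidence) tuples.
    """
    auto_accepted = []
    needs_human = []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            auto_accepted.append((item_id, label))
        else:
            needs_human.append(item_id)
    return auto_accepted, needs_human


# Hypothetical model output for three images.
predictions = [
    ("img-1", "Mountain", 0.97),
    ("img-2", "Mountain", 0.62),
    ("img-3", "Beach", 0.91),
]

auto_accepted, needs_human = route_predictions(predictions)
print("Accepted automatically:", auto_accepted)
print("Sent to human annotators:", needs_human)
```

The threshold is a trade-off: raising it sends more items to human annotators, which costs time but reduces the chance of a wrong label slipping through.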

In-House vs. Outsourcing

Data annotation is essential, but also resource-heavy and time-consuming. One report showed that data preparation and engineering tasks represent over 80% of the time spent on most machine learning projects. Organizations are often faced with a decision: perform data annotation in-house or outsource it?

There are some advantages to performing data annotation in-house. For one, you retain control and visibility over the data collection process. Secondly, with very niche or technical models, subject matter experts with the relevant knowledge may already be in-house.  

However, outsourcing data annotation to a third party is an excellent solution to some of the biggest challenges that come up when doing data annotation in-house, namely time, resources and quality. Third-party data annotation can help reach the scale, speed and quality needed to create effective training datasets, while complying with increasingly complex data privacy rules and requirements.  

Making Your Machine Smarter 

Data annotation is a key element of the data collection process and essential to helping machines reach their full potential. Machine learning models are certainly ready to take off on their own, but first, they need data annotation to show them the way.  

To learn more about data annotation at DefinedCrowd, have a look here.