Choosing the Right Speech Training Data: The Importance of Diverse Data
Why Diversity in Data is a Key Element for a Successful ASR Model
Automatic Speech Recognition (ASR) models are becoming increasingly advanced in their ability to understand and respond to human speech. However, their accuracy depends heavily on the data they are trained on. In fact, the relationship between the quality of the training data and the performance of the model is pretty clear-cut: what you put in, you get out.
Whether you’re building an ASR model from scratch or fine-tuning an existing one, quality data should include a balanced representation of voices across gender, age group, and race, drawn from many unique speakers. With the right mix of diverse data, your model will perform as envisaged: able to respond to the real world with all its variations.
In the following article, we’ll discuss:
- The basics of collecting speech training data.
- Why diversity in data is important.
- How data diversity contributes to reducing bias in AI.
- How to ensure diversity in training datasets.
Collecting Data for ASR Training
The key to successful and accurate ASR models is high-quality, validated training datasets. Depending on the intended use case of your model, you can choose to collect monologue speech datasets (one person speaking) or dialogue speech datasets (two people speaking to each other, or one person speaking to a machine, such as an Interactive Voice Response, IVR, system).
Within these two options, you can then collect scripted speech data, in which you provide a specific script that should be read, or spontaneous speech data, either guided by rules or scenarios, or completely off the cuff.
Whatever choices suit your ASR model, the speech data is then collected, transcribed, and validated, ready for use in ASR training. Read here for more information on speech data collection, and how we do it at DefinedCrowd.
Once you have decided which type of training data is required, it’s crucial to examine how you will ensure the data is of the highest quality. After all, garbage in results in garbage out. One major element of data quality is ensuring you are working with diverse datasets.
What is Diversity in Data and Why is it Important?
For an AI system to perform comparably across distinct classes of its user base, regardless of age, race, gender, location, or ethnicity, it needs to be trained on high-quality, diverse data.
Diverse data includes representation of all genders, age groups, accents, ethnicities, and any other factors that affect the way people speak.
Diversity in data should become a priority for companies training an ASR model for several key reasons:
- Brand reputation: models that are unable to understand and respond to all users will damage the company’s brand image.
- Customer retention: if customers feel they are being disregarded, they will go to a competitor.
- Customer acquisition: by using diverse datasets, companies are ensuring valued customer segments are not being disregarded.
- Ethics in AI: on a higher level, diverse datasets start to address the larger issue of AI bias in ASR models and beyond.
Bias in AI: Can It Be Fixed?
The presence of bias in Artificial Intelligence (AI) models is to be expected, since it is humans building the algorithms and systems, and all humans are prejudiced in some way, shape, or form. However, this doesn’t excuse us from the responsibility of making an effort to correct these inherent biases.
That said, just as in humans, removing bias in AI is a complex issue without a single solution or a quick fix. The conversation is nuanced and complicated. Understanding the types of biases present (and how to mitigate them) is vital to enacting change.
Bias arises when ASR models are built upon training datasets that are not representative of the population that will actually be using the tool. People who fall outside the represented category of users are then “effectively censored”, i.e. unable to speak to the device and be understood.
This is problematic for many reasons. Besides ethical implications, biased models will negatively affect a business’s bottom line. Put simply, people aren’t likely to use a product that doesn’t understand them.
Of course, the way we speak forms a large part of our identity. Forcing someone to change the way they speak in order to use a technology raises serious concerns about assimilation and what society deems as acceptable. It is up to public and private institutions to ensure that no one needs to change the way they speak in order to be understood.
How to Ensure Diversity in Data
As mentioned, bias won’t be eliminated quickly or easily. However, we can take the first steps towards achieving this goal by ensuring diversity in datasets, focusing in particular on the following areas:
Gender
Gender and sex bias in AI has many implications, from perpetuating stereotypes about traditional roles of men and women (one translation tool, for example, translates gender-neutral pronouns into “he is a doctor, she is a nurse”) to assigning emotions where they don’t exist. More consequentially, bias can cause AI systems to dismiss qualified female candidates for certain job roles or, based on the same symptoms, suggest a less serious diagnosis for a woman than for a man.
Gender bias in AI is perhaps made worse by the underrepresentation of women in the field. Research by LinkedIn and the World Economic Forum shows that women make up only 22% of AI professionals around the world.
Race and Ethnicity
Race is another real-world bias reflected in AI. Multiple studies have shown that ASR systems are less accurate for non-white voices. For example, a study from Stanford University showed that Black speakers were almost twice as likely to be misunderstood as white speakers, with word error rates of 35% and 19% respectively. This is due to the lack of representation of varieties of English, such as African American Vernacular English (AAVE), in training data.
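The word error rate (WER) quoted above is the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The following is a minimal sketch of how WER could be compared across speaker groups. The function names, the group labels, and the simple lowercase-and-split text normalization are assumptions made for illustration; this is not the method used in the study or by any particular toolkit.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions, and insertions
    needed to turn hyp_words into ref_words (Levenshtein distance)."""
    # prev[j] holds the distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) tuples.
    Returns {group: total word errors / total reference words}."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        # Assumed normalization: lowercase and split on whitespace.
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[group] += word_edit_distance(ref, hyp)
        words[group] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical transcripts, invented for illustration only.
samples = [
    ("group_a", "turn on the lights", "turn on the lights"),
    ("group_b", "turn on the lights", "turn on lights"),
]
print(wer_by_group(samples))
```

A gap between groups in this kind of report is the signal to look for. With the figures cited above, 35% divided by 19% is roughly 1.8, which is where "almost twice" comes from.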
Accents/Colloquialisms
Even within a single state or region, the way one part of the population pronounces certain words or frames sentences can vary greatly. This is why it’s important to ensure a diversity of accents is represented in your training data.
Data Diversity Through Crowdsourcing
The bottom line: data is no longer an obscure unit of information but is contextualized with the process and the agents who contributed to it.
“Crowdsourcing” data is a great way to improve the diversity of AI training datasets. Because contributors are known, specific groups can be actively targeted to optimize diversity and to train models that speak to everyone, everywhere.
At DefinedCrowd, we have made diversity a core pillar in our product offering. With our global crowd of over 500,000 contributors and market-leading workflow automations, we are able to provide the diverse training data required to fuel speech recognition, natural language processing (NLP), and computer vision technologies.
Besides our large, global crowd (who represent over 50 languages and dialects from over 70 countries in the world), we are currently using (or working to implement) algorithms to ensure diversity of data in the following areas:
- Gender: automatically ensuring a mix of gender representation in each dataset.
- Speaker uniqueness and consistency: ensuring speaker diversity and metadata consistency by automatically detecting different voices in a large dataset.
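Checks like the two above can be approximated on dataset metadata alone. The sketch below is an illustration under stated assumptions, not DefinedCrowd's actual algorithms, which are not described here: the record fields (`speaker_id`, `gender`), the function names, and the 30% threshold are all hypothetical. It also works only on labels, so it does not detect different voices from audio.

```python
from collections import Counter, defaultdict


def category_shares(records, field):
    """Share of recordings per value of a metadata field (e.g. gender)."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def underrepresented(records, field, min_share):
    """Values of `field` whose share of recordings is below min_share."""
    shares = category_shares(records, field)
    return sorted(v for v, share in shares.items() if share < min_share)


def inconsistent_speakers(records, field):
    """Speaker IDs that carry more than one value for `field`,
    which suggests a labeling error or a reused ID."""
    seen = defaultdict(set)
    for r in records:
        seen[r["speaker_id"]].add(r[field])
    return sorted(s for s, values in seen.items() if len(values) > 1)


# Hypothetical metadata, one record per recording.
records = [
    {"speaker_id": "s1", "gender": "female"},
    {"speaker_id": "s2", "gender": "male"},
    {"speaker_id": "s3", "gender": "male"},
    {"speaker_id": "s4", "gender": "male"},
    {"speaker_id": "s1", "gender": "male"},  # conflicts with the first record
]

print(underrepresented(records, "gender", min_share=0.3))
print(inconsistent_speakers(records, "gender"))
```

Shares here are counted per recording. Counting per unique speaker instead would answer a slightly different question (how many distinct voices each group contributes), and the appropriate threshold depends on the population the model is meant to serve.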
Diversity in Data is the Start
As we have seen, AI is capable of being a force for good or, if done wrong, a force for evil. Of the many ethical issues within AI, bias is high on the list. But it is a societal problem that is only reflected in the AI we build.
The real solution to bias within speech recognition software is a long way off, as it requires a prejudice-free society, something that is unlikely to ever exist. However, recognizing the problem and implementing strategies to correct human-inherent bias is a big step in the right direction.
For a larger discussion on diversity in data, tune in to the upcoming discussion with DefinedCrowd’s CTO João Freitas.