Building Inclusive Speech Technology with Diverse Data

Filipa P


Inclusive speech technology that is trained on diverse, accented speech data is the key to staying relevant in the voice recognition market.  

Three New Yorkers walk into a bar: one grew up in the Midwest in a Mexican family, another is a native Spanish speaker from Colombia, and the last is a New Yorker who spoke Castilian Spanish at home until high school. There’s no punchline here: they simply sit down and have a conversation in English.

As they speak, we observe major differences in the speech of each person. Geography, socioeconomic status, and ethnicity, among other factors, all cause variations in pronunciation, vocabulary, and other speech patterns.

Given those differences, what happens when each of them goes home to their voice assistant and makes a request in English? How well is each of their accents understood? And what are the consequences for those who aren’t understood? 

Those are essential questions for data scientists, developers, and other AI speech professionals as they work to create products that are inclusive, diverse, and free from biases caused by an accent gap.

Bridging the accent gap 

An accent gap is a type of algorithmic bias that occurs in voice recognition models trained without diverse, representative data. One example is a model trained exclusively on English speech data sourced from a single geographic and cultural background. This “accent gap” can frustrate users who fall outside that narrow definition of an English speaker (predominantly white, upper-class male speakers), and the result is a product that doesn’t meet the needs of a diverse market.

An accent gap can affect speech technology of all kinds. For example, one Washington Post study found that Amazon’s Alexa was 30% less likely to understand non-native English accents. In the same study, voice assistants from Google and other major competitors produced similar results.  

This means that to compete long-term in the voice recognition market, your model must understand accented speech. And when we say “models” we don’t just mean voice assistants. All models and devices that make up the Internet of Things (IoT), many of which use voice activation and recognition as part of their core offering, should be trained on diverse, representative, and bias-aware training data.  

By releasing a free Spanish-accented English speech dataset, DefinedCrowd aims to help AI professionals test whether their models present an accent gap for one specific group: non-native English speakers in the US whose native language is Spanish.
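One simple way to run that kind of test is to compare word error rate (WER) between a baseline group of speakers and an accented group. The Python sketch below is only an illustration under stated assumptions: it assumes you already have a reference transcript and your model's transcript for each recording, the group labels and helper names (`word_errors`, `wer_by_group`) are hypothetical, and the sample sentences are invented for the example rather than taken from any dataset.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming Levenshtein distance computed over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Aggregate WER per group from (group, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        err, n_words = word_errors(reference, hypothesis)
        errors[group] += err
        words[group] += n_words
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Invented transcripts for illustration only (not real model output).
samples = [
    ("baseline", "turn on the kitchen lights", "turn on the kitchen lights"),
    ("baseline", "what is the weather today", "what is the weather to day"),
    ("spanish_accented", "turn on the kitchen lights", "turn on the chicken lights"),
    ("spanish_accented", "what is the weather today", "what is the whether"),
]

rates = wer_by_group(samples)
# Absolute difference in WER between the two groups.
gap = rates["spanish_accented"] - rates["baseline"]
print(rates, gap)
```

A consistently higher WER for the accented group on comparable utterances would suggest an accent gap. A real evaluation would need far more recordings per group, and matched content across groups, before drawing conclusions.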

Spanish-accented data 

Within the United States, there are more than 37 million Spanish speakers, making Spanish the most spoken non-English language in the country. This number has grown by 233% since 1980, mostly due to immigration and organic population growth in certain regions of the US.

Spanish itself has many variations – there are approximately 577 million native Spanish speakers in the world, spread across 21 countries, each with their own distinct accents.  

As a result, addressing the accent gap in relation to Spanish accents is extremely complex and nuanced. Models must be trained on accented English taken from Spanish speakers all around the world, from a variety of Spanish-speaking countries.  

Free speech data from DefinedCrowd 

To continue the fight against this accent gap, DefinedCrowd is releasing a free speech dataset made up of data from Spanish-accented English speakers from all around the world.

Language is constantly evolving and shifting, adapting to its environment and its users. As a result, voice assistants and interactive voice response (IVR) models must evolve as well to stay relevant and competitive.

Let’s build AI that drops the outdated model of what American English is supposed to sound like, and instead focuses on what it does sound like.   

Claim your free dataset by registering on DefinedCrowd’s marketplace.