How our problem with benchmarking is creating biased Artificial Intelligence
Popular datasets produced by a cadre of elite institutions dominate machine learning research, and the year-over-year trend is toward less diversity, according to a new UCLA and Google Research paper
As the focus on AI ethics mounts, the datasets we use as standards to measure the quality of AI training data are coming under increasing scrutiny.
A new paper from the University of California, Los Angeles and Google Research has found that machine learning research is dominated by a handful of popular benchmark datasets (BMDs), most of which originate from a small cadre of elite academic, business, and government institutions. While this unofficial industry-wide standardization of BMDs may be beneficial for scientific progress, the paper’s authors argue their trending popularity and lack of diversity raise a series of ethical, social, and even political concerns for the development of AI and machine learning (ML).
The study was conducted on the Facebook-backed Papers With Code, and analyzed the dataset usage patterns of machine learning subcommunities dedicated to specific tasks (e.g., computer vision, Natural Language Processing, etc.), as well as the distribution of datasets used to benchmark models in those communities and their institutional origins.
Using the Gini coefficient, a metric traditionally used to measure income inequality within a population, the authors found that from 2015 to 2020 the concentration of usage among these popular datasets increased year over year, from roughly 0.6 to 0.75, where higher values imply a less diverse spread. This increase comes at the exclusion of datasets that may be specialized or better suited to testing certain models but are less well known and regarded, reducing incentives to use them and naturally strengthening the influence of the BMDs and the institutions that created them.
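To make the measurement concrete, the concentration the authors report can be computed with the standard Gini formula over usage counts. The sketch below is a minimal illustration with made-up counts, not the paper’s actual code or data:

```python
def gini(counts):
    """Gini coefficient of a distribution of usage counts.

    Returns 0.0 when usage is spread perfectly evenly across datasets;
    values approaching 1.0 mean a few datasets dominate.
    """
    x = sorted(float(c) for c in counts)
    n = len(x)
    total = sum(x)
    # Standard formula over the sorted values:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * xi for i, xi in enumerate(x, start=1))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n

# Hypothetical counts: how often each of five benchmark datasets
# is used by papers within one task community.
even_spread = [10, 10, 10, 10, 10]
concentrated = [46, 1, 1, 1, 1]

print(round(gini(even_spread), 2))   # 0.0  (perfectly even usage)
print(round(gini(concentrated), 2))  # 0.72 (one dataset dominates)
```

A community whose papers nearly all benchmark against one dataset produces a Gini value like the 0.75 figure the study reports, even though many alternative datasets exist.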
Similarly, the set of institutions most responsible for producing the current crop of BMDs has become less diverse (with the corresponding Gini coefficient rising from roughly 0.65 to 0.83 over 2015–2020), and in 2021 includes only 12: Stanford, Microsoft, Princeton, Max Planck, Google, the Chinese University of Hong Kong, AT&T, the Toyota Technological Institute at Chicago, New York University, Georgia Tech, UC Berkeley, and Facebook.
How did we get to this point?
According to the paper’s authors, the AI/ML community’s push toward standardization grew out of a need for easily agreed-upon, universal metrics of progress, making it easier to push the technology forward. Following the “AI Winter” of the 1980s, when “government funders sought to more accurately assess the value received on grant,” popular and widely used benchmark datasets gave ML projects a way to “formalize a particular task through a dataset and an associated quantitative metric of evaluation.”
The benefits of standardizing benchmarking and key datasets are three-fold:
- Barriers to ML participation were lowered as it was (and remains) costly to collect, annotate, and curate datasets
- Boiling down the complexities of ML to a set of agreed-upon metrics allows ML communities to “easily align on the value of research contributions and assess whether progress is being made on a particular task”
- It has allowed ML researchers to “relax reliance on slower institutions for evaluating progress like peer-review, qualitative or heuristic evaluation, or theoretical integration”
Together, these benefits have allowed ML research to become what the authors call a “rapid discovery science,” where progress comes as fast as any given model can be iterated on and evaluation reduces to an easily interpretable set of metrics.
The benefits and caveats of “State of the Art”
Most data scientists and machine learning engineers are familiar with the term “State of the Art” (SOTA), and with how that phrase both commands attention and dictates model popularity within the community. Status is conferred on models that set or perform well relative to the current SOTA, as well as on the teams that create them. Without the adoption of a benchmark framework and the open-source datasets by which to evaluate new models, or without the motivating incentives of “scientific progress and renown,” the AI/ML community wouldn’t have as stable and potent an organizing principle guiding development, and arguably wouldn’t have achieved much of the past decade’s gains.
However, we must be wary that achieving SOTA metrics doesn’t become an end unto itself, warping both its motivations and usefulness. As the paper’s authors indicate, “SOTA-chasing” results in issues unique to ML algorithms, where models can often be “right for the wrong reason,” an issue sometimes complicated further when SOTA-optimization tricks or shortcuts are employed, or if the BMD utilized is modified in some way that might make it better for a related task while justifying the continued use of a particular metric.
A large portion of the paper analyzes a kind of task drift, or the use of BMDs designed for a specific task on models built to solve tasks that are only tenuously related – e.g., benchmarking an image generation model on image recognition datasets. In these instances, questioning the validity of the metrics produced is justified – do this model’s metrics mean the same thing as the metrics of previous models benchmarked in this fashion? Are they even evaluating for the same problems, and if not, can a model evaluated in such a way really be assumed to be SOTA or comparable?
Over-reliance on SOTA metrics can also lead to a narrowing of vision, often to the detriment of other forms of quantitative or qualitative analysis that may be more appropriate to the model being judged, precluding novel ideas and approaches. As in the above example, wouldn’t it be better to evaluate models on their own merits and the specific problem space they operate in? (It almost certainly would, however, there are tradeoffs and costs, as we’ll see below.)
Although SOTA and BMD standardization has helped propel AI/ML to its modern heights, as the previous two cases show, it comes at the price of reflection and innovation. This is only further complicated by the unspoken and perhaps unintentional narrowing of research priorities by the institutions responsible for producing these BMDs. They’re not immune to the lure of status conferred by the culture of SOTA, and they’re often the best resourced and most capable of producing widely used open-source datasets. If their datasets become the only accepted way to judge models, who’s to say they aren’t asserting their own agendas and priorities for the community? This aligns uncomfortably with concerns over situations where corporate and government interests intermingle with – and often fund – research.
Long a topic in the AI ethics space, training on biased or unrepresentative data naturally leads to models maintaining and propagating those biases, potentially resulting in outputs that harm people or violate their rights. One of the more common bias concerns is unrepresentative data that overwhelmingly captures and expresses a particular demographic’s culture or worldview at the expense of all others – most particularly that of the “white, male, Western,” as the paper’s authors note.
Given that 11 of the 12 top BMD producers are nominally Western in origin, this raises the question of whether such bias is baked into the BMD framework, and whether all of AI/ML is perpetuating it by continuing to benchmark with just a handful of datasets from an even smaller handful of mostly Western institutions.
In combination with the impulse to build models that benchmark well, this concern becomes more urgent when considering how the BMDs were compiled. As a potent example, the exceedingly popular image recognition dataset ImageNet has been discovered to contain many flawed annotations – many of which reflect sexism and racism on the part of its annotators. These deep-rooted biases are the result of poor moderation in the data collection and annotation phases, which in ImageNet’s case were crowdsourced through the Amazon Mechanical Turk platform.
These biases can lead to disastrous, life-altering outcomes for those affected. As a well-known example, facial recognition models used in law enforcement, optimized for the handful of facial recognition BMDs that exist – many of which contain unrepresentative distributions of gender and race – run the risk of being “overfit” to those biases. Just because a model scores well by one metric doesn’t mean that the model is suitable for its intended purpose. Not only does it prove harmful to the underrepresented populations in the data, but it throws the notion of “progress” into question. Are we making progress if we continue to test by imperfect standards? Is there a better way?
Where do we go from here?
Despite the benefits of consistency and credibility that BMDs and SOTA provide, the authors have a point: their widespread adoption and vanishingly few producers “shape the types of questions that get asked and the algorithms that get produced… Current benchmarking practices offer a mechanism through which a small number of elite corporate, government, and academic institutions shape the research agenda and values of the field.”
That status quo is unsustainable. It’s also not the future that the AI/ML community envisions for this technology. While there’s little sign of these practices changing any time soon, it’s important that the authors have formalized these issues for the wider AI/ML community in this paper. Doing so helps us see the implications of letting BMD diversity remain stagnant as the AI/ML field progresses, putting the impetus upon us all to live our principles and act. So, where can – or rather where must – we go now?
We do all that we can, where we can, as every little bit will help move us away from this unsustainable status quo. A potential fix proposed by the paper’s authors is developing alternate forms of evaluation, particularly ones that go beyond metrics and incorporate qualitative assessment and/or peer review. It’s a good and necessary solution that needs to be worked on now but will likely take time to develop and be accepted by the community. Another, potentially more immediate solution would be to build datasets specific to the models and problems they’re attempting to solve.
As any data scientist will attest, getting good data is a non-trivial task, and building a large, widely used dataset that can be open-sourced is expensive. On the other hand, is continuing to use a handful of open-source datasets – especially if they’re flawed or not built for the express purpose of a given model’s problem domain – really the best option? Beyond deep learning’s endless appetite for ever more data, perhaps the solution isn’t to gather more flawed data. Perhaps it’s to use better data – “better” here meaning data more specific to our models’ tasks and of higher quality and representativeness.
Complex issues, as always, call for nuanced and incremental fixes, but they’re fixes we need to commit to and get started on now. As the classic adage goes, “anything worth doing is going to be difficult,” and it’s Defined.ai’s position that a future of equitable and ethical AI is worth the effort.
At Defined.ai we’re experts in ethically collecting data via our crowd platform, comprising thousands of vetted expert data professionals, and in providing our clients with fully transparent, high-quality data specific to their AI/ML needs. As far as data products and services go, we place ethics, transparency, and quality at the center of everything we do.
We’re doing our part to support ethical AI development by providing the high-quality, unbiased data necessary to train and benchmark models, and we provide the metadata to prove it. We also offer services such as Evaluation of Experience that help you determine the quality of your models for your intended audience, ensuring that you not only have your model metrics to fall back on, but also feedback from real people.