Sep 17, 2024

Award-winning Research Explores Risks of Using Synthetic Data to Train Facial Recognition Technology

Two UC Berkeley School of Information doctoral students have won a Best Paper Award at this year’s ACM Conference on Fairness, Accountability, and Transparency (FAccT)

Held June 3–6 in Rio de Janeiro, the conference brought together a multidisciplinary group of researchers and practitioners interested in fairness, accountability, and transparency in socio-technical systems to investigate and tackle issues involving algorithmic systems fueled by big data.

At the conference, Information Science Ph.D. candidates Cedric Whitney and Justin Norman presented their research paper, which explores the risks of using synthetic data to train facial recognition technology. The paper was later recognized with a Best Paper Award, given to authors whose work represents groundbreaking research in their respective areas.

In the paper, the two address two important issues with using synthetic data to train large machine learning (ML) models: a high risk of false confidence in the resulting datasets, and the circumvention of consent for data usage. Large-scale data collection can be costly and complicated to run due to ethical and privacy constraints, so researchers have increasingly turned to synthetic data generation instead. This AI-generated data can be problematic because the generative models that produce it are often trained on biased data, and on the stereotypes embedded in that data, and so create similarly biased datasets of their own.
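To illustrate the mechanism (this sketch is ours, not from the paper), consider the simplest possible stand-in for a generative model: sampling from the empirical distribution of a skewed training set. The group labels and proportions below are hypothetical.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical training set: 80% of faces from group A, 20% from
# group B -- the kind of skew common in web-scraped face data.
training_groups = ["A"] * 800 + ["B"] * 200

# A minimal stand-in for a generative model: draw synthetic samples
# from the empirical distribution of the training data.
synthetic_groups = random.choices(training_groups, k=10_000)

print(Counter(training_groups))   # Counter({'A': 800, 'B': 200})
print(Counter(synthetic_groups))  # roughly 8,000 A / 2,000 B: the
                                  # skew is reproduced, not corrected
```

Real generative models are vastly more complex, but the underlying point holds: sampling from a model fit to biased data yields biased data unless something actively corrects for it.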

The paper explores the difficulty of creating unbiased synthetic data: tools that claim to do so make it easy to generate data that appears fair but is in reality biased. It begins by discussing the risks that became apparent when the authors built a synthetic dataset to evaluate facial recognition technologies. Chief among these is diversity-washing, in which a dataset superficially appears to address criticism regarding its distributions and representation, but nothing ensures that the generated data is actually diverse and representative of real people, much less harmless in the long run.
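A minimal sketch of why balanced-looking data can still be biased (a hypothetical example of ours, not drawn from the paper): a synthetic dataset can have perfectly balanced group counts while its joint distribution remains badly skewed, so that a model trained on it conflates group membership with another attribute.

```python
from collections import Counter

# Hypothetical synthetic dataset of (group, age_band) labels.
# Group counts are perfectly balanced -- the headline "diversity" stat.
data = ([("A", "young")] * 1000 + [("A", "old")] * 1000
        + [("B", "young")] * 1950 + [("B", "old")] * 50)

print(Counter(group for group, _ in data))
# Counter({'A': 2000, 'B': 2000})  <- the marginals look fair

print(Counter(data))
# Counter({('B', 'young'): 1950, ('A', 'young'): 1000,
#          ('A', 'old'): 1000, ('B', 'old'): 50})
# The joint distribution is skewed: group B is almost entirely young
# faces, so a model trained on this data confounds group with age.
```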

Additionally, using synthetic data allows researchers to bypass the need for data subjects to give their consent. The Federal Trade Commission (FTC) has required companies to delete machine learning models trained on improperly collected data as part of settlements over deceptive data collection, but AI-generated data makes it possible to circumvent this remedy and complicates government enforcement. As such, the use of synthetic data risks blunting a key component of US data protection and its downstream effects on AI harms.

Last updated: September 17, 2024