A major AI training data set contains millions of examples of personal data

The bottom line, says William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University and one of the coauthors, is that “anything you put online can [be] and probably has been scraped.”

The researchers found thousands of instances of validated identity documents—including images of credit cards, driver’s licenses, passports, and birth certificates—as well as over 800 validated job application documents (including résumés and cover letters), which were confirmed through LinkedIn and other web searches as being associated with real people. (In many more cases, the researchers didn’t have time to validate the documents or were unable to because of issues like image clarity.)

Many of the résumés disclosed sensitive information, including disability status, the results of background checks, birth dates and birthplaces of dependents, and race. When résumés were linked to individuals with online presences, researchers also found contact information, government identifiers, sociodemographic information, face photographs, home addresses, and the contact information of other people (like references).

Examples of identity-related documents found in CommonPool’s small-scale data set show a credit card, a Social Security number, and a driver’s license. For each sample, the type of URL site is shown at the top, the image in the middle, and the caption in quotes below. All personal information has been replaced, and text has been paraphrased to avoid direct quotations. Images have been redacted to show the presence of faces without identifying the individuals.

COURTESY OF THE RESEARCHERS

When it was released in 2023, DataComp CommonPool, with its 12.8 billion data samples, was the largest existing data set of publicly available image-text pairs, which are often used to train generative text-to-image models. While its curators said that CommonPool was intended for academic research, its license does not prohibit commercial use.

CommonPool was created as a follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. It draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022.

While commercial models often don’t disclose what data sets they are trained on, the shared data sources of DataComp CommonPool and LAION-5B mean that the data sets are similar, and that the same personally identifiable information likely appears in LAION-5B, as well as in other downstream models trained on CommonPool data. CommonPool’s researchers did not respond to emailed questions.

And since DataComp CommonPool has been downloaded more than 2 million times over the past two years, it is likely that “there [are] many downstream models that are all trained on this exact data set,” says Rachel Hong, a PhD student in computer science at the University of Washington and the paper’s lead author. Those would replicate similar privacy risks.

Good intentions are not enough

“You can assume that any large-scale web-scraped data always contains content that shouldn’t be there,” says Abeba Birhane, a cognitive scientist and tech ethicist who leads Trinity College Dublin’s AI Accountability Lab—whether it’s personally identifiable information (PII), child sexual abuse imagery, or hate speech (which Birhane’s own research into LAION-5B has found).
