An artificial intelligence (AI) dataset that was reportedly removed over illegal content has resurfaced in a modified form. The organization that built the dataset says it has removed all problematic links. The case is drawing attention as the first instance of large-scale dataset remediation amid growing concerns about illegality and copyright in AI training data.
Ars Technica reported on the 31st that the non-profit organization LAION has released a new open-source dataset called 'Re-LAION-5B'. The dataset is reportedly a modified version of 'LAION-5B', which was taken down in December last year.
LAION-5B is known as one of the most popular datasets for training image-generation AI. It is particularly famous for having been used by Stability AI to train 'Stable Diffusion'.
However, in December last year, the Stanford Internet Observatory (SIO) revealed that the dataset contained 3,200 images suspected of being child sexual abuse material (CSAM). LAION immediately took the dataset down and stopped distributing it.
To clean the dataset, LAION worked with the Internet Watch Foundation (IWF) and the Canadian Centre for Child Protection (C3P), removing 2,236 links that matched images in the online safety agencies' databases.
LAION claims to have created "the first web-scale text-image pair dataset thoroughly cleaned of the CSAM links in question," and says it developed improved new systems to identify and remove illegal content in reconstructing the dataset.
It also sent a message to other dataset creators: "Current state-of-the-art filters are not reliable enough to guarantee protection against CSAM in web-scale data composition scenarios," it said, urging them to work with expert organizations.
The dataset has drawn considerable praise, along with notes of caution.
Human Rights Watch welcomed LAION's move, saying, "It is now up to the government to pass the Children's Data Protection Act to protect the online privacy of all children."
"Because of the practice of scraping the web indiscriminately, datasets can be improved on all fronts, including privacy, copyright, and illegal content," said Alex Champandard, co-founder of CreativeAI.
In particular, LAION aims to promote AI research by providing an open and transparent dataset, in contrast to closed models. Re-LAION-5B makes it easier for developers to understand and manage the contents of the dataset.
"Datasets should be improved over time as a collaborative effort, and should be subject to ongoing review by the broader community," LAION argued.
"If there is something problematic in the dataset, it is possible that the OpenAI model could produce problematic images, similar to The New York Times' claim that the model reproduced parts of its articles," said Xenia Jesspre, co-founder and chief scientific officer of OpenAI.
Reporter Im Dae-jun ydj@aitimes.com