Synthetic imagery sets new bar in AI training efficiency


Data is the new soil, and in this fertile new ground, MIT researchers are planting more than just pixels. By using synthetic images to train machine learning models, a team of scientists recently surpassed results obtained from traditional “real-image” training methods.

At the core of the approach is a system called StableRep, which doesn’t just use any synthetic images; it generates them through ultra-popular text-to-image models like Stable Diffusion. It’s like creating worlds with words.

So what’s in StableRep’s secret sauce? A method called “multi-positive contrastive learning.”

“We’re teaching the model to learn more about high-level concepts through context and variance, not just feeding it data,” says Lijie Fan, MIT PhD student in electrical engineering, affiliate of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and lead researcher on the work. “When multiple images, all generated from the same text, are treated as depictions of the same underlying thing, the model dives deeper into the concepts behind the images, say the object, not just their pixels.”

This approach considers multiple images spawned from identical text prompts as positive pairs, providing additional information during training, not only adding more diversity but specifying to the vision system which images are alike and which are different. Remarkably, StableRep outshone the prowess of top-tier models trained on real images, such as SimCLR and CLIP, on extensive datasets.
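To make the idea concrete, below is a minimal sketch of a multi-positive contrastive loss in PyTorch, assuming hypothetical image embeddings `z` and a `caption_ids` vector recording which prompt each synthetic image came from. It illustrates the general idea described above under those assumptions; it is not the authors’ implementation.

```python
# Minimal sketch of multi-positive contrastive learning (illustrative only).
# Images generated from the same caption are treated as positives of one another.
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(z, caption_ids, temperature=0.1):
    """z: (N, D) image embeddings; caption_ids: (N,) id of the prompt each image came from."""
    z = F.normalize(z, dim=1)                       # work in cosine-similarity space
    logits = z @ z.t() / temperature                # pairwise similarities
    # Suppress self-similarity with a large negative value so softmax ignores it.
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, -1e9)

    # Positive pairs: images sharing a caption id (excluding an image with itself).
    positives = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~self_mask
    targets = positives.float()
    targets = targets / targets.sum(dim=1, keepdim=True).clamp(min=1)  # soft multi-positive targets

    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()

# Example: six images, two per caption, so each image has exactly one positive.
z = torch.randn(6, 128)
caption_ids = torch.tensor([0, 0, 1, 1, 2, 2])
loss = multi_positive_contrastive_loss(z, caption_ids)
```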

“While StableRep helps mitigate the challenges of data acquisition in machine learning, it also ushers in a stride towards a new era of AI training techniques. The capacity to produce high-caliber, diverse synthetic images on command could help curtail cumbersome expenses and resources,” says Fan.

The process of data collection has never been straightforward. Back in the 1990s, researchers had to manually capture photographs to assemble datasets for objects and faces. The 2000s saw individuals scouring the internet for data. However, this raw, uncurated data often contained discrepancies when compared to real-world scenarios and reflected societal biases, presenting a distorted view of reality. The task of cleansing datasets through human intervention is not only expensive, but also exceedingly difficult. Imagine, though, if this arduous data collection could be distilled down to something as simple as issuing a command in natural language.

A pivotal aspect of StableRep’s triumph is the adjustment of the “guidance scale” in the generative model, which ensures a delicate balance between the synthetic images’ diversity and fidelity. When finely tuned, synthetic images used in training these self-supervised models were found to be as effective as, if not more so than, real images.
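In practice, the knob being tuned is the classifier-free guidance scale exposed by text-to-image pipelines. The sketch below, assuming the Hugging Face diffusers library and Stable Diffusion v1.5, shows how several images can be sampled from a single caption at a chosen guidance scale; the value of 2.0 here is only an illustrative placeholder, since the best setting depends on the model and the training goal.

```python
# Illustrative only: sampling several synthetic images per caption with diffusers.
# guidance_scale trades off fidelity (higher) against diversity (lower) and must
# be tuned for the downstream representation-learning task.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

caption = "a golden retriever catching a frisbee in a park"
# Multiple samples from the same caption later serve as mutual positives.
images = pipe(caption, num_images_per_prompt=4, guidance_scale=2.0).images
```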

Taking it a step further, language supervision was added to the mix, creating an enhanced variant: StableRep+. When trained with 20 million synthetic images, StableRep+ not only achieved superior accuracy but also displayed remarkable efficiency compared with CLIP models trained with a staggering 50 million real images.
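One way to picture the added language supervision is a CLIP-style image-text alignment term computed alongside the image-image loss sketched earlier. The snippet below is only an assumption about the general shape of such a term, with hypothetical embeddings and naming; it is not the paper’s exact formulation.

```python
# Rough, hypothetical sketch of a CLIP-style language-supervision term that could
# accompany the multi-positive image loss; names and weighting are assumed.
import torch
import torch.nn.functional as F

def image_text_alignment_loss(z_img, z_txt, caption_ids, temperature=0.1):
    """z_img, z_txt: (N, D) image/text embeddings; caption_ids: (N,) prompt ids."""
    z_img = F.normalize(z_img, dim=1)
    z_txt = F.normalize(z_txt, dim=1)
    logits = z_img @ z_txt.t() / temperature
    # An image is matched with every text embedding derived from its own caption.
    targets = (caption_ids.unsqueeze(1) == caption_ids.unsqueeze(0)).float()
    targets = targets / targets.sum(dim=1, keepdim=True)
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```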

Yet, the path ahead isn’t without its potholes. The researchers candidly address several limitations, including the current slow pace of image generation, semantic mismatches between text prompts and the resultant images, potential amplification of biases, and complexities in image attribution, all of which are imperative to address for future advancements. Another issue is that StableRep requires first training the generative model on large-scale real data. The team acknowledges that starting with real data remains a necessity; however, once you have a good generative model, you can repurpose it for new tasks, like training recognition models and visual representations.


While StableRep offers a good solution by diminishing the dependency on vast real-image collections, it brings to the fore concerns regarding hidden biases within the uncurated data used for these text-to-image models. The choice of text prompts, integral to the image synthesis process, is not entirely free from bias, “indicating the essential role of meticulous text selection or possible human curation,” says Fan.

“Using the latest text-to-image models, we’ve gained unprecedented control over image generation, allowing for a diverse range of visuals from a single text input. This surpasses real-world image collection in efficiency and flexibility. It proves especially useful in specialized tasks, like balancing image variety in long-tail recognition, presenting a practical complement to using real images for training,” says Fan. “Our work signifies a step forward in visual learning, towards the goal of offering cost-effective training alternatives while highlighting the need for ongoing improvements in data quality and synthesis.”

“One dream of generative model learning has long been to be able to generate data useful for discriminative model training,” says Google DeepMind researcher and University of Toronto professor of computer science David Fleet, who was not involved in the paper. “While we have seen some signs of life, the dream has been elusive, especially on large-scale complex domains like high-resolution images. This paper provides compelling evidence, for the first time to my knowledge, that the dream is becoming a reality. They show that contrastive learning from massive amounts of synthetic image data can produce representations that outperform those learned from real data at scale, with the potential to improve myriad downstream vision tasks.”

Fan is joined by Yonglong Tian PhD ’22 as lead authors of the paper, as well as MIT associate professor of electrical engineering and computer science and CSAIL principal investigator Phillip Isola; Google researcher and OpenAI technical staff member Huiwen Chang; and Google staff research scientist Dilip Krishnan. The team will present StableRep at the 2023 Conference on Neural Information Processing Systems (NeurIPS) in New Orleans.
