A new way to edit or generate images


AI image generation — which relies on neural networks to create new images from a variety of inputs, including text prompts — is projected to become a billion-dollar industry by the end of this decade. Even with today’s technology, if you wanted to make a fantastical picture of, say, a friend planting a flag on Mars or heedlessly flying into a black hole, it could take less than a second. However, before they can perform tasks like that, image generators are commonly trained on massive datasets containing millions of images that are often paired with associated text. Training these generative models can be an arduous chore that takes weeks or months, consuming vast computational resources in the process.

But what if it were possible to generate images through AI methods without using a generator at all? That real possibility, along with other intriguing ideas, was described in a research paper presented at the International Conference on Machine Learning (ICML 2025), held in Vancouver, British Columbia, earlier this summer. The paper, describing novel techniques for manipulating and generating images, was written by Lukas Lao Beyer, a graduate student researcher in MIT’s Laboratory for Information and Decision Systems (LIDS); Tianhong Li, a postdoc at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL); Xinlei Chen of Facebook AI Research; Sertac Karaman, an MIT professor of aeronautics and astronautics and the director of LIDS; and Kaiming He, an MIT associate professor of electrical engineering and computer science.

This group effort had its origins in a class project for a graduate seminar on deep generative models that Lao Beyer took last fall. In conversations during the semester, it became apparent to both Lao Beyer and He, who taught the seminar, that this research had real potential, going far beyond the confines of a typical homework assignment. Other collaborators were soon brought into the endeavor.

The starting point for Lao Beyer’s inquiry was a June 2024 paper, written by researchers from the Technical University of Munich and the Chinese company ByteDance, which introduced a new way of representing visual information called a one-dimensional tokenizer. With this device, which is itself a kind of neural network, a 256×256-pixel image can be translated into a sequence of just 32 numbers, called tokens. “I wanted to understand how such a high level of compression could be achieved, and what the tokens themselves actually represented,” says Lao Beyer.

The previous generation of tokenizers would typically break up the same image into an array of 16×16 tokens — with each token encapsulating information, in highly condensed form, that corresponds to a specific portion of the original image. The new 1D tokenizers can encode an image more efficiently, using far fewer tokens overall, and these tokens are able to capture information about the entire image, not just a single quadrant. Each of these tokens, moreover, is a 12-digit binary number consisting of 1s and 0s, allowing for 2¹² (or about 4,000) possibilities altogether. “It’s like a vocabulary of 4,000 words that makes up an abstract, hidden language spoken by the computer,” He explains. “It’s not like a human language, but we can still try to find out what it means.”
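The compression arithmetic in the paragraph above can be made explicit with a few lines of Python. All the numbers come from the article itself; the code simply works out the comparison between the older 16×16 grid and the new 1D design.

```python
# Back-of-the-envelope comparison of the two tokenizer designs described
# above, using the figures given in the article.

BITS_PER_TOKEN = 12                  # each token is a 12-digit binary number
VOCAB_SIZE = 2 ** BITS_PER_TOKEN     # 4096 possible token values ("words")

# Older 2D tokenizers: a 16x16 grid, one token per image region.
tokens_2d = 16 * 16                  # 256 tokens
# The 1D tokenizer from the June 2024 paper: just 32 tokens per image.
tokens_1d = 32

print(VOCAB_SIZE)                    # 4096 -- the "vocabulary of 4,000 words"
print(tokens_2d // tokens_1d)        # 8 -- the 1D design uses 8x fewer tokens
print(tokens_1d * BITS_PER_TOKEN)    # 384 -- bits to describe a 256x256 image
```

Put another way, an entire 256×256-pixel image is squeezed into 384 bits, which is what makes the tokens worth studying individually.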

That’s exactly what Lao Beyer had initially set out to explore — work that provided the seed for the ICML 2025 paper. The approach he took was pretty straightforward. If you want to discover what a particular token does, Lao Beyer says, “you can just take it out, swap in some random value, and see if there is a recognizable change in the output.” Replacing one token, he found, changes the image quality, turning a low-resolution image into a high-resolution image or vice versa. Another token affected the blurriness in the background, while yet another influenced the brightness. He also found a token that’s related to the “pose,” meaning that, in the image of a robin, for instance, the bird’s head might shift from right to left.
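The probing experiment described above is simple enough to sketch in code. The sketch below is a toy illustration, not the team’s actual setup: `detokenize` stands in for the 1D tokenizer’s decoder (in reality a neural network; here just a sum, so the result is a number rather than an image), and the interface is assumed, not taken from the paper.

```python
import random

def probe_token(tokens, index, detokenize, vocab_size=4096, seed=0):
    """Swap one token for a random value and return both decodings,
    so the two outputs can be compared for a recognizable change."""
    rng = random.Random(seed)
    baseline = detokenize(tokens)
    edited = list(tokens)
    edited[index] = rng.randrange(vocab_size)  # swap in a random token value
    return baseline, detokenize(edited)

# Toy stand-in decoder for illustration only: sums the tokens it is given.
toy_decode = lambda toks: sum(toks)
tokens = [7] * 32                              # a 32-token "image"
before, after = probe_token(tokens, index=5, detokenize=toy_decode)
```

Running this repeatedly over each of the 32 positions is, in effect, how one maps out which token controls resolution, blur, brightness, or pose.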

“This was a never-before-seen result, as no one had observed visually identifiable changes from manipulating tokens,” Lao Beyer says. The finding raised the possibility of a new approach to editing images. And the MIT group has shown, in fact, how this process can be streamlined and automated, so that tokens don’t have to be modified by hand, one at a time.

He and his colleagues achieved an even more consequential result involving image generation. A system capable of generating images normally requires a tokenizer, which compresses and encodes visual data, along with a generator that can combine and arrange these compact representations in order to create novel images. The MIT researchers found a way to create images without using a generator at all. Their new approach makes use of a 1D tokenizer and a so-called detokenizer (also known as a decoder), which can reconstruct an image from a string of tokens. However, with guidance provided by an off-the-shelf neural network called CLIP — which cannot generate images on its own, but can measure how well a given image matches a given text prompt — the team was able to convert an image of a red panda, for example, into a tiger. In addition, they could create images of a tiger, or any other desired form, starting completely from scratch — from a situation in which all the tokens are initially assigned random values (and then iteratively tweaked so that the reconstructed image increasingly matches the desired text prompt).
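The generator-free loop described above — random tokens in, guidance score out, tweak, repeat — can be sketched as a simple search. This is a hedged illustration under heavy assumptions: `score` stands in for CLIP’s image–text match, `detokenize` for the 1D decoder, and the greedy coordinate search below is a stand-in for the paper’s actual optimization procedure, which is not spelled out in the article.

```python
import random

def generate_from_scratch(score, detokenize, n_tokens=32, vocab_size=4096,
                          steps=200, seed=0):
    """Start from random tokens and iteratively tweak them so the decoded
    output scores higher under a guidance model -- no generator involved."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(n_tokens)]
    best = score(detokenize(tokens))
    for _ in range(steps):
        i = rng.randrange(n_tokens)            # pick one token position
        old = tokens[i]
        tokens[i] = rng.randrange(vocab_size)  # propose a random replacement
        s = score(detokenize(tokens))
        if s >= best:
            best = s                           # keep the improvement
        else:
            tokens[i] = old                    # otherwise revert the tweak
    return tokens, best

# Toy stand-ins: the "decoder" averages the tokens, and the guidance
# rewards decodings near a target value, the way CLIP rewards images
# that match the text prompt.
decode = lambda toks: sum(toks) / len(toks)
score = lambda img: -abs(img - 1000.0)
tokens, best = generate_from_scratch(score, decode)
```

Swapping the mask of an image into the scoring step turns the same loop into the inpainting setup mentioned below: only the blotted-out tokens are tweaked, while the rest are held fixed.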

The group demonstrated that with this same setup — relying on a tokenizer and detokenizer, but no generator — they could also do “inpainting,” which means filling in parts of images that had somehow been blotted out. Avoiding the use of a generator for certain tasks could lead to a significant reduction in computational costs, because generators, as mentioned, normally require extensive training.

What may seem odd about this team’s contributions, He explains, “is that we didn’t invent anything new. We didn’t invent a 1D tokenizer, and we didn’t invent the CLIP model, either. But we did discover that new capabilities can arise when you put all these pieces together.”

“This work redefines the role of tokenizers,” comments Saining Xie, a computer scientist at New York University. “It shows that image tokenizers — tools usually used simply to compress images — can actually do a lot more. The fact that a simple (but highly compressed) 1D tokenizer can handle tasks like inpainting or text-guided editing, without having to train a full-blown generative model, is quite surprising.”

Zhuang Liu of Princeton University agrees, saying that the work of the MIT group “shows that we can generate and manipulate images in a way that is much easier than we previously thought. Basically, it demonstrates that image generation can be a byproduct of a very effective image compressor, potentially reducing the cost of generating images severalfold.”

There could be many applications outside the field of computer vision, Karaman suggests. “For instance, we could consider tokenizing the actions of robots or self-driving cars in the same way, which may rapidly broaden the impact of this work.”

Lao Beyer is thinking along similar lines, noting that the extreme amount of compression afforded by 1D tokenizers allows you to do “some amazing things,” which could be applied to other fields. For example, in the realm of self-driving cars, which is one of his research interests, the tokens could represent, instead of images, the different routes that a vehicle might take.

Xie is also intrigued by the applications that may come from these innovative ideas. “There are some really cool use cases this could unlock,” he says.
