DALL·E 2 pre-training mitigations

We observed that our internal predecessors to DALL·E 2 would sometimes reproduce training images verbatim. This behavior was undesirable, since we would like DALL·E 2 to create original, unique images by default and not just “stitch together” pieces of existing images. Additionally, reproducing training images verbatim can raise legal questions around copyright infringement, ownership, and privacy (if people’s photos were present in training data).

To better understand the issue of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used a trained model to sample images for 50,000 prompts from our training dataset, and sorted the samples by perceptual similarity to the corresponding training image. Finally, we inspected the top matches by hand, finding only a few hundred true duplicate pairs out of the 50k total prompts. Even though the regurgitation rate appeared to be less than 1%, we felt it was necessary to push the rate down to 0 for the reasons stated above.
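The ranking step described above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes each image has already been mapped to a feature vector by some perceptual embedding model (not specified here), and it uses cosine similarity as a stand-in for the perceptual similarity measure. The function name `rank_by_similarity` and the synthetic data are hypothetical.

```python
import numpy as np

def rank_by_similarity(generated, training, top_k=5):
    """Return (prompt index, cosine similarity) for the top_k closest pairs.

    Row i of `generated` is the feature vector of the sample drawn for
    prompt i; row i of `training` is the feature vector of the training
    image that prompt i was taken from.
    """
    g = generated / np.linalg.norm(generated, axis=1, keepdims=True)
    t = training / np.linalg.norm(training, axis=1, keepdims=True)
    sims = np.sum(g * t, axis=1)        # cosine similarity of each pair
    order = np.argsort(-sims)[:top_k]   # most similar pairs first
    return [(int(i), float(sims[i])) for i in order]

# Synthetic stand-in data (assumption): random features, with two
# generated samples made into near-copies of their training images.
rng = np.random.default_rng(0)
training = rng.normal(size=(1000, 64))
generated = rng.normal(size=(1000, 64))
for i in (3, 42):
    generated[i] = training[i] + 0.01 * rng.normal(size=64)

for index, similarity in rank_by_similarity(generated, training, top_k=5):
    print(index, round(similarity, 3))
```

Only the top of this ranking would then be inspected by hand, which is what keeps the manual review tractable for 50k prompts.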

When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were likely easy to memorize due to their low information content. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic which looks like a clock showing the time 1 o’clock, but then we would discover a training sample containing the same clock showing 2 o’clock, and then 3 o’clock, etc. Once we realized this, we used a distributed nearest neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other works have observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.

The above finding suggested that, if we deduplicated our dataset, we might solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked similar, and then remove all but one image from each group.^{[^footnote-2]}

However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since our whole dataset contains hundreds of millions of images, we would naively need to check hundreds of quadrillions of image pairs to find all the duplicates. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost.

Consider what happens if we cluster our dataset before performing deduplication. Since nearby samples often fall into the same cluster, most of the duplicate pairs would not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of the cluster, while only missing a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer have to check every single pair of images.^{[^footnote-3]}
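To make the saving concrete, here is a rough sketch of the pair-count arithmetic and of deduplication restricted to clusters. Several things here are assumptions for illustration only: the dataset size of 500 million (the text only says "hundreds of millions"), equally sized clusters, the cosine-similarity threshold of 0.95, and the greedy keep-first rule. The cluster labels are taken as given; any clustering method could supply them.

```python
import numpy as np

def pair_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

# Hypothetical dataset size; K = 1024 is the cluster count quoted in the text.
n_images, k = 500_000_000, 1024
naive_pairs = pair_count(n_images)               # every pair in the dataset
clustered_pairs = k * pair_count(n_images // k)  # pairs inside equal-size clusters
print(f"naive: {naive_pairs:.2e}, clustered: {clustered_pairs:.2e}")

def dedup_within_clusters(features, labels, threshold=0.95):
    """Greedy within-cluster deduplication; returns sorted indices to keep.

    An image is kept only if its cosine similarity to every image already
    kept in the same cluster is below `threshold`. Images in different
    clusters are never compared, so cross-cluster duplicates are missed.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    keep = []
    for c in np.unique(labels):
        kept_here = []
        for i in np.flatnonzero(labels == c):
            if all(unit[i] @ unit[j] < threshold for j in kept_here):
                kept_here.append(i)
        keep.extend(kept_here)
    return sorted(int(i) for i in keep)

# Toy data: images 0, 1 and 4 are near-duplicates, but 4 sits in another cluster.
features = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0],
                     [0.01, 0.999], [1.0, 0.001], [-1.0, 0.0]])
labels = np.array([0, 0, 1, 1, 1, 1])
print(dedup_within_clusters(features, labels))
```

In the toy data, image 4 survives even though it duplicates image 0, because the two were assigned to different clusters. That is exactly the kind of missed pair the clustered approach trades away for speed.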

When we tested this approach empirically on a small subset of our data, it found 85% of all duplicate pairs when using K=1024 clusters.

To improve the success rate of the above algorithm, we leveraged one key observation: when you cluster different random subsets of a dataset, the resulting cluster decision boundaries are often quite different. Therefore, if a duplicate pair crosses a cluster boundary for one clustering of the data, the same pair might fall inside a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on using five clusterings, which means that we search for duplicates of each image in the union of five different clusters. This found 97% of all duplicate pairs on a subset of our data.
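The multiple-clustering idea can be sketched as below. This is an illustrative toy, not the production algorithm: it assumes plain k-means (written out in NumPy) as the clustering method, fits each clustering on a different random subset, and then checks whether a known duplicate pair shares a cluster in at least one clustering. All function names, the subset size, and the synthetic data are hypothetical.

```python
import numpy as np

def assign(x, centers):
    """Index of the nearest center (squared Euclidean distance) per row of x."""
    d = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def fit_centers(x, k, iters, rng):
    """Plain k-means on x; returns the k cluster centers."""
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:          # leave empty clusters where they are
                centers[j] = members.mean(axis=0)
    return centers

def multi_cluster_labels(x, k, n_clusterings, subset_size, seed=0):
    """Fit each clustering on a different random subset, then label all of x."""
    rng = np.random.default_rng(seed)
    all_labels = []
    for _ in range(n_clusterings):
        subset = x[rng.choice(len(x), size=subset_size, replace=False)]
        centers = fit_centers(subset, k, iters=10, rng=rng)
        all_labels.append(assign(x, centers))
    return np.stack(all_labels)           # shape (n_clusterings, n_points)

def recall(labels, pairs):
    """Fraction of pairs that share a cluster in at least one clustering."""
    hits = [bool(np.any(labels[:, i] == labels[:, j])) for i, j in pairs]
    return sum(hits) / len(hits)

# Synthetic data (assumption): 200 points plus one noisy copy of each.
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 8))
x = np.concatenate([base, base + 0.05 * rng.normal(size=base.shape)])
pairs = [(i, i + 200) for i in range(200)]
labels = multi_cluster_labels(x, k=16, n_clusterings=5, subset_size=100)
for m in (1, 5):
    print(m, "clustering(s): recall", recall(labels[:m], pairs))
```

Because a pair counts as found if it shares a cluster in any clustering, recall can only stay the same or rise as clusterings are added. The cost is that each image is compared against the union of its clusters rather than a single one.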

Surprisingly, almost a quarter of our dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them included meaningful changes. Recall the clock example from above: the dataset might include many images of the same clock at different times of day. While these images are likely to make the model memorize this particular clock’s appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like this might have hurt the model’s performance.

To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations we used to evaluate our original GLIDE model. Surprisingly, we found that human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large amount of redundant images in the dataset was actually hurting performance.

Once we had a model trained on deduplicated data, we reran the regurgitation search we had previously done over 50k prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for the image from the training dataset. To take this test one step further, we also performed a nearest neighbor search over the entire training dataset for each of the 50k generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a case of image regurgitation.
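The more thorough check can be sketched as a brute-force nearest neighbor search. This is only a small-scale illustration under stated assumptions: images are represented by feature vectors from some unspecified embedding model, cosine similarity stands in for perceptual similarity, and the 0.95 threshold is arbitrary. A search over hundreds of millions of images would need a distributed or approximate index rather than the dense matrix product used here.

```python
import numpy as np

def nearest_training_neighbors(generated, training, chunk_size=1024):
    """For each generated image, find its most similar training image.

    Returns (indices, similarities): the index of the nearest training
    image by cosine similarity, and that similarity value.
    """
    g = generated / np.linalg.norm(generated, axis=1, keepdims=True)
    t = training / np.linalg.norm(training, axis=1, keepdims=True)
    best_idx = np.empty(len(g), dtype=np.int64)
    best_sim = np.empty(len(g), dtype=np.float64)
    # Process queries in chunks so the similarity matrix stays small.
    for start in range(0, len(g), chunk_size):
        sims = g[start:start + chunk_size] @ t.T
        best_idx[start:start + chunk_size] = sims.argmax(axis=1)
        best_sim[start:start + chunk_size] = sims.max(axis=1)
    return best_idx, best_sim

def flag_regurgitations(generated, training, threshold=0.95):
    """List (generated index, training index, similarity) above threshold."""
    idx, sim = nearest_training_neighbors(generated, training)
    return [(int(q), int(idx[q]), float(sim[q]))
            for q in np.flatnonzero(sim >= threshold)]

# Synthetic data (assumption): generated image 7 copies training image 123,
# which is not the training image paired with its own prompt.
rng = np.random.default_rng(0)
training = rng.normal(size=(500, 32))
generated = rng.normal(size=(20, 32))
generated[7] = training[123] + 0.01 * rng.normal(size=32)
print(flag_regurgitations(generated, training))
```

Unlike a pairwise comparison against the prompt's own training image, this search catches a generated sample that matches any training image, which is the case the final test was designed to rule out.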
