Open Preference Dataset for Text-to-Image Generation by the 🤗 Community




The Data is Better Together community is releasing yet another important dataset for open-source development. Due to the shortage of open preference datasets for text-to-image generation, we set out to release an Apache 2.0 licensed dataset for text-to-image generation. This dataset is focused on text-to-image preference pairs across common image generation categories, while mixing different model families and varying prompt complexities.

TL;DR? All results can be found in this collection on the Hugging Face Hub, and the code for pre- and post-processing can be found in this GitHub repository. Most importantly, there is a ready-to-go preference dataset and a flux-dev-lora-finetune. If you want to show your support already, don't forget to like, subscribe and follow us before you continue reading.

Unfamiliar with the Data is Better Together community?

[Data is Better Together](https://huggingface.co/data-is-better-together) is a collaboration between 🤗 Hugging Face and the open-source AI community. We aim to empower the open-source community to build impactful datasets collectively. You can follow the organization to stay up to date with the latest datasets, models, and community sprints.

Similar efforts

There have been several efforts to create an open image preference dataset, but our effort is unique due to the varying complexity and categories of the prompts, alongside the openness of the dataset and the code used to create it. The following are some of those efforts:

- [yuvalkirstain/pickapic_v2](https://huggingface.co/datasets/yuvalkirstain/pickapic_v2)
- [fal.ai/imgsys](https://imgsys.org/)
- [TIGER-Lab/GenAI-Arena](https://huggingface.co/spaces/TIGER-Lab/GenAI-Arena)
- [artificialanalysis image arena](https://artificialanalysis.ai/text-to-image/arena)



The input dataset

To get a proper input dataset for this sprint, we started with some base prompts, which we cleaned, filtered for toxicity, and injected with categories and complexities using synthetic data generation with distilabel. Lastly, we used Flux and Stable Diffusion models to generate the images. This resulted in the open-image-preferences-v1 dataset.



Input prompts

Imgsys is a generative image model arena hosted by fal.ai, where people provide prompts and choose between two model generations to give a preference. Sadly, the generated images are not published publicly; however, the associated prompts are hosted on Hugging Face. These prompts represent real-life usage of image generation, containing good examples focused on day-to-day generation, but this real-life usage also meant the data contained duplicate and toxic prompts, so we had to look at the data and do some filtering.
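As a rough sketch, prompts like these can be pulled straight from the Hub with the datasets library and deduplicated locally. The dataset ID and column name below are assumptions based on the imgsys link above, not confirmed by this post:

```python
# Hypothetical sketch: load the imgsys prompts from the Hub and deduplicate.
# "fal/imgsys-results" and the "prompt" column are assumed names.
from datasets import load_dataset

prompts = load_dataset("fal/imgsys-results", split="train")
unique_prompts = sorted({row["prompt"] for row in prompts})
print(f"{len(unique_prompts)} unique prompts")
```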



Reducing toxicity

We aimed to remove all NSFW prompts and images from the dataset before starting the community sprint. We settled on a multi-model approach where we used two text-based and two image-based classifiers as filters. Post-filtering, we did a manual check of every one of the images to ensure no toxic content was left; luckily, we found that our approach had worked.

We used the following pipeline (a filtering sketch follows the list):

  • Classify images as NSFW
  • Remove all positive samples
  • Argilla team manually reviews the dataset
  • Repeat based on review
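Below is a minimal sketch of the image classification step. The post does not name the specific classifiers used, so a publicly available NSFW detector from the Hub stands in here:

```python
# Minimal sketch of the NSFW image filtering step, assuming
# Falconsai/nsfw_image_detection as a stand-in classifier.
from PIL import Image
from transformers import pipeline

nsfw_classifier = pipeline("image-classification", model="Falconsai/nsfw_image_detection")

def is_nsfw(image: Image.Image, threshold: float = 0.5) -> bool:
    # Flag the image when the "nsfw" label scores above the threshold.
    scores = {pred["label"]: pred["score"] for pred in nsfw_classifier(image)}
    return scores.get("nsfw", 0.0) >= threshold

# Positive samples (is_nsfw(...) == True) are dropped before manual review.
```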



Synthetic prompt enhancement

Data diversity is important for data quality, which is why we decided to enhance our dataset by synthetically rewriting prompts based on various categories and complexities. This was done using a distilabel pipeline (a sketch follows the example table below).

| Type | Prompt | Image |
| --- | --- | --- |
| Default | a harp without strings | *default harp image* |
| Stylized | a harp without strings, in an anime style, with intricate details and flowing lines, set against a dreamy, pastel background | *stylized harp image* |
| Quality | a harp without strings, in an anime style, with intricate details and flowing lines, set against a dreamy, pastel background, bathed in soft golden hour light, with a serene mood and rich textures, high resolution, photorealistic | *quality harp image* |
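The following is a minimal sketch of such a rewriting pipeline with distilabel. The model choice, template, and data are illustrative assumptions, not the exact configuration used for the sprint:

```python
# Minimal distilabel sketch for rewriting a base prompt into a stylized
# variant; the LLM and template are illustrative assumptions.
from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import LoadDataFromDicts
from distilabel.steps.tasks import TextGeneration

with Pipeline(name="prompt-enhancement") as pipeline:
    load = LoadDataFromDicts(data=[{"instruction": "a harp without strings"}])
    rewrite = TextGeneration(
        llm=InferenceEndpointsLLM(model_id="meta-llama/Meta-Llama-3.1-70B-Instruct"),
        template=(
            "Rewrite this image generation prompt in an anime style, "
            "adding intricate details: {{ instruction }}"
        ),
        columns=["instruction"],
    )
    load >> rewrite

if __name__ == "__main__":
    distiset = pipeline.run()
```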



Prompt categories

InstructGPT describes foundational task categories for text-to-text generation, but there is no clear equivalent for text-to-image generation. To alleviate this, we used two main sources as input for our categories: google/sdxl and Microsoft. This led to the following main categories: [“Cinematic”, “Photographic”, “Anime”, “Manga”, “Digital art”, “Pixel art”, “Fantasy art”, “Neonpunk”, “3D Model”, “Painting”, “Animation”, “Illustration”]. On top of that, we also selected some mutually exclusive sub-categories to allow us to further diversify the prompts. These categories and sub-categories were randomly sampled and are therefore roughly equally distributed across the dataset.
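Uniform random sampling is enough to get the roughly equal distribution mentioned above; a minimal sketch (sub-categories omitted, since the post does not list them):

```python
import random

CATEGORIES = [
    "Cinematic", "Photographic", "Anime", "Manga", "Digital art", "Pixel art",
    "Fantasy art", "Neonpunk", "3D Model", "Painting", "Animation", "Illustration",
]

def sample_category(rng: random.Random) -> str:
    # Uniform sampling keeps the categories roughly equally distributed.
    return rng.choice(CATEGORIES)

rng = random.Random(42)
assignments = [sample_category(rng) for _ in range(10_000)]
```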



Prompt complexities

The Deita paper showed that evolving the complexity and diversity of prompts leads to better model generations and fine-tunes; however, humans don't always take the time to write extensive prompts. Therefore, we decided to use the same prompt in both a complex and a simplified form as two datapoints for different preference generations.



Image generation

The ArtificialAnalysis/Text-to-Image-Leaderboard shows an overview of the best performing image models. We chose two of the best performing models based on their license and their availability on the Hub. Additionally, we made sure that the models belonged to different model families, so as not to skew the generations across the different categories. Therefore, we selected stabilityai/stable-diffusion-3.5-large and black-forest-labs/FLUX.1-dev. Each of these models was then used to generate an image for both the simplified and the complex prompt within the same stylistic categories.

(Figure: the image generation pipeline)
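As a rough sketch with diffusers, this means one image per model for the same prompt; the sampler settings below are illustrative defaults, not the sprint's exact parameters:

```python
# Sketch: generate one image per model for the same prompt.
# Generation settings are illustrative, not the sprint's exact config.
import torch
from diffusers import FluxPipeline, StableDiffusion3Pipeline

prompt = "a harp without strings, in an anime style"

flux = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
flux_image = flux(prompt, num_inference_steps=30, guidance_scale=3.5).images[0]

sd35 = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")
sd35_image = sd35(prompt, num_inference_steps=28, guidance_scale=4.5).images[0]
```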



The results

A raw export of all the annotated data contains responses to a multiple-choice question, where each annotator selected whether one of the models was better, both models performed well, or both models performed badly. Based on this, we got to look at the annotator alignment and the model performance across categories, and even do a model fine-tune, which you can already play with on the Hub! The following sketch shows one way to load the annotated dataset:
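The repository ID below is assumed from the open-image-preferences-v1 name mentioned earlier; check the collection on the Hub for the exact repository:

```python
# Sketch: load the annotated preference dataset from the Hub.
# The repository ID is assumed from the names above, not confirmed here.
from datasets import load_dataset

ds = load_dataset("data-is-better-together/open-image-preferences-v1", split="train")
print(ds)     # inspect the dataset features
print(ds[0])  # look at a single annotated preference pair
```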



Annotator alignment

Annotator agreement is a way to check the validity of a task. Whenever a task is too hard, annotators won't be aligned, and whenever a task is too easy, they might be aligned too much. Striking a balance is rare, but we managed to get it spot on during this sprint. We did this analysis using the Hugging Face datasets SQL console. Overall, SD3.5-XL was slightly more likely to win within our test setup.
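The same tallies can be done locally; here is a hypothetical sketch with pandas, where the file and column names are assumptions about the raw export's schema rather than the actual one:

```python
# Hypothetical sketch: per-example annotator agreement and overall win counts.
# "annotations.parquet", "example_id" and "choice" are assumed names.
import pandas as pd

df = pd.read_parquet("annotations.parquet")  # raw export (hypothetical file)

# Fraction of annotators agreeing with the majority vote, averaged over examples.
agreement = (
    df.groupby("example_id")["choice"]
      .agg(lambda votes: votes.value_counts(normalize=True).iloc[0])
      .mean()
)
print(f"Mean majority-vote agreement: {agreement:.2%}")
print(df["choice"].value_counts())  # overall wins per option
```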



Model performance

Given the annotator alignment, both models proved to perform well in their own right, so we did an additional analysis to see if there were differences across the categories. In short, FLUX-dev works better for anime, and SD3.5-XL works better for art and cinematic scenarios.

  • Tie: Photographic, Animation
  • FLUX-dev higher: 3D Model, Anime, Manga
  • SD3.5-XL higher: Cinematic, Digital art, Fantasy art, Illustration, Neonpunk, Painting, Pixel art



Model fine-tune

To verify the quality of the dataset without spending too much time and resources, we decided to do a LoRA fine-tune of the black-forest-labs/FLUX.1-dev model based on the diffusers example on GitHub. During this process, we used the chosen samples as expected completions for the FLUX-dev model and disregarded the rejected samples. Interestingly, the fine-tuned model performs significantly better in the art and cinematic scenarios where it was initially lacking! You can test the fine-tuned adapter here.

| Prompt | Original | Fine-tune |
| --- | --- | --- |
| a ship in the canals of Venice, painted in gouache with soft, flowing brushstrokes and vibrant, translucent colours, capturing the serene reflection on the water under a misty atmosphere, with rich textures and a dynamic perspective | *original Venice image* | *fine-tune Venice image* |
| A vibrant orange poppy flower, enclosed in an ornate golden frame, against a black backdrop, rendered in anime style with bold outlines, exaggerated details, and dramatic chiaroscuro lighting. | *original flower image* | *fine-tune flower image* |
| Grainy shot of a robot cooking in the kitchen, with soft shadows and nostalgic film texture. | *original robot image* | *fine-tune robot image* |
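Loading the adapter on top of the base model takes a couple of lines with diffusers; the adapter repository ID below is an assumption based on the names in this post:

```python
# Sketch: run the LoRA adapter on top of FLUX.1-dev.
# The adapter repo ID is an assumed name, not confirmed by this post.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("data-is-better-together/open-image-preferences-v1-flux-dev-lora")

image = pipe(
    "a ship in the canals of Venice, painted in gouache",
    num_inference_steps=30,
).images[0]
image.save("venice.png")
```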


The community

In short, we annotated 10K preference pairs with an annotator overlap of 2/3, which resulted in over 30K responses in less than 2 weeks from over 250 community members! The image leaderboard shows that some community members even gave more than 5K preferences. We want to thank everyone who participated in this sprint, with a special thanks to the top 3 users, who will each get a month of Hugging Face Pro membership. Make sure to follow them on the Hub: aashish1904, prithivMLmods, Malalatiana.

(Figure: the community leaderboard)



What’s next?

After another successful community sprint, we will continue organising them on the Hugging Face Hub. Make sure to follow the Data Is Better Together organisation to stay updated. We also encourage community members to take action themselves and are happy to guide and reshare on socials and within the organisation on the Hub. You can contribute in several ways:

  • Join and participate in other sprints.
  • Propose your own sprints or requests for high-quality datasets.
  • Fine-tune models on top of the preference dataset. One idea would be to do a full SFT fine-tune of SDXL or FLUX-schnell. Another idea would be to do a DPO/ORPO fine-tune.
  • Evaluate the improved performance of the LoRA adapter compared to the original SD3.5-XL and FLUX-dev models.


