Welcome back! This is the second part of our series on training efficient text-to-image models from scratch.
In the first post of this series, we introduced our goal: training a competitive text-to-image foundation model entirely from scratch, in the open, and at scale. We focused mainly on architecture and motivated the core design decisions behind our model, PRX.
We also released an early, small (1.2B parameters) version of the model as a preview of what we’re building (go try it if you haven’t already 😉).
In this post, we shift our focus from architecture to training. The goal is to document what actually moved the needle for us when trying to make models train faster, converge more reliably, and learn better representations. The field is moving quickly and the list of “training tricks” keeps growing, so rather than attempting an exhaustive survey, we structured this as an experimental logbook: we reproduce (or adapt) a set of recent ideas, implement them in a consistent setup, and report how they affect optimization and convergence in practice. Finally, we don’t only report these techniques in isolation; we also explore which of them remain useful when combined.
In the next post, we will publish the full training recipe as code, including the experiments in this post. We will also run and report on a public “speedrun” where we put the best pieces together into a single configuration and stress-test it end-to-end.
This exercise will serve both as a stress test of our current training pipeline and as a concrete demonstration of how far careful training design can go under tight constraints.
If you haven’t already, we invite you to join our Discord to continue the discussion. A significant part of this project has been shaped by exchanges with community members, and we place a high value on external feedback, ablations, and alternative interpretations of the results.
The Baseline
Before introducing any training-efficiency techniques, we first establish a clean reference run. This baseline is intentionally simple. It uses standard components, avoids auxiliary objectives, and does not rely on architectural shortcuts or tricks to save compute. Its role is to serve as a stable point of comparison for all subsequent experiments.
Concretely, this is a pure Flow Matching (Lipman et al., 2022) training setup (as introduced in Part 1) with no extra objectives and no architectural speed hacks.
We will use the small PRX-1.2B model we presented in the first post of this series (a single-stream architecture with global attention over image and text tokens) as our baseline and train it in the Flux VAE latent space, keeping the configuration fixed across all comparisons unless stated otherwise.
The baseline training setup is as follows:
| Setting | Value |
|---|---|
| Steps | 100k |
| Dataset | 1M public synthetic images generated with MidJourneyV6 |
| Resolution | 256×256 |
| Global batch size | 256 |
| Optimizer | AdamW |
| lr | 1e-4 |
| weight_decay | 0.0 |
| eps | 1e-15 |
| betas | (0.9, 0.95) |
| Text encoder | GemmaT5 |
| Positional encoding | Rotary (RoPE) |
| Attention mask | Padding mask |
| EMA | Disabled |
This baseline configuration provides a clear and reproducible anchor. It allows us to attribute observed improvements and regressions to specific training interventions, rather than to shifting hyperparameters or hidden setup changes.
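For concreteness, here is a minimal sketch of the baseline optimizer setup in PyTorch, with the hyperparameters from the table above (the `model` variable is a placeholder for the PRX-1.2B denoiser):

```python
import torch

# Baseline optimizer configuration (values from the table above).
# `model` is a placeholder for the PRX-1.2B denoiser.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,
    betas=(0.9, 0.95),
    eps=1e-15,
    weight_decay=0.0,
)
```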
Throughout the rest of this post, every technique is evaluated against this reference with a single guiding question in mind:
Does this modification improve convergence or training efficiency relative to the baseline?
Examples of baseline model generations after 100K training steps.
Benchmarking Metrics
To keep this post grounded, we rely on a small set of metrics to monitor checkpoints over time. None of them is a perfect proxy for perceived image quality, but together they provide a practical scoreboard while we iterate.
- Fréchet Inception Distance (FID) (Heusel et al., 2017): Measures how close the distributions of generated and real images are, using Inception-v3 feature statistics (mean and covariance). Lower values typically correlate with better sample fidelity.
- CLIP Maximum Mean Discrepancy (CMMD) (Jayasumana et al., 2024): Measures the distance between real and generated image distributions using CLIP image embeddings and Maximum Mean Discrepancy (MMD). Unlike FID, CMMD does not assume Gaussian feature distributions and can be more sample-efficient; in practice it often tracks perceptual quality better than FID, though it remains an imperfect proxy.
- DINOv2 Maximum Mean Discrepancy (DinoMMD): The same MMD-based distance as CMMD, but computed on DINOv2 (Oquab et al., 2023) image embeddings instead of CLIP. This provides a complementary view of distribution shift under a self-supervised vision backbone.
- Network throughput: Average number of samples processed per second (samples/s), as a measure of end-to-end training efficiency; in the result tables below we report it as batches/sec.
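To illustrate the MMD-based metrics (CMMD / DinoMMD), here is a minimal sketch of an MMD² estimate with a Gaussian RBF kernel over precomputed image embeddings. The bandwidth value and the simple biased estimator are illustrative choices, not our exact evaluation code:

```python
import torch

def mmd2_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 10.0) -> torch.Tensor:
    """Biased MMD^2 estimate between two sets of embeddings with an RBF kernel.

    x: [N, D] embeddings of real images (e.g. CLIP or DINOv2 features).
    y: [M, D] embeddings of generated images.
    """
    def rbf(a, b):
        # Pairwise squared Euclidean distances, then Gaussian kernel.
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2))

    k_xx = rbf(x, x).mean()
    k_yy = rbf(y, y).mean()
    k_xy = rbf(x, y).mean()
    return k_xx + k_yy - 2 * k_xy
```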
With the scoreboard defined, we can now dive into the methods we explored, grouped into four buckets: Representation Alignment, Training Objectives, Token Routing and Sparsification, and Data.
Representation Alignment
Diffusion and flow models are typically trained with a single objective: predict a noise-like target (or vector field) from a corrupted input. Early in training, that one objective is doing two jobs at once: it must build a useful internal representation and learn to denoise on top of it. Representation alignment makes this explicit by keeping the denoising objective and adding an auxiliary loss that directly supervises intermediate features using a strong, frozen vision encoder. This tends to speed up early learning and bring the model’s features closer to those of modern self-supervised encoders. As a result, you often need less compute to hit the same quality.
A useful way to view it is to decompose the denoiser into an implicit encoder that produces intermediate hidden states, and a decoder that maps those states to the denoising target. The claim is that representation learning is the bottleneck: diffusion and flow transformers do learn discriminative features, but they lag behind foundation vision encoders when training is compute-limited. Therefore, borrowing a strong representation space can make the denoising problem easier.
REPA (Yu et al., 2024)
REPA adds a representation-matching term on top of the base flow-matching objective.
Let $x_0$ be a clean sample and $x_1 \sim \mathcal{N}(0, I)$ be the noise sample. The model is trained on an interpolated state $x_t = (1-t)\,x_0 + t\,x_1$ (for $t \in [0,1]$) and predicts a vector field $v_\theta(x_t, t)$. In REPA, a pretrained vision encoder processes the clean sample $x_0$ to produce patch embeddings $y_0 \in \mathbb{R}^{N \times D}$, where $N$ is the number of patch tokens and $D$ is the teacher embedding dimension. In parallel, the denoiser processes $x_t$ and produces intermediate hidden tokens $h_t$ (one token per patch). A small projection head $h_\phi$ maps these student hidden tokens into the teacher embedding space, and an auxiliary loss maximizes patch-wise similarity between corresponding teacher and student tokens:

$$\mathcal{L}_{\text{REPA}}(\theta,\phi) = -\,\mathbb{E}_{x_0,x_1,t}\Big[\frac{1}{N}\sum_{n=1}^{N} \operatorname{sim}\big(y_{0,[n]},\, h_\phi(h_{t,[n]})\big)\Big]$$

Here $n$ indexes patch tokens, $y_{0,[n]}$ is the teacher embedding for patch $n$, $h_{t,[n]}$ is the corresponding student hidden token at time $t$, and $\operatorname{sim}(\cdot,\cdot)$ is typically cosine similarity.

This term is combined with the main flow-matching loss:

$$\mathcal{L}(\theta,\phi) = \mathcal{L}_{\text{FM}}(\theta) + \lambda\,\mathcal{L}_{\text{REPA}}(\theta,\phi)$$

with $\lambda$ controlling the trade-off.

In practice, the student is trained to produce noise-robust, data-consistent patch representations from $x_t$, so later layers can focus on predicting the vector field and generating details rather than rediscovering a semantic scaffold from scratch.
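To make the auxiliary objective concrete, here is a minimal sketch of the REPA loss, assuming the teacher patch embeddings and the student hidden tokens have already been extracted and spatially matched (the names `proj_head`, `repa_weight`, etc. are placeholders, not our exact implementation):

```python
import torch
import torch.nn.functional as F

def repa_loss(student_tokens: torch.Tensor,   # [B, N, d_student] hidden tokens from the denoiser at x_t
              teacher_tokens: torch.Tensor,   # [B, N, d_teacher] frozen encoder embeddings of x_0
              proj_head: torch.nn.Module) -> torch.Tensor:
    """Negative mean patch-wise cosine similarity between projected student and teacher tokens."""
    projected = proj_head(student_tokens)                          # [B, N, d_teacher]
    sim = F.cosine_similarity(projected, teacher_tokens, dim=-1)   # [B, N]
    return -sim.mean()

# total_loss = flow_matching_loss + repa_weight * repa_loss(h_t, y_0, proj_head)
```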
What we observed
We ran REPA on top of our baseline PRX training, using two frozen teachers: DINOv2 and DINOv3 (Siméoni et al., 2025). The pattern was very consistent: adding alignment improves quality metrics, and the stronger teacher helps more, at the cost of a bit of speed.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.2 | 0.41 | 0.39 | 3.95 |
| REPA-Dinov3 | 14.64 | 0.35 | 0.3 | 3.46 |
| REPA-Dinov2 | 16.6 | 0.39 | 0.31 | 3.66 |
On the quality metrics, both teachers improve over the baseline. The effect is strongest with DINOv3, which achieves the best overall numbers in this run.
REPA is not free: we pay for an extra frozen teacher forward pass and the patch-level similarity loss, which shows up as a throughput drop from 3.95 batches/s to 3.66 (DINOv2) or 3.46 (DINOv3). In other words, DINOv3 prioritizes maximum representation quality at the cost of slower training, while DINOv2 offers a more efficient trade-off, still delivering substantial gains with a smaller slowdown.
Our practical takeaway is that REPA is a strong lever for text-to-image training. In our setup, the throughput trade-off is real and the net speedup (time required to reach a given level of image quality) felt a bit less dramatic than what the authors report on ImageNet-style, class-conditioned generation. That said, the quality gains are still clearly significant. Qualitatively, we also saw the difference early: after ~100K steps, samples trained with alignment tended to lock in cleaner global structure and more coherent layouts, which makes it easy to see why REPA (and alignment variants more broadly) have become a go-to ingredient in modern T2I training recipes.
iREPA (Singh et al., 2025)
A natural follow-up to REPA is: what exactly should we be aligning? iREPA argues that the answer is spatial structure, not global semantics. Across a large sweep of 27 vision encoders, the authors find that ImageNet-style “global” quality (e.g., linear-probe accuracy on patch tokens) is only weakly predictive of downstream generation quality under REPA, while simple measures of patch-token spatial self-similarity correlate far more strongly with FID. Based on that diagnosis, iREPA makes two tiny but targeted changes to the REPA recipe to better preserve and transfer spatial information:
- Replace the standard MLP projection head with a lightweight 3×3 convolutional projection operating on the patch grid.
- Apply a spatial normalization to teacher patch tokens that removes a global overlay (the mean across spatial locations) to increase local contrast.
Despite amounting to “less than 4 lines of code”, these tweaks consistently speed up convergence and improve quality across encoders, model sizes, and even REPA-adjacent training recipes.
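Below is a minimal sketch of the two tweaks as we understand them, assuming a square patch grid (the module and function names are ours, not the paper’s code):

```python
import torch
import torch.nn as nn

class ConvProjectionHead(nn.Module):
    """3x3 convolutional projection over the patch grid, replacing the MLP head."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.conv = nn.Conv2d(d_student, d_teacher, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: [B, N, d_student]
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)                        # assumes a square patch grid
        grid = tokens.transpose(1, 2).reshape(b, d, h, w)
        out = self.conv(grid)                        # [B, d_teacher, H, W]
        return out.flatten(2).transpose(1, 2)        # [B, N, d_teacher]

def spatially_normalize(teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Remove the global component (mean over spatial locations) from teacher patch tokens."""
    return teacher_tokens - teacher_tokens.mean(dim=1, keepdim=True)
```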
What we observed
In our setup, we observed a similar boost when applying the iREPA spatial tweaks on top of DINOv2: convergence was a bit smoother and the metrics improved more steadily over the first 100K steps. Interestingly, the same changes did not transfer as cleanly when applied on top of a DINOv3 teacher; they tended to degrade performance rather than help. We don’t want to over-interpret that result: this could easily be an interaction with our specific architecture, resolution/patching, loss weighting, or even small implementation details. Still, given this inconsistency across teachers, we will likely not include these tweaks in our default recipe, even though they remain an interesting option to revisit when tuning for a specific setup.
About Using REPA During the Full Training:
The paper REPA Works Until It Doesn’t: Early-Stopped, Holistic Alignment Supercharges Diffusion Training (Wang et al., 2025) highlights a key caveat: REPA is a strong early accelerator, but it can plateau and even become a brake later in training. The authors describe a capacity mismatch: once the generative model starts fitting the full data distribution (especially high-frequency details), forcing it to stay near a frozen recognition encoder’s lower-dimensional embedding manifold becomes constraining. Their practical takeaway is simple: keep alignment for the “burn-in” phase, then turn it off with a stage-wise schedule.
We observed the same qualitative pattern in our own runs. When training our preview model, removing REPA after ~200K steps noticeably improved the overall feel of image quality: textures, micro-contrast, and fine detail continued to sharpen instead of looking slightly muted. For that reason, we also recommend treating representation alignment as a transient scaffold: use it to get fast early progress, then drop it once the model’s own generative features have caught up.
Alignment in the Token Latent Space
So far, “alignment” has meant regularizing the generator’s internal features against a frozen teacher while treating the tokenizer / latent space as fixed. A more direct lever is to shape the latent space itself so the representation presented to the flow backbone is intrinsically easier to model, without sacrificing the reconstruction fidelity needed for editing and downstream workflows.
REPA-E (Leng et al., 2025) makes this concrete. Its starting point is a failure mode: if you simply backprop the diffusion / flow loss into the VAE, the tokenizer quickly learns a pathologically simple latent space for the denoiser, which can even degrade final generation quality. REPA-E’s fix is a two-signal training recipe:
- keep the diffusion loss, but apply a stop-gradient so it only updates the latent diffusion model (not the VAE);
- update both the VAE and the diffusion model using an end-to-end REPA alignment loss.
Thanks to these two tricks, the tokenizer is explicitly optimized to produce latents that yield better alignment and, empirically, better generations.
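A deliberately naive sketch of the two gradient pathways is shown below; in a real implementation the two passes would share computation, and every name here (`vae`, `denoiser`, `make_noisy`, `repa_loss`, `teacher`, `proj_head`, `repa_weight`) is a placeholder:

```python
# Sketch of the REPA-E two-signal recipe (naive two-pass rendering for clarity).
latents = vae.encode(images)                       # gradients can reach the VAE through here
x_t, t, v_target = make_noisy(latents)             # placeholder: sample t, interpolate, build target

# 1) Flow-matching loss with a stop-gradient on the latent path:
#    it updates the denoiser only, never the VAE.
v_pred, _ = denoiser(x_t.detach(), t, captions, return_hidden=True)
fm_loss = ((v_pred - v_target.detach()) ** 2).mean()

# 2) REPA alignment loss computed end-to-end (no stop-gradient):
#    it updates both the denoiser and the VAE.
_, hidden = denoiser(x_t, t, captions, return_hidden=True)
align_loss = repa_loss(hidden, teacher(images), proj_head)

loss = fm_loss + repa_weight * align_loss
loss.backward()
```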
In parallel, Black Forest Labs’ FLUX.2 AE work frames latent design as a trade-off between learnability, quality, and compression. Their core argument is that improving learnability requires injecting semantic structure into the representation, rather than treating the tokenizer as a pure compression module. This motivates retraining the latent space to explicitly target “better learnability and better image quality at the same time”. They don’t share the full recipe, but they do clearly state the key idea: make the AE’s latent space more learnable by adding semantic or representation alignment, and they explicitly point to REPA-style alignment with a frozen vision encoder as the mechanism they build on and integrate into the FLUX.2 AE.
What we observed
To probe alignment in the latent space, we compared two pretrained autoencoders as drop-in tokenizers for the same flow backbone: a REPA-E-VAE (where we do add the REPA alignment objective, as in the paper) and the Flux2-AE (where we don’t add REPA, following their advice). The results were, honestly, extremely impressive, both quantitatively and qualitatively. In samples, the gap is immediately visible: generations show more coherent global structure and cleaner layouts, with far fewer “early training” artifacts.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Flux2-AE | 12.07 | 0.09 | 0.08 | 1.79 |
| REPA-E-VAE | 12.08 | 0.26 | 0.18 | 3.39 |
A first striking point is that both latent-space interventions lower the FID by ~6 points (18.20 to ~12.08), which is a much larger jump than what we typically get from “just” aligning intermediate features. This strongly supports the core idea: if the tokenizer produces a representation that is intrinsically more learnable, the flow model benefits everywhere.
The two AEs then behave quite differently in the details. Flux2-AE dominates most metrics (very low CMMD and DINO-MMD), but it comes with a large throughput penalty: batches/sec drops from 3.95 to 1.79. In our case this slowdown is explained by practical factors they also emphasize: the model is simply heavier, and it also produces a larger latent (32 channels), which increases the amount of work the diffusion backbone has to do per step.
REPA-E-VAE is the “balanced” option: it reaches essentially the same FID as Flux2-AE while keeping throughput much closer to the baseline (3.39 batches/sec).
Training Objectives: Beyond Vanilla Flow Matching
Architecture gets you capacity, but the training objective is what decides how that capacity is used. In practice, small changes to the loss often have outsized effects on convergence speed, conditional fidelity, and how quickly a model “locks in” global structure. In the sections below, we go through the objectives we tested on top of our baseline rectified flow setup, starting with a simple but surprisingly effective modification: Contrastive Flow Matching.
Contrastive Flow Matching (Stoica et al., 2025)
Flow matching has a nice property in the unconditional case: trajectories are implicitly encouraged to be unique (flows should not intersect). But once we move to conditional generation (class- or text-conditioned), different conditions can still induce overlapping flows, which empirically shows up as “averaging” behavior: weaker conditional specificity and muddier global structure. Contrastive flow matching addresses this directly by adding a contrastive term that pushes conditional flows away from other flows in the batch.

Contrastive flow matching makes class-conditional flows more distinct, reducing the overlap seen in standard flow matching, and produces higher-quality images that better represent each class. Figure from arXiv:2506.05350.
For a given training triplet $(x_0, x_1, c)$, standard conditional flow matching trains the model velocity $v_\theta(x_t, t, c)$ to match the target transport direction $x_1 - x_0$. Contrastive flow matching keeps that positive term, but additionally samples a negative pair $(\tilde{x}_0, \tilde{x}_1, \tilde{c})$ from the batch and penalizes the model if its predicted flow is also compatible with that other trajectory. In the paper’s notation, this becomes:

$$\mathcal{L}_{\Delta\text{FM}} = \mathbb{E}\Big[\big\|v_\theta(x_t, t, c) - (x_1 - x_0)\big\|^2 - \lambda\,\big\|v_\theta(x_t, t, c) - (\tilde{x}_1 - \tilde{x}_0)\big\|^2\Big]$$

where $\lambda$ controls the strength of the “push-away” term. Intuitively: match your own trajectory, and be incompatible with someone else’s.
The authors show that contrastive flow matching produces more discriminative trajectories and that this translates into both quality and efficiency gains: faster convergence (reported up to 9× fewer training iterations to reach similar FID) and fewer sampling steps (reported up to 5× fewer denoising steps) on ImageNet (Deng et al., 2009) and CC3M (Sharma et al., 2018) experiments.
A key advantage is that the objective is almost a drop-in replacement: you keep the usual flow-matching loss, then add a single contrastive “push-away” term using other samples in the same batch as negatives, which provides the extra supervision without introducing additional model passes.
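A minimal sketch of the loss is shown below, using a batch roll to pick negatives from the same batch (the `lam` value and the roll-based negative sampling are implementation choices on our side, not necessarily the paper’s exact code):

```python
import torch

def contrastive_fm_loss(v_pred: torch.Tensor,   # [B, ...] predicted velocity v_theta(x_t, t, c)
                        x0: torch.Tensor,       # [B, ...] clean samples
                        x1: torch.Tensor,       # [B, ...] noise samples
                        lam: float = 0.05) -> torch.Tensor:
    """Flow matching plus a push-away term against another trajectory in the batch."""
    target = x1 - x0                                                     # positive transport direction
    neg_target = torch.roll(x1, 1, dims=0) - torch.roll(x0, 1, dims=0)   # negative pair from the batch

    pos = ((v_pred - target) ** 2).mean()
    neg = ((v_pred - neg_target) ** 2).mean()
    return pos - lam * neg
```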
What we observed
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Contrastive-FM | 20.03 | 0.40 | 0.36 | 3.75 |
In this run, contrastive flow matching yields a small but measurable improvement on the representation-driven metrics: CMMD goes from 0.41 → 0.40 and DINO-MMD from 0.39 → 0.36. The magnitude of the gain is smaller than what the paper reports on ImageNet, which is not too surprising: text conditioning is much more complex than discrete classes, and the training data distribution is likely less “separable” than ImageNet, making the contrastive signal harder to exploit.
We do not see an improvement in FID in this specific experiment (it slightly worsens), but the throughput cost is negligible in practice (3.95 → 3.75 batches/sec). Given the simplicity of the change and the consistent movement in the right direction for the conditioning/representation metrics, we will likely still keep contrastive flow matching in our training pipeline as a low-cost regularizer.
JiT (Li and He, 2025)
Back to Basics: Let Denoising Generative Models Denoise is probably one of our favorite recent papers in the diffusion space because it is not a new trick but a reset: stop asking the network to predict off-manifold quantities (noise or velocity) and just let it denoise.
Most modern diffusion and flow models train the network to predict the noise $x_1$ or a mixed quantity like the velocity $v = x_1 - x_0$. Under the manifold assumption, natural images live on a low-dimensional manifold, while $x_1$ and $v$ are inherently off-manifold, so predicting them can be a harder learning problem than it looks.

Under the manifold assumption, clean images lie on the data manifold while noise and velocity do not. Thus training the model to predict clean images is fundamentally easier than training it to predict noise-like targets. Figure from arXiv:2511.13720.
The authors frame the problem with the standard linear interpolation between the clean image $x_0$ and the noise $x_1$:

$$x_t = (1-t)\,x_0 + t\,x_1$$

and the corresponding flow velocity:

$$v_t = x_1 - x_0$$

Instead of outputting $v_t$ directly, the model predicts a clean image estimate:

$$\hat{x}_0 = f_\theta(x_t, t)$$

and we convert it to a velocity prediction via:

$$\hat{v}_t = \frac{x_t - \hat{x}_0}{t}$$

Then we can keep the exact same flow-style objective in $v$-space:

$$\mathcal{L}(\theta) = \mathbb{E}_{x_0, x_1, t}\big[\,\|\hat{v}_t - v_t\|^2\,\big]$$

This formulation makes the training problem substantially easier in high dimensions: instead of predicting noise or velocity (which are essentially unconstrained in pixel space), the network predicts the clean image $x_0$, i.e., something that lies on the data manifold. In practice, this makes it feasible to train large-patch Transformers directly on pixels without a VAE or tokenizer while keeping optimization stable and the total number of tokens manageable.
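A minimal sketch of the x-prediction objective under the interpolation above (the `model` call signature is a placeholder and text conditioning is omitted for brevity; a small epsilon guards the division as t approaches 0):

```python
import torch

def x_pred_fm_loss(model, x0: torch.Tensor, x1: torch.Tensor, t: torch.Tensor, eps: float = 1e-4):
    """Train the model to predict the clean image, but keep the loss in v-space.

    x0: clean images (or latents), x1: noise, t: timesteps in (0, 1], shape [B].
    """
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))          # broadcast t over spatial dims
    x_t = (1 - t_) * x0 + t_ * x1                     # linear interpolation
    v_target = x1 - x0                                # flow velocity

    x0_hat = model(x_t, t)                            # model predicts the clean image
    v_hat = (x_t - x0_hat) / t_.clamp(min=eps)        # convert x-prediction to a velocity

    return ((v_hat - v_target) ** 2).mean()
```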
What we observed
We first evaluated x-prediction in the same setting as the rest of our objective experiments, namely training in the FLUX latent space at 256×256 resolution.

| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| X-Pred | 16.80 | 0.54 | 0.49 | 3.95 |
In this regime, the benefit of x-prediction is unclear. While FID improves slightly compared with the baseline, both CMMD and DINO-MMD degrade noticeably, and throughput is unchanged. This suggests that, when working in an already well-structured latent space, predicting clean images instead of velocity does not consistently dominate the baseline objective, and can even hurt representation-level alignment.
That said, this experiment is not where x-prediction really shines.
The exciting part is that x-prediction stabilizes high-dimensional training, making it feasible to use larger patches and denoise directly in pixel space, without a VAE, at much higher resolutions. Using JiT, we trained a model directly on 1024×1024 images with 32×32 patches, instead of operating in a compressed latent space. Despite the much higher resolution and the absence of a tokenizer, optimization remained stable and fast. We reached FID 17.42, DINO-MMD 0.56, and CMMD 0.71 with a throughput of 1.33 batches/sec.
These results are remarkable: training directly on 1024×1024 images is only about 3× slower than training in a 256×256 latent space, while operating on raw pixels. This strongly supports the core claim of Back to Basics: letting the model predict clean images makes the training problem significantly easier, and opens the door to high-resolution, tokenizer-free text-to-image training without prohibitive compute costs.
As a result, we plan to use this formulation as the backbone of our upcoming speedrun experiments, to see how far we can push it when combined with the other efficiency and sparsification techniques discussed in this post. The main downside for now is that this approach does not allow us to benefit from the very nice properties of the FLUX.2 VAE; exploring whether some form of alignment or hybrid training can make these two worlds compatible is an open direction we plan to investigate further.
Token Routing and Sparsification to Reduce Compute Costs
So far, most of the techniques we discussed focus on making each training step more effective: improving the objective, shaping the representations, or accelerating convergence. The next lever is orthogonal: make each step cheaper.
For diffusion and flow transformers, the dominant cost is running deep transformer stacks over a large set of image/latent tokens, where attention scales poorly with sequence length. Token sparsification methods target this directly by ensuring that only a subset of tokens pays the full compute price in the expensive parts of the network, while still preserving enough information flow to keep quality high.
Most masking approaches speed up training by removing tokens from the forward pass, then asking the model to hallucinate the missing content from learned placeholders. That works surprisingly well, but it violates the spirit of iterative denoising: instead of refining all of the content at each step, we are reconstructing parts from scratch.
Two recent papers illustrate a cleaner alternative: instead of deleting information, they reorganize where compute is spent. TREAD and SPRINT share the same high-level objective of avoiding full-depth computation for every token at every layer, but they pursue it through complementary strategies.
TREAD’s (Krause et al., 2025) core idea is to replace compute reduction through information loss, such as dropping or masking tokens, with compute reduction through information transport using token routing. It introduces a route: for each training sample, it randomly selects a fraction of tokens that temporarily bypass a contiguous chunk of layers and are re-injected later. Tokens are not discarded; instead, they avoid paying the cost of full depth.
Concretely, for a denoiser with a stack of blocks $B_1, \dots, B_L$, TREAD defines a route (start layer $\ell_s$, end layer $\ell_e$). A subset of tokens follows the low-cost path (identity) across $B_{\ell_s}, \dots, B_{\ell_e}$, while the rest follows the normal full computation. The two streams then merge again at $B_{\ell_e}$.
In practice, the paper shows that routing up to 50% of tokens stays effective, while higher rates begin to degrade quality.

TREAD enhances training efficiency by routing tokens around certain layers. Figure from arXiv:2501.04765.
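A minimal sketch of TREAD-style routing inside a transformer forward pass, under our reading of the paper (the block structure and the random token selection are simplified; `blocks` is a plain list of transformer blocks):

```python
import torch

def forward_with_routing(blocks, tokens, route=(4, 12), route_ratio=0.5):
    """TREAD-style routing: a random token subset bypasses blocks[start:end] via an identity path.

    blocks: list of transformer blocks; tokens: [B, N, D].
    route_ratio: fraction of tokens routed around the middle blocks.
    """
    b, n, d = tokens.shape
    start, end = route
    n_route = int(route_ratio * n)

    # Per-sample random split: routed tokens vs. tokens that keep paying full depth.
    perm = torch.rand(b, n, device=tokens.device).argsort(dim=1)
    route_idx = perm[:, :n_route].unsqueeze(-1).expand(-1, -1, d)
    keep_idx = perm[:, n_route:].unsqueeze(-1).expand(-1, -1, d)

    routed = None
    for i, block in enumerate(blocks):
        if i == start:
            routed = torch.gather(tokens, 1, route_idx)   # saved, untouched (identity path)
            tokens = torch.gather(tokens, 1, keep_idx)    # shorter sequence -> cheaper middle blocks
        tokens = block(tokens)
        if i == end - 1 and routed is not None:
            # Re-inject routed tokens at their original positions before the remaining dense blocks.
            merged = torch.empty(b, n, d, device=tokens.device, dtype=tokens.dtype)
            merged.scatter_(1, keep_idx, tokens)
            merged.scatter_(1, route_idx, routed)
            tokens, routed = merged, None

    return tokens
```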
SPRINT (Park et al., 2025) extends this approach by introducing sparsity in the most computationally expensive parts of the network, while preserving a dense information pathway. Its recipe is intentionally structured: run dense early layers over all tokens to build reliable low-level features, then keep only a subset of tokens through the sparse middle layers where compute is heaviest, and finally go dense again by re-expanding and fusing the sparse deep features with a dense residual stream from the early layers, before producing the output. The key distinction from TREAD is where robustness comes from: TREAD keeps tokens “present” but shallower (routing), whereas SPRINT allows many tokens to be absent in the middle blocks, relying on the dense residual path to preserve full-resolution information. That is what enables more aggressive sparsification in practice: the paper explores drop ratios around 75%, versus ~50% for TREAD.

SPRINT goes beyond TREAD by dropping most tokens in the middle layers while keeping a dense residual path to preserve full-resolution information. Figure from arXiv:2510.21986.
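And a minimal sketch of the SPRINT-style dense-sparse-dense pattern as we understand it (the fusion step is simplified to adding the dense residual; real implementations may differ):

```python
import torch

def sprint_forward(dense_early, sparse_middle, dense_late, tokens, drop_ratio=0.75):
    """Dense early blocks -> sparse middle blocks on a token subset -> dense late blocks.

    dense_early / sparse_middle / dense_late: lists of transformer blocks; tokens: [B, N, D].
    """
    b, n, d = tokens.shape

    # 1) Dense early layers over all tokens build reliable low-level features.
    for block in dense_early:
        tokens = block(tokens)
    dense_residual = tokens                              # full-resolution information lives here

    # 2) Keep only a subset of tokens through the expensive middle layers.
    n_keep = int((1 - drop_ratio) * n)
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
    sparse = torch.gather(tokens, 1, keep_idx)
    for block in sparse_middle:
        sparse = block(sparse)

    # 3) Re-expand the sparse deep features and fuse them with the dense residual stream.
    expanded = torch.zeros_like(dense_residual)
    expanded.scatter_(1, keep_idx, sparse)
    tokens = dense_residual + expanded                   # simplified fusion

    for block in dense_late:
        tokens = block(tokens)
    return tokens
```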
What we observed
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| TREAD | 21.61 | 0.55 | 0.41 | 4.11 |
| SPRINT | 22.56 | 0.72 | 0.42 | 4.20 |
Under our standard 256×256 latent setup, both methods deliver the primary benefit we were targeting: TREAD goes from 3.95 → 4.11 batches/sec, and SPRINT pushes it a bit further to 4.20 batches/sec. The price is that, under our evaluation protocol, this extra throughput comes with a clear loss in quality: FID rises from 18.20 to 21.61 (TREAD) and 22.56 (SPRINT), with the same pattern observed in CMMD and DINO-MMD.
Taken at face value, routing yields a modest ~4–6% throughput gain, but it comes with worse metrics on this benchmark, with SPRINT (the more aggressive scheme) degrading quality slightly more than TREAD.
One important caveat is that token-sparse / routed models tend to score worse under vanilla Classifier-Free Guidance (CFG), and this effect is likely amplified here because these runs are still relatively undertrained in our setting. The authors of Guiding Token-Sparse Diffusion Models (Krause et al., 2025) argue this is partly an evaluation mismatch: routing changes the model’s effective capacity, and plain “conditional vs. unconditional” CFG often becomes less effective, which can artificially reduce quality. We deliberately did not use specialized guidance schemes, to keep our benchmark consistent across methods, and at this stage it may also not be very meaningful to treat the sparse model as a “bad version of itself” for guidance. As a result, we consider these numbers directionally useful, but pessimistic and worth interpreting with caution.
At 256×256, routing only gave modest gains because the model processes relatively few tokens. At 1024×1024, the picture changes completely. With 1024 tokens, routing finally targets the dominant cost, and the results are striking.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 17.42 | 0.71 | 0.56 | 1.33 |
| TREAD | 14.10 | 0.46 | 0.37 | 1.64 |
| SPRINT | 16.90 | 0.51 | 0.41 | 1.89 |
Both TREAD and SPRINT deliver large throughput gains over the dense baseline, with SPRINT pushing speed the furthest. More importantly, this time the gains do not come at the expense of quality, quite the opposite. TREAD particularly stands out, with a dramatic drop in FID (17.42 → 14.10) alongside strong improvements in CMMD and DINO-MMD. SPRINT is slightly more aggressive and a bit noisier in quality, but still clearly improves over the baseline while being the fastest option.
In short, this is the regime where token routing really shines: high resolution, many tokens, and JiT-style pixel-space training. Here, routing is no longer a marginal optimization; it is a major lever that improves both how fast and how well the model trains.
Data
After covering representation alignment, the core training objective, and token routing, we turn to the fourth axis that consistently mattered in practice: data. We found that the choice of training data, including the way it is described through captions, can influence the trajectory of a training run as much as optimization techniques. Below are three concrete data experiments that consistently moved the needle in our setup.
Long vs. Short Captions
Captions are an essential part of the training set: for a text-to-image model, they are not just metadata, they are the supervision. The DALL·E 3 (Betker et al., 2023) research paper showed that richer captions can be one of the strongest levers for improving training signal and prompt-following.
To isolate the effect in our setup, we kept everything else fixed and changed only the caption style, comparing:
- Long, descriptive captions (our baseline): multi-clause captions that mention composition, attributes, lighting, materials, and relationships.

  Example: “A photograph depicts a fluffy lop-eared rabbit sitting on a weathered wooden surface outdoors. The rabbit is predominantly white with patches of light brown and tan fur, particularly on its head and ears. Its ears droop noticeably, and its fur appears soft and thick. The rabbit’s eyes are dark and expressive. It is positioned slightly off-center, facing towards the left of the frame. Behind the rabbit, slightly out of focus, is a miniature dark red metal wheelbarrow. A partially visible orange apple sits to the left of the rabbit. Fallen autumn leaves, predominantly reddish-brown, are scattered around the rabbit and apple on the wooden surface. The background is a blurred but visible expanse of green grass, suggesting an outdoor setting. The lighting is soft and natural, likely diffused daylight, casting no harsh shadows. The overall atmosphere is calm, peaceful, and autumnal. The aesthetic is rustic and charming, with a focus on the rabbit as the main subject. The color palette is muted and natural, consisting mainly of whites, browns, oranges, and greens. The style is naturalistic and straightforward, without any overt artistic manipulation. The vibe is gentle and heartwarming.”

- Short, one-line captions: minimal descriptions with much less structure.

  Example: “A rabbit sitting on top of a wooden table.”
What we observed
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 | 3.95 |
| Short-Captions | 36.84 | 0.98 | 1.14 | 3.95 |
The result was unambiguous: switching to short captions severely hurt convergence across all metrics.
Long captions provide a richer supervision signal: beyond prompt adherence, there is a very practical optimization reason. More tokens often means more information, and therefore more learning signal for the denoiser. When the conditioning text specifies composition, attributes, lighting, materials, and relationships, the model gets a sharper “target” for what the denoising trajectory should preserve and refine, especially early in training.
The fun paradox is that this extra detail often makes the training problem easier, not harder: intuitively, one might expect longer prompts, with more attributes, constraints, and relationships, to increase complexity and burden the model. In practice, the opposite happens. Short captions leave many degrees of freedom unspecified, forcing the model to learn under ambiguity and implicitly average across multiple plausible interpretations. Long captions collapse that uncertainty by turning implicit decisions into explicit constraints, allowing the denoiser to focus its capacity on refining a well-posed solution instead of guessing what matters.
Long captions are a strong training-time accelerator, but we still want the model to behave well on short prompts because that is how people actually use these systems. A simple workaround is to finish training with a short fine-tuning stage on a mixture of long and short captions. That keeps the benefits of rich supervision early, while teaching the model to stay robust when conditioning is sparse.
Bootstrapping With Synthetic Images
Another data-related research question we explored is whether a low-cost synthetic corpus can speed up early training compared with a real corpus of similar size. For this benchmark, we trained a model on a dataset of real images collected from Pexels and compared it with our baseline, which was trained on synthetic data generated with MidjourneyV6; both datasets contain around 1M images.
We evaluated both runs against the same Unsplash reference set, composed exclusively of real images.

| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ | batches/sec ↑ |
|---|---|---|---|---|
| Synthetic images | 18.20 | 0.41 | 0.39 | 3.95 |
| Real images | 16.6 | 0.5 | 0.46 | 3.95 |
The synthetic-trained model scores better on CMMD and DINO-MMD, while the model trained on real images achieves a lower FID. Rather than a contradiction, this split mostly reflects what these metrics emphasize.
FID is especially sensitive to low-level image statistics: fine textures, high-frequency detail, noise patterns, and the subtle irregularities of real photography. Since our evaluation reference consists of real images, a model trained on real photos naturally matches those statistics more closely, which translates into a better (lower) FID. Synthetic images, in contrast, often exhibit slightly different high-frequency signatures (cleaner edges, smoother micro-textures, more uniform noise) that are barely noticeable qualitatively but still get penalized by distributional metrics like FID.
Qualitatively, this difference is easy to spot. Models trained on synthetic data tend to produce images with cleaner global structure and stronger compositional and object coherence, but also exhibit a more synthetic appearance, characterized by smoother textures and reduced photographic noise. In contrast, models trained on real images better capture the irregular, fine-grained textures typical of natural photographs, though they often require more training to achieve comparable global structure.
One plausible explanation for why synthetic data is so effective early on is that it exposes the model to a wider range of compositional collisions: unusual pairings of objects, attributes, styles, and viewpoints that rarely co-occur in natural datasets. While this can hurt realism at the texture level, it forces the model to explain a broader space of combinations, which appears to help with early disentanglement and structure learning.
Taken together, this suggests a simple but practical strategy: synthetic data is an effective way to bootstrap training and lock in global structure quickly, while real images remain important later on if matching photographic texture statistics is the priority.
SFT With Alchemist: Small Dataset, Real Impact
Finally, we experimented with a targeted Supervised Fine-Tuning (SFT) pass using Alchemist (Startsev et al., 2025), a compact dataset explicitly curated for high impact. Alchemist is small by design (3,350 image–text pairs), but is built through a sophisticated curation pipeline that starts from a web-scale pool and progressively distills it down to visually exceptional samples.
In our setup, we fine-tuned our preview models for 20K steps on Alchemist. Despite the dataset’s small size, it had an outsized effect: it adds a distinct “style layer” with better composition, more photographic polish, and richer scenes, without a clear impact on generalization.
The samples below show a side-by-side comparison of generations from the same base model, before and after the Alchemist fine-tuning pass.
More Useful Tips for Training
Last but not least, we will briefly cover two practical training details that turned out to matter more than we expected.
These details are easily missed, and in our case they had a clear impact on convergence and final image quality.
Muon Optimizer
We generally default to AdamW for our benchmarks because it is predictable and easy to compare across runs. However, we have recently seen renewed interest in optimizers that try to behave more like a good preconditioner without the full overhead of second-order methods. One recent example is Muon (Jordan et al., 2024), which, at a high level, tries to improve optimization by applying a better-conditioned update step, often translating into faster convergence and cleaner progress early in training.
In our setup, Muon was one of the rare cases in which a change of optimizer produced an immediately observable effect on the metrics.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ |
|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 |
| Muon | 15.55 | 0.36 | 0.35 |
For this experiment, we used the official PyTorch implementation of Muon, which at the moment supports Distributed Data Parallel (DDP) training only. If you are running Fully Sharded Data Parallel (FSDP), there are community variants available, for example here.
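The usual pattern with Muon is to apply it only to the 2-D weight matrices of the transformer body and keep AdamW for embeddings, norms, biases, and other 1-D parameters. Below is a minimal sketch of that parameter split; the `Muon` import, its constructor arguments, and the learning rates are assumptions that depend on the specific implementation and version you use:

```python
import torch
from muon import Muon  # assumed import from the official repo; the exact API may differ by version

# Muon targets the 2-D weight matrices of the transformer body; embeddings, norms,
# biases (and, in practice, the input/output layers) stay on AdamW.
muon_params  = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
adamw_params = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]

optimizers = [
    Muon(muon_params, lr=2e-2, momentum=0.95),  # hyperparameters are illustrative
    torch.optim.AdamW(adamw_params, lr=1e-4, betas=(0.9, 0.95), eps=1e-15, weight_decay=0.0),
]

# In the training loop:
# for opt in optimizers:
#     opt.step()
#     opt.zero_grad(set_to_none=True)
```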
While we refrain from drawing broad conclusions from a single benchmark, these results indicate that optimizer choice extends beyond stability considerations and can yield tangible gains in time-to-quality.
Precision Gotcha: Casting vs. Storing weights in BF16
We eventually identified an error in our setup where the denoiser weights were mistakenly stored in bfloat16 for a period of time.
To be clear, using BF16 autocast is fine. Running the forward and backward passes in BF16 or mixed precision is standard and typically what you want for speed and memory. The issue arises from keeping the parameters themselves in BF16 precision, which negatively impacts numerically sensitive operations.
In practice, some layers and operations are much less tolerant to reduced parameter precision:
- normalization layers (e.g. LayerNorm / RMSNorm statistics),
- attention softmax/logits paths,
- RoPE,
- optimizers’ internal state / update dynamics.
| Method | FID ↓ | CMMD ↓ | DINO-MMD ↓ |
|---|---|---|---|
| Baseline | 18.20 | 0.41 | 0.39 |
| BF16 weights (bug) | 21.87 | 0.61 | 0.57 |
So the rule we now follow very strictly is: use BF16 autocast for compute, but keep weights (and optimizer state) in FP32, or at least ensure numerically sensitive modules stay in FP32.
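A minimal sketch of the pattern we follow now, using standard PyTorch mixed precision (the model, dataloader, and `compute_loss` names are placeholders):

```python
import torch

model = model.float()                      # parameters (and optimizer state) stay in FP32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    # Compute in BF16 via autocast; the weights remain FP32 master copies.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = compute_loss(model, batch)  # placeholder loss function
    loss.backward()                        # gradients accumulate into FP32 parameters
    optimizer.step()

# The bug: model.to(torch.bfloat16) stored the weights themselves in BF16, silently
# degrading numerically sensitive ops (norm statistics, RoPE, optimizer updates).
```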
It is not a glamorous trick, but it is exactly the kind of “silent” detail that can cost you multiple days of work if you don’t notice it early.
Summary
We ran a systematic set of ablations on PRX training, comparing a range of optimization, representation, efficiency, and data choices against a clean flow-matching baseline, using both quality metrics and throughput.
The largest gains came from alignment: REPA boosts early convergence (best used as a burn-in, then turned off), and better latents/tokenizers (REPA-E / FLUX2-AE) give a large jump in quality with clear speed trade-offs. Objective tweaks were mixed: contrastive FM helped slightly, while x-prediction mattered most by enabling stable 1024² pixel training. Token routing (TREAD/SPRINT) is minor at 256² but becomes a major win at high resolution. Data and practical details also mattered: long captions are critical, synthetic vs. real data trades texture for structure, a small SFT pass adds polish, Muon helped, and BF16-stored weights quietly hurt training.
That’s it for Part 2! If you want to play with an earlier public checkpoint from this series, the PRX-1024 T2I beta is still available here.
We are really excited about what’s next: in the coming weeks we will release the full source code of the PRX training framework, and we will do a public 24-hour “speedrun” where we combine the best ideas from this post into a single run and see how far the full recipe can go in one day.
If you made it this far, first of all thank you very much for your interest. We would also love to have you join our Discord community, where we discuss PRX progress and results, along with everything related to diffusion and text-to-image models.















