Since my recent coverage of the growth in hobbyist Hunyuan Video LoRAs (small, trained files that can inject custom personalities into multi-billion-parameter text-to-video and image-to-video foundation models), the number of related LoRAs available at the Civit community has risen by 185%.
Source: https://civitai.com/
The same community that is scrambling to learn how to produce these ‘add-on personalities’ for Hunyuan Video (HV) is also agitating for the promised release of an image-to-video (I2V) functionality in Hunyuan Video.
In regard to open source human image synthesis, this is a big deal; combined with the growth of Hunyuan LoRAs, it could enable users to transform photos of people into videos in a way that does not erode their identity as the video develops – which is currently the case in all state-of-the-art image-to-video generators, including Kling, Kaiber, and the much-celebrated RunwayML:
Source: https://app.runwayml.com/
By developing a custom LoRA for the personality in question, one could, in a HV I2V workflow, use a real photo of them as a starting point. This is a far better ‘seed’ than sending a random number into the model’s latent space and settling for whatever semantic scenario results. One could then use the LoRA, or multiple LoRAs, to maintain consistency of identity, hairstyles, clothing and other pivotal aspects of a generation.
Potentially, the availability of such a combination could represent one of the most epochal shifts in generative AI since the launch of Stable Diffusion, with formidable generative power handed over to open source enthusiasts, without the regulation (or ‘gatekeeping’, if you prefer) provided by the content censors in the current crop of popular gen-vid systems.
As I write, Hunyuan image-to-video remains an unticked ‘to do’ in the Hunyuan Video GitHub repo, with the hobbyist community reporting (anecdotally) a Discord comment from a Hunyuan developer, who apparently stated that the release of this functionality has been pushed back to some time later in Q1, due to the model being ‘too uncensored’.

Source: https://github.com/Tencent/HunyuanVideo?tab=readme-ov-file#-open-source-plan
Accurate or not, the repo developers have substantially delivered on the rest of the Hunyuan checklist, and therefore Hunyuan I2V seems set to arrive eventually, whether censored, uncensored, or in some way ‘unlockable’.
But as we can see in the list above, the I2V release is apparently a separate model entirely – which makes it quite unlikely that any of the current burgeoning crop of HV LoRAs at Civit and elsewhere will function with it.
In this (by now) predictable scenario, LoRA training frameworks such as Musubi Tuner and OneTrainer will either be set back or reset in regard to supporting the new model. Meanwhile, one or two of the most tech-savvy (and entrepreneurial) YouTube AI luminaries will ransom their solutions via Patreon until the scene catches up.
Upgrade Fatigue
Almost no-one experiences upgrade fatigue as keenly as a LoRA or fine-tuning enthusiast, because the rapid and competitive pace of change in generative AI encourages model foundries such as Stability.ai, Tencent and Black Forest Labs to produce bigger and (sometimes) better models at the maximum viable frequency.
Since these new-and-improved models will at the very least have different biases and weights, and more commonly will have a different scale and/or architecture, this means that the fine-tuning community has to get their datasets out again and repeat the grueling training process for the new version.
For this reason, a multiplicity of Stable Diffusion LoRA version types are available at Civit:

Since none of these lightweight LoRA models are interoperable with higher or lower model versions, and since many of them have dependencies on popular large-scale merges and fine-tunes that adhere to an older model, a significant portion of the community tends to stick with a ‘legacy’ release, in much the same way as customer loyalty to Windows XP endured for years after official support ended.
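The incompatibility is mechanical as much as semantic: a LoRA’s low-rank factors are shaped to the exact layer dimensions of the base model it was trained against. A minimal sketch (with illustrative layer widths, not the models’ actual dimensions) shows why a v1.5-era adapter simply cannot be loaded into a larger architecture:

```python
import numpy as np

# Illustrative widths only: the '768' and '2048' here stand in for the
# differing feature dimensions of two base-model generations.
SD15_DIM, SDXL_DIM, RANK = 768, 2048, 16

# A LoRA stores two low-rank factors per adapted layer: A (down) and B (up).
lora_A = np.random.randn(RANK, SD15_DIM) * 0.01   # trained against the old base
lora_B = np.zeros((SD15_DIM, RANK))               # zero-initialized up-projection

def apply_lora(base_weight, A, B, scale=1.0):
    """Add the low-rank update B @ A to a frozen base weight matrix."""
    if (B.shape[0], A.shape[1]) != base_weight.shape:
        raise ValueError(
            f"LoRA shaped for {B.shape[0]}x{A.shape[1]}, "
            f"base layer is {base_weight.shape[0]}x{base_weight.shape[1]}"
        )
    return base_weight + scale * (B @ A)

sd15_layer = np.random.randn(SD15_DIM, SD15_DIM)
sdxl_layer = np.random.randn(SDXL_DIM, SDXL_DIM)

apply_lora(sd15_layer, lora_A, lora_B)       # fits: matching architecture
try:
    apply_lora(sdxl_layer, lora_A, lora_B)   # fails: different base model
except ValueError as e:
    print("incompatible:", e)
```

Even when dimensions happen to match, the learned update is entangled with the old model’s weights, so a shape check is only the first of several barriers.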
Adapting to Change
This subject comes to mind because of a new paper from Qualcomm AI Research that claims to have developed a method whereby existing LoRAs can be ‘upgraded’ to a newly-released model version.

Source: https://arxiv.org/pdf/2501.16559
This does not mean that the new approach, titled LoRA-X, can translate freely between all models of the same type (i.e., text-to-image models, or Large Language Models [LLMs]); however, the authors have demonstrated an effective transliteration of a LoRA from Stable Diffusion v1.5 > SDXL, and a conversion of a LoRA for the text-based TinyLlama 3T model to TinyLlama 2.5T.
LoRA-X transfers LoRA parameters across different base models by preserving the adapter within the source model’s subspace; but only in parts of the model that are adequately similar across model versions.

While this offers a practical solution for scenarios where retraining is undesirable or impossible (such as a change of license on the original training data), the method is restricted to similar model architectures, among other limitations.
Though this is a rare foray into an understudied field, we won’t examine the paper in depth, because of LoRA-X’s numerous shortcomings, as evidenced by comments from its critics and advisors at Open Review.
The method’s reliance on subspace similarity restricts its application to closely related models, and the authors have conceded in the review forum that LoRA-X cannot be easily transferred across significantly different architectures.
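As a rough illustration of the subspace idea (a simplified reading, not the authors’ actual algorithm), a source model’s adapter update can be projected onto the dominant singular subspace of the corresponding target-model layer, so that the transferred update lives entirely in directions the target model already represents:

```python
import numpy as np

def subspace_transfer(delta_W_src, W_tgt, rank=32):
    """Project a source-model weight update onto the top-`rank`
    singular subspace of the target model's layer weights.
    Illustrative sketch of subspace preservation, not LoRA-X itself."""
    U, S, Vt = np.linalg.svd(W_tgt, full_matrices=False)
    U_r, Vt_r = U[:, :rank], Vt[:rank, :]
    # Project onto span(U_r) from the left and span(V_r) from the right.
    return U_r @ (U_r.T @ delta_W_src @ Vt_r.T) @ Vt_r

rng = np.random.default_rng(0)
W_tgt = rng.standard_normal((512, 512))          # target layer weights
delta = rng.standard_normal((512, 512)) * 0.01   # source LoRA update (B @ A)
delta_t = subspace_transfer(delta, W_tgt, rank=32)
print(np.linalg.matrix_rank(delta_t))            # at most 32, by construction
```

The projection is lossy: whatever the source adapter learned outside the target’s dominant subspace is discarded, which is one intuition for why the technique degrades as architectures diverge.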
Other PEFT Approaches
The possibility of making LoRAs more portable across versions is a small but interesting strand of study in the literature, and the chief contribution that LoRA-X makes to this pursuit is its contention that it requires no training. This is not strictly true, if one reads the paper, but it does require the least training of all the prior methods.
LoRA-X is another entry in the canon of Parameter-Efficient Fine-Tuning (PEFT) methods, which address the challenge of adapting large pre-trained models to specific tasks without extensive retraining. This conceptual approach aims to modify a minimal number of parameters while maintaining performance.
Notable amongst these are:
X-Adapter
The X-Adapter framework transfers fine-tuned adapters across models with a certain amount of retraining. The system aims to enable pre-trained plug-and-play modules (such as ControlNet and LoRA) from a base diffusion model (i.e., Stable Diffusion v1.5) to work directly with an upgraded diffusion model such as SDXL without retraining – effectively acting as a ‘universal upgrader’ for plugins.
The system achieves this by training an additional network that controls the upgraded model, using a frozen copy of the base model to preserve plugin connectors:

Source: https://arxiv.org/pdf/2312.02238
X-Adapter was originally developed and tested to transfer adapters from SD1.5 to SDXL, while LoRA-X offers a greater variety of transliterations.
DoRA (Weight-Decomposed Low-Rank Adaptation)
DoRA is an enhanced fine-tuning method that improves upon LoRA by using a weight decomposition strategy that more closely resembles full fine-tuning:

Source: https://arxiv.org/pdf/2402.09353
DoRA focuses on improving the fine-tuning process itself, by decomposing the model’s weights into magnitude and direction (see image above). By contrast, LoRA-X focuses on enabling the transfer of existing fine-tuned parameters between different base models.
However, the LoRA-X approach adapts techniques developed for DoRA, and in tests against this older system claims an improved DINO score.
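The magnitude/direction split at the heart of DoRA can be sketched in a few lines. This is a simplified reading: the per-column magnitude vector is treated here as a free parameter, standing in for the magnitudes DoRA actually learns alongside a LoRA-style directional update:

```python
import numpy as np

def dora_reparameterize(W0, delta_V, m):
    """Weight decomposition in the spirit of DoRA: the adapted weight
    is a learned per-column magnitude `m` applied to the *direction*
    of (W0 + delta_V), so magnitude and direction are tuned separately."""
    V = W0 + delta_V                                # direction component (e.g. from LoRA)
    col_norms = np.linalg.norm(V, axis=0, keepdims=True)
    return m * (V / col_norms)                      # rescale each unit column by m

rng = np.random.default_rng(1)
W0 = rng.standard_normal((8, 8))

# Sanity check: with no directional update and the original column
# magnitudes, the decomposition reconstructs W0 exactly.
m0 = np.linalg.norm(W0, axis=0, keepdims=True)
W = dora_reparameterize(W0, delta_V=np.zeros_like(W0), m=m0)
print(np.allclose(W, W0))  # True
```

Splitting the update this way lets the optimizer change how strongly a column fires (magnitude) independently of what it responds to (direction), which the DoRA authors argue better matches the dynamics of full fine-tuning.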
FouRA (Fourier Low Rank Adaptation)
Published in June of 2024, the FouRA method comes, like LoRA-X, from Qualcomm AI Research, and even shares some of its testing prompts and themes.

Source: https://arxiv.org/pdf/2406.08798
FouRA focuses on improving the diversity and quality of generated images by adapting LoRA in the frequency domain, using a Fourier transform approach.
Here, again, LoRA-X was able to achieve better results than the Fourier-based approach of FouRA.
Though both frameworks fall within the PEFT category, they have very different use cases and approaches; in this case, FouRA is arguably ‘making up the numbers’ for a testing round with limited like-for-like rivals for the new paper’s authors to engage with.
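FouRA’s core move – applying the low-rank update in the frequency domain rather than directly on features – can be sketched as follows. This is a simplified reading of the idea, not the paper’s exact formulation:

```python
import numpy as np

def frequency_domain_adapter(x, A, B, scale=1.0):
    """Simplified frequency-domain low-rank adapter: move the input
    features into the frequency domain with an FFT, apply a LoRA-style
    low-rank update there, then transform back to the feature domain."""
    x_f = np.fft.fft(x, axis=-1)                   # to frequency domain
    delta = (x_f @ A.T) @ B.T                      # low-rank update, as in LoRA
    return np.fft.ifft(x_f + scale * delta, axis=-1).real

rng = np.random.default_rng(2)
d, r = 64, 4
x = rng.standard_normal((10, d))                   # a batch of feature vectors
A = rng.standard_normal((r, d)) * 0.01             # down-projection
B = np.zeros((d, r))                               # zero-init up-projection, as in LoRA
out = frequency_domain_adapter(x, A, B)
print(np.allclose(out, x))                         # True: zero-init B is a no-op
```

Operating in the frequency domain gives the adapter a global view of each feature vector, which the FouRA authors link to reduced mode collapse and better sample diversity.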
SVDiff
SVDiff also has different goals to LoRA-X, but is strongly leveraged in the new paper. SVDiff is designed to improve the efficiency of diffusion-model fine-tuning, and directly modifies values within the model’s weight matrices while keeping the singular vectors unchanged. SVDiff uses truncated SVD, modifying only the largest singular values, to adjust the model’s weights.
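The truncated-SVD mechanism can be sketched as follows; the `delta_sigma` vector here stands in for the small set of spectral shifts that SVDiff actually learns during fine-tuning:

```python
import numpy as np

def svdiff_style_update(W, delta_sigma, k):
    """SVDiff-style update (sketch): shift only the top-k singular
    values of a weight matrix, keeping the singular vectors frozen."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S_new = S.copy()
    S_new[:k] = np.maximum(S[:k] + delta_sigma, 0.0)  # clamp to stay non-negative
    return (U * S_new) @ Vt                           # recompose with shifted spectrum

rng = np.random.default_rng(3)
W = rng.standard_normal((16, 16))

# Sanity check: zero shifts recover the original weights.
W_same = svdiff_style_update(W, delta_sigma=np.zeros(4), k=4)
print(np.allclose(W_same, W))  # True
```

Because only `k` scalars per matrix are trained, the resulting parameter space is far more compact than a full LoRA’s low-rank factors.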
SVDiff also uses a data augmentation technique called Cut-Mix-Unmix:

Source: https://arxiv.org/pdf/2303.11305
Cut-Mix-Unmix is designed to help the diffusion model learn multiple distinct concepts without intermingling them. The central idea is to take images of different subjects and concatenate them into a single image. The model is then trained with prompts that explicitly describe the separate elements in the image. This forces the model to recognize and preserve distinct concepts instead of blending them.
During training, an additional regularization term helps prevent cross-subject interference. The authors contend that this facilitates improved multi-subject generation, where each element remains visually distinct, rather than being fused together.
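A toy sketch of the Cut-Mix-Unmix data construction (the prompt template here is illustrative, not taken from the paper):

```python
import numpy as np

def cut_mix_unmix(img_a, img_b, subject_a, subject_b):
    """Concatenate two subject images side by side and pair the result
    with a prompt that names both subjects explicitly, so the model is
    pushed to keep the two concepts spatially and semantically distinct."""
    assert img_a.shape[0] == img_b.shape[0], "image heights must match"
    mixed = np.concatenate([img_a, img_b], axis=1)   # side-by-side composite
    prompt = f"photo of {subject_a} on the left and {subject_b} on the right"
    return mixed, prompt

# Two stand-in 64x64 RGB images.
a = np.zeros((64, 64, 3), dtype=np.uint8)
b = np.full((64, 64, 3), 255, dtype=np.uint8)
img, prompt = cut_mix_unmix(a, b, "a cat", "a dog")
print(img.shape)   # (64, 128, 3)
print(prompt)
```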
SVDiff, excluded from the LoRA-X testing round, aims to create a compact parameter space. LoRA-X, by contrast, focuses on the transferability of LoRA parameters across different base models, by operating within the subspace of the original model.
Conclusion
The methods discussed here are not the only denizens of PEFT. Others include QLoRA and QA-LoRA; Prefix Tuning; Prompt-Tuning; and adapter-tuning, among others.
The ‘upgradable LoRA’ is, perhaps, an alchemical pursuit; certainly, there is nothing immediately on the horizon that will prevent LoRA modelers from having to drag out their old datasets again for the latest and greatest weights release. If there is some possible prototype standard for weights revision, capable of surviving changes in architecture and ballooning parameters between model versions, it hasn’t emerged in the literature yet, and will need to keep being extracted from the data on a per-model basis.