Transformers v4.0.0-rc-1, the first release candidate for version 4, was released on November 19, 2020. Five years later, we now release v5.0.0-rc-0.
Today, as we launch v5, Transformers is installed more than 3 million times every day via pip – up from 20,000/day at the time of v4 🤯. Altogether, it has now surpassed 1.2 billion installs!
The ecosystem has expanded from 40 model architectures in v4 to over 400 today, and the community has contributed more than 750,000 model checkpoints on the Hub compatible with Transformers, up from roughly 1,000 at the time of v4.
This growth is powered by the evolution of the field and by the now mainstream access to AI. As a leading model-definition library in the ecosystem, we need to constantly evolve and adapt the library to remain relevant. Reinvention is key for longevity in AI.
We’re fortunate to collaborate with many libraries and apps built on transformers, in no particular order: llama.cpp, MLX, onnxruntime, Jan, LMStudio, vLLM, SGLang, Unsloth, LlamaFactory, dLLM, MaxText, TensorRT, Argmax, among many other friends.
For v5, we wanted to work on several notable areas: simplicity, training, inference, and production. We detail the work that went into each of them in this post.
Simplicity
The primary focus of the team was on simplicity. Working on transformers, we see the code as the product. We want our model integrations to be clean, so that the ecosystem can rely on our model definitions and understand what’s really happening under the hood, how models differ from one another, and the key features of each new model. Simplicity results in wider standardization, generality, and broader support.
Model Additions
Transformers is the backbone of hundreds of thousands of projects, Unsloth included. We build on Transformers to help people fine-tune and train models efficiently, whether that’s BERT, text-to-speech (TTS), or others, and to run fast inference for reinforcement learning (RL) even when models aren’t yet supported in other libraries. We’re excited for Transformers v5 and are super happy to be working with the Hugging Face team!
— Michael Han at Unsloth
Transformers, at its core, remains a model architecture toolkit. We aim to have all new architectures and to be the “source of truth” for model definitions. We’ve been adding between 1 and 3 new models every week for five years, as shown in the timeline below:
We’ve worked on improving that model-addition process.
Modular Approach
Over the past year, we’ve heavily pushed our modular design as a major step forward. It allows for easier maintenance, faster integration, and better collaboration across the community.
We give a deeper overview in our Maintain the Unmaintainable blog post. In short, we aim to achieve a much simpler model contribution process, as well as a lower maintenance burden. One metric we can highlight is that the number of lines of code to contribute (and review) drops significantly when modular is used:

While we respect the “one model, one file” philosophy, we continue to introduce abstractions that make managing common helpers easier. The prime example is the introduction of the AttentionInterface, which offers a centralized abstraction for attention methods. The eager implementation remains in the modeling file; others, such as FA1/2/3, FlexAttention, or SDPA, move to the interface.
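As a rough sketch of what this centralization enables, an attention backend can be selected at load time or registered as a custom implementation; the exact registration API may evolve, and the checkpoint name below is purely illustrative:

```python
from transformers import AutoModelForCausalLM, AttentionInterface
from transformers.integrations.sdpa_attention import sdpa_attention_forward

# Wrap the stock SDPA implementation to add a side effect (logging, profiling, ...)
def sdpa_with_logging(*args, **kwargs):
    print("entering attention")
    return sdpa_attention_forward(*args, **kwargs)

# Register the custom backend under a name of our choosing
AttentionInterface.register("sdpa_with_logging", sdpa_with_logging)

# Any registered (or built-in) backend can then be selected with `attn_implementation`
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # illustrative checkpoint
    attn_implementation="sdpa_with_logging",
)
```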
Over the past couple of years, the increasing amount of 0-day support for new model architectures and the standardization of attention handling have helped simplify our support for post-training modern LLMs.
— Wing Lian, Axolotl
Tooling for Model Conversion
We’re building tooling to help us identify which existing model architecture a new model resembles. This feature uses machine learning to find code similarities between independent modeling files. Going further, we aim to automate the conversion process by opening a draft PR for the model to be integrated into our transformers format. This reduces manual effort and ensures consistency.
Code Reduction
Streamlining Modeling & Tokenization/Processing Files
We’ve significantly refactored the modeling and tokenization files. Modeling files have been greatly improved thanks to the modular approach mentioned above, on top of standardization across models. Standardization helps abstract away most of the tooling that isn’t part of a model, so that the modeling code only contains the parts relevant to a model’s forward/backward passes.
Alongside this work, we’re simplifying the tokenization and processing files: going forward, we’ll focus solely on the tokenizers backend, removing the concept of “Fast” and “Slow” tokenizers.
We’ll use tokenizers as our primary tokenization backend, just as PyTorch is the primary backend for models. We’ll offer alternatives for SentencePiece- or MistralCommon-backed tokenizers, which will be non-default but supported. Image processors will now only exist in their fast variant, which depends on the torchvision backend.
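In practice, loading looks the same as before; the snippet below is a minimal sketch (the model ids are illustrative, and the fast variant may already be the default, making `use_fast=True` redundant):

```python
from transformers import AutoTokenizer, AutoImageProcessor

# Backed by the Rust `tokenizers` library (no separate "slow" Python implementation)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Fast image processors rely on the torch/torchvision backend
image_processor = AutoImageProcessor.from_pretrained(
    "google/vit-base-patch16-224", use_fast=True
)
```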
Finally, we’re sunsetting our Flax/TensorFlow support in favor of focusing on PyTorch as the sole backend; however, we’re also working with partners in the JAX ecosystem to ensure compatibility between our models and that ecosystem.
With its v5 release, transformers goes all in on PyTorch. Transformers acts as a source of truth and foundation for modeling across the field; we have been working with the team to ensure good performance across the stack. We’re excited to continue pushing for this in the future across training, inference, and deployment.
— Matt White, Executive Director, PyTorch Foundation. GM of AI, Linux Foundation
Training
Training remains a major focus of the team as we head into v5: whereas previously we focused heavily on fine-tuning rather than pre-training/full-training at scale, we’ve recently done significant work to improve our support for the latter as well.
Pre-training at scale
Supporting pre-training meant reworking the initialization of our models, ensuring that they work at scale with different parallelism paradigms, and shipping support for optimized kernels for both the forward and backward passes.
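As an illustrative sketch of the parallelism side (the checkpoint name and launch command are only examples, and the exact arguments may evolve), a model can be sharded across GPUs with a tensor-parallel plan directly at load time:

```python
# Launch with, e.g.: torchrun --nproc-per-node 4 tp_example.py
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B",   # illustrative checkpoint
    dtype=torch.bfloat16,
    tp_plan="auto",      # shard weights across the available GPUs (tensor parallelism)
)
```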
Going forward, we’re excited to have extended compatibility with torchtitan, megatron, nanotron, as well as any other pre-training tool interested in collaborating with us.
Fine-tuning & Post-training
We continue collaborating closely with all fine-tuning tools in the Python ecosystem. We aim to keep providing model implementations compatible with Unsloth, Axolotl, LlamaFactory, TRL, and others in the PyTorch ecosystem; but we’re also working with tools such as MaxText, in the JAX ecosystem, to have good interoperability between their frameworks and transformers.
All fine-tuning and post-training tools can now depend on transformers for model definitions, further enabling agentic use cases through OpenEnv or the Prime Environment Hub.
Inference
We’re putting a major focus on inference for v5, with several paradigm changes: the introduction of specialized kernels, cleaner defaults, new APIs, and support for optimized inference engines.
As with training, we’ve been putting effort into packaging kernels so that they’re automatically used when your hardware and software allow it. If you haven’t heard of kernels before, we recommend taking a look at this doc.
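As a minimal sketch of what the kernels library enables (the kernel repository name is taken from its documentation and used here purely as an example), an optimized kernel can be pulled from the Hub and called like any other module:

```python
import torch
from kernels import get_kernel

# Download a pre-compiled, optimized kernel from the Hub
activation = get_kernel("kernels-community/activation")

x = torch.randn(4, 16, dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # fused GELU, written into `y`
```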
Alongside this effort, we ship two new APIs dedicated to inference:
- We ship support for continuous batching and paged attention mechanisms. This has been used internally for a while, and we’re working on smoothing out the rough edges and writing usage guides.
- We introduce `transformers serve` as the new transformers-specific serving system, which deploys an OpenAI API-compatible server.
We see this as a major step forward for use cases such as evaluation, where a great number of inference requests are issued concurrently. We don’t aim to match the specialized optimizations of the dedicated inference engines (vLLM, SGLang, TensorRT-LLM). Instead, we aim to be fully inter-compatible with them, as detailed in the next section.
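Since the server speaks the OpenAI API, any compatible client can talk to it; the sketch below assumes the server was started locally with `transformers serve` and is listening on port 8000 (an assumption), and the model id is illustrative:

```python
# In a separate terminal:  transformers serve
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused-for-local-server")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```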
The Transformers backend in vLLM has been very helpful in making more architectures, like BERT and other encoders, available to more users. We have been working with the Transformers team to make sure many models are available across modalities with the best performance possible. This is just the beginning of our collaboration: we’re happy to see the Transformers team will have this as a focus going into version 5.
— Simon Mo, Harry Mellor at vLLM
Standardization is key to accelerating AI innovation. Transformers v5 empowers the SGLang team to spend less time on model reimplementation and more time on kernel optimization. We look forward to building a more efficient and unified AI ecosystem together!
— Chenyang Zhao at SGLang
Production & Local
Recently, we have been working hand in hand with the most popular inference engines so they can use transformers as a backend. The added value is significant: as soon as a model is added to transformers, it becomes available in these inference engines, while benefiting from the strengths each engine provides: inference optimizations, specialized kernels, dynamic batching, etc.
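As an example of what this looks like from the vLLM side (a sketch; the model id is illustrative), the Transformers implementation can be selected explicitly:

```python
from vllm import LLM, SamplingParams

# Ask vLLM to use the Transformers modeling code as its backend
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", model_impl="transformers")

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```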
We have also been working very closely with ONNXRuntime, llama.cpp, and MLX so that the implementations in transformers and these modeling libraries have great interoperability. For instance, thanks to a significant community effort, it’s now very easy to load GGUF files in transformers for further fine-tuning. Conversely, transformers models can easily be converted to GGUF files for use with llama.cpp.
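As a minimal sketch of the GGUF path (the repository and filename below are purely illustrative), a quantized GGUF checkpoint can be loaded and dequantized into a regular PyTorch model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"   # illustrative repository
filename = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"    # illustrative file

# The GGUF weights are dequantized into a standard PyTorch model, ready for fine-tuning
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=filename)
```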
The Transformers framework is the go-to place for reference AI model implementations. The framework plays an important role in enabling modern AI across the entire stack. The team and the community behind the project truly understand and embrace the spirit of open-source development and collaboration.
— Georgi Gerganov, ggml-org
The same is true for MLX, where transformers’ safetensors files are directly compatible with MLX’s models.
It’s hard to overstate the importance of Transformers (and datasets, tokenizers, etc.) to the open-source and overall AI ecosystem. I can’t count the number of times I’ve personally used Transformers as a source of truth.
— Awni Hannun, MLX
Finally, we’re pushing the boundaries of local inference and are working hand in hand with the executorch team to make transformers models available on-device. We’re expanding the coverage to multimodal models (vision, audio).
Quantization
Quantization is quickly emerging as the standard for state-of-the-art model development. Many SOTA models are now released in low-precision formats such as 8-bit and 4-bit (e.g., gpt-oss, Kimi-K2, DeepSeek-R1), hardware is increasingly optimized for low-precision workloads, and the community is actively sharing high-quality quantized checkpoints. In v5, we’re making quantization a central focus of Transformers support, ensuring full compatibility with all major features and delivering a reliable framework for training and inference.
We introduce a major change to the way we load weights in our models, and with it, quantization becomes a first-class citizen.
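As a quick sketch of what first-class quantization looks like from the user side (a bitsandbytes 4-bit example; the model id is illustrative, and other backends such as TorchAO follow the same `quantization_config` pattern):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct",  # illustrative model id
    quantization_config=quant_config,
    device_map="auto",
)
```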
Our collaboration with the Transformers team was highly productive, marked by their proactive code reviews, feedback, and technical expertise. Their support was crucial in integrating TorchAO, expanding quantization features, and improving documentation for broader adoption in v5.
— Jerry Zhang at TorchAO
We’re excited that v5 has made quantization a first-class citizen. It provides the foundation for bitsandbytes to better support key features like TP and MoEs, and it also makes it easier to integrate new quantization methods.
— Matthew Douglas & Titus von Koeller, bitsandbytes
Conclusion
The overarching theme of this version 5 release is “interoperability”. All refactors, performance improvements, and standardization are aligned with this theme. v5 plays nicely, end to end, with the growing ecosystem: train a model with Unsloth/Axolotl/LlamaFactory/MaxText, deploy it with vLLM/SGLang, and export it to llama.cpp/executorch/MLX to run locally!
Version 5 is undeniably an accomplishment of the past five years by a very large number of people in our community. We also see it as a promise, and as a beacon of the direction we want to go.
We took it as an opportunity to clean up the toolkit and isolate what matters; we now have a clean slate on which to build. Thanks to the many changes from the community and the team, improvements in performance, usability, and readability will be easier to ship.
Now that v5.0.0’s first RC is available, we’ll be eagerly awaiting your feedback. Please check our release notes for all the technical details, and let us know what you think in our GitHub issues!
