In the rapidly evolving landscape of large language models (LLMs), the spotlight has largely focused on the decoder-only architecture. While these models have shown impressive capabilities across a wide range of generation tasks, the classic encoder-decoder architecture, exemplified by T5 (the Text-to-Text Transfer Transformer), remains a popular choice for many real-world applications. Encoder-decoder models often excel at summarization, translation, question answering, and more thanks to their high inference efficiency, design flexibility, and richer encoder representations for understanding the input. Yet this powerful architecture has received comparatively little attention.
Today, we revisit this architecture and introduce T5Gemma, a new collection of encoder-decoder LLMs developed by converting pretrained decoder-only models into the encoder-decoder architecture through a technique called adaptation. T5Gemma is based on the Gemma 2 framework and includes adapted Gemma 2 2B and 9B models as well as a set of newly trained T5-sized models (Small, Base, Large, and XL). We are excited to release pretrained and instruction-tuned T5Gemma models to the community to unlock new opportunities for research and development.
From decoder-only to encoder-decoder
In T5Gemma, we ask the following question: can we build top-tier encoder-decoder models from pretrained decoder-only models? We answer this question by exploring a technique called model adaptation. The core idea is to initialize the parameters of an encoder-decoder model with the weights of an already pretrained decoder-only model, and then further adapt them via UL2- or PrefixLM-based pre-training.

An overview of our approach, showing how we initialize a new encoder-decoder model using the parameters of a pretrained decoder-only model.
This adaptation method is highly flexible, allowing for creative combinations of model sizes. For instance, we can pair a large encoder with a small decoder (e.g., a 9B encoder with a 2B decoder) to create an "unbalanced" model. This lets us tune the quality-efficiency trade-off for specific tasks, such as summarization, where a deep understanding of the input matters more than the complexity of the generated output.
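To make the idea concrete, here is a minimal sketch of the adaptation step in PyTorch-style Python. All names (`adapt_to_encoder_decoder`, the `layers`, `self_attn`, and `cross_attn` attributes) are hypothetical stand-ins rather than the actual Gemma 2 modules, and warm-starting cross-attention from self-attention is one plausible choice, not necessarily the exact recipe from the paper.

```python
# Conceptual sketch only: module and attribute names are hypothetical
# stand-ins, not the real Gemma 2 implementation.

def adapt_to_encoder_decoder(src_enc, src_dec, enc_dec):
    """Initialize an encoder-decoder skeleton from pretrained decoder-only models."""
    # Encoder blocks reuse the pretrained blocks of one decoder-only model;
    # dropping the causal mask (handled in the config) makes attention bidirectional.
    for enc_layer, src_layer in zip(enc_dec.encoder.layers, src_enc.layers):
        enc_layer.load_state_dict(src_layer.state_dict(), strict=False)

    # Decoder blocks reuse the blocks of a (possibly smaller) decoder-only model.
    for dec_layer, src_layer in zip(enc_dec.decoder.layers, src_dec.layers):
        dec_layer.load_state_dict(src_layer.state_dict(), strict=False)
        # Cross-attention has no counterpart in the source model; warm-starting
        # it from the source self-attention weights is one plausible choice.
        dec_layer.cross_attn.load_state_dict(src_layer.self_attn.state_dict(), strict=False)

    return enc_dec  # then further pretrained with UL2 or PrefixLM

# An "unbalanced" pairing: a 9B model seeds the encoder, a 2B model the decoder.
# t5gemma_9b_2b = adapt_to_encoder_decoder(gemma2_9b, gemma2_2b, skeleton_9b_2b)
```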
Towards a better quality-efficiency trade-off
How does T5Gemma perform?
In our experiments, T5Gemma models achieve comparable or better performance than their decoder-only Gemma counterparts, nearly dominating the quality-inference efficiency Pareto frontier across several benchmarks, such as SuperGLUE, which measures the quality of the learned representations.

Encoder-decoder models consistently offer better performance for a given level of inference compute, leading the quality-efficiency frontier across a range of benchmarks.
This performance advantage is not just theoretical; it translates to real-world quality and speed. When we measured actual latency on GSM8K (math reasoning), T5Gemma delivered a clear win. For example, T5Gemma 9B-9B achieves higher accuracy than Gemma 2 9B at similar latency. Even more impressively, T5Gemma 9B-2B delivers a significant accuracy boost over the 2B-2B model, yet its latency is nearly identical to that of the much smaller Gemma 2 2B model. Ultimately, these experiments show that encoder-decoder adaptation offers a flexible, powerful way to balance quality against inference speed.
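For readers who want to reproduce this kind of comparison, here is a minimal latency-measurement sketch using Hugging Face transformers. The model ID is an assumed example (check the released model cards for the exact checkpoint names), and the prompt is simply a GSM8K-style problem.

```python
# Minimal end-to-end latency sketch. The model ID below is an assumed example;
# consult the released model cards for the exact checkpoint names.
import time
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/t5gemma-9b-2b"  # hypothetical unbalanced checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = ("Natalia sold clips to 48 of her friends in April, and then she sold "
          "half as many clips in May. How many clips did Natalia sell altogether?")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up run so the measurement excludes one-off allocation costs.
model.generate(**inputs, max_new_tokens=16)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
latency = time.perf_counter() - start

print(f"Generated {outputs.shape[-1]} output tokens in {latency:.2f}s")
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```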
Unlocking foundational and fine-tuned capabilities
Could encoder-decoder LLMs have similar capabilities to decoder-only models?
Yes, T5Gemma shows promising capabilities both before and after instruction tuning.
After pre-training, T5Gemma achieves impressive gains on complex tasks that require reasoning. For instance, T5Gemma 9B-9B scores over 9 points higher on GSM8K (math reasoning) and 4 points higher on DROP (reading comprehension) than the original Gemma 2 9B model. This pattern demonstrates that the encoder-decoder architecture, when initialized via adaptation, can yield a more capable, performant foundation model.

Detailed results for pretrained models, illustrating the significant gains adapted models achieve on several reasoning-intensive benchmarks compared with decoder-only Gemma 2.
These foundational improvements from pre-training set the stage for even more dramatic gains after instruction tuning. Comparing Gemma 2 IT to T5Gemma IT, the performance gap widens significantly across the board. T5Gemma 2B-2B IT sees its MMLU score jump by nearly 12 points over Gemma 2 2B, and its GSM8K score increases from 58.0% to 70.7%. The adapted architecture not only provides a better starting point but also responds more effectively to instruction tuning, ultimately resulting in a substantially more capable and helpful final model.

Detailed results for fine-tuned and RLHF-tuned models, illustrating how post-training significantly amplifies the performance benefits of the encoder-decoder architecture.
Explore our models: Releasing T5Gemma checkpoints
We are excited to present this new approach to building powerful, general-purpose encoder-decoder models by adapting pretrained decoder-only LLMs like Gemma 2. To help accelerate further research and let the community build on this work, we are releasing a suite of T5Gemma checkpoints.
The release includes:
- Multiple Sizes: Checkpoints for T5-sized models (Small, Base, Large, and XL), the Gemma 2-based models (2B and 9B), as well as an additional model between T5 Large and T5 XL.
- Multiple Variants: Pretrained and instruction-tuned models.
- Flexible Configurations: A powerful and efficient unbalanced 9B-2B checkpoint to explore the trade-off between encoder and decoder size.
- Different Training Objectives: Models trained with either the PrefixLM or UL2 objective, offering either state-of-the-art generative performance or representation quality.
We hope these checkpoints will provide a valuable resource for investigating model architecture, efficiency, and performance.
Getting started with T5Gemma
We can't wait to see what you build with T5Gemma. See the following links for more information:
- Learn about the research behind this project by reading the paper.
- Explore the models' capabilities or fine-tune them for your own use cases with the Colab notebook, or start from the minimal sketch below.
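As a starting point, a minimal quick-start sketch with Hugging Face transformers might look like the following. The model ID is an assumed example, so check the model cards for the exact checkpoint names.

```python
# Minimal quick-start sketch (the model ID is an assumed example; check the
# model cards for the exact checkpoint names).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/t5gemma-2b-2b-prefixlm-it"  # hypothetical instruction-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto")

text = ("Summarize: Encoder-decoder models pair a bidirectional encoder with an "
        "autoregressive decoder, which makes them efficient for input-heavy tasks.")
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```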
