A spectre is haunting chat models – the spectre of incorrect formatting!
tl;dr
Chat models have been trained with very different formats for converting conversations into a single tokenizable string. Using a format different from the one a model was trained with will usually cause severe, silent performance degradation, so matching the format used during training is extremely important! Hugging Face tokenizers now have a chat_template attribute that can be used to save the chat format the model was trained with. This attribute contains a Jinja template that converts conversation histories into a correctly formatted string. Please see the technical documentation for information on how to write and apply chat templates in your code.
Introduction
If you're familiar with the 🤗 Transformers library, you've probably written code like this:
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
By loading the tokenizer and model from the same checkpoint, you ensure that inputs are tokenized in the way the model expects. If you pick a tokenizer from a different model, the input tokenization may be completely different, and the result will be that your model's performance is seriously damaged. The term for this is a distribution shift – the model has been learning data from one distribution (the tokenization it was trained with), and suddenly it has shifted to a completely different one.
Whether you're fine-tuning a model or using it directly for inference, it's always a good idea to minimize these distribution shifts and keep the input you give it as similar as possible to the input it was trained on. With regular language models, it's relatively easy to do this – simply load your tokenizer and model from the same checkpoint, and you're good to go.
With chat models, however, it's a bit different. This is because "chat" is not just a single string of text that can be straightforwardly tokenized – it is a sequence of messages, each of which contains a role as well as content, which is the actual text of the message. Most commonly, the roles are "user" for messages sent by the user, "assistant" for responses written by the model, and optionally "system" for high-level directives given at the start of the conversation.
If that all seems a bit abstract, here's an example chat to make it more concrete:
[
{"role": "user", "content": "Hi there!"},
{"role": "assistant", "content": "Nice to meet you!"}
]
This sequence of messages needs to be converted into a text string before it can be tokenized and used as input to a model. The problem, though, is that there are many ways to do this conversion! You could, for example, convert the list of messages into an "instant messenger" format:
User: Hi there!
Bot: Nice to meet you!
Or you could add special tokens to indicate the roles:
[USER] Hi there! [/USER]
[ASST] Nice to meet you! [/ASST]
Or you could add tokens to indicate the boundaries between messages, but insert the role information as a string:
<|im_start|>user
Hi there!<|im_end|>
<|im_start|>assistant
Nice to meet you!<|im_end|>
There are lots of ways to do this, and none of them is obviously the best or correct one. As a result, different models have been trained with wildly different formatting. I didn't make these examples up; they're all real and being used by at least one active model! But once a model has been trained with a certain format, you really want to make sure that future inputs use the same format, or else you'll get a performance-destroying distribution shift.
Templates: A way to save format information
Right now, if you're lucky, the format you need is correctly documented somewhere in the model card. If you're unlucky, it isn't, so good luck if you want to use that model. In extreme cases, we've even put the whole prompt format in a blog post to make sure users don't miss it! Even in the best-case scenario, though, you have to locate the template information and manually code it up in your fine-tuning or inference pipeline. We think this is an especially dangerous problem because using the wrong chat format is a silent error – you won't get a loud failure or a Python exception to tell you something is wrong, the model will just perform much worse than it would have with the right format, and it will be very difficult to debug the cause!
This is the problem that chat templates aim to solve. Chat templates are Jinja template strings that are saved and loaded with your tokenizer, and that contain all the information needed to turn a list of chat messages into a correctly formatted input for your model. Here are three chat template strings, corresponding to the three message formats above:
{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ "User: " }}
    {% else %}
        {{ "Bot: " }}
    {% endif %}
    {{ message['content'] + '\n' }}
{% endfor %}
{% for message in messages %}
    {% if message['role'] == 'user' %}
        {{ "[USER] " + message['content'] + " [/USER]" }}
    {% else %}
        {{ "[ASST] " + message['content'] + " [/ASST]" }}
    {% endif %}
    {{ '\n' }}
{% endfor %}
"{% for message in messages %}"
"{im_start}"
"{% endfor %}"
If you're unfamiliar with Jinja, I strongly recommend that you take a moment to look at these template strings and their corresponding template outputs, and see if you can convince yourself that you understand how each template turns a list of messages into a formatted string! The syntax is very similar to Python in many ways.
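To make that concrete, here's a minimal sketch of rendering a chat with the first ("instant messenger") template above; the "gpt2" checkpoint is purely a placeholder, since any tokenizer can render this template:

from transformers import AutoTokenizer

# "gpt2" is only a placeholder checkpoint; any tokenizer can render this template
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# The "instant messenger" template from above, written as a single string
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ 'User: ' }}{% else %}{{ 'Bot: ' }}{% endif %}"
    "{{ message['content'] + '\n' }}"
    "{% endfor %}"
)

messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
]

# tokenize=False returns the formatted string rather than token IDs
print(tokenizer.apply_chat_template(messages, tokenize=False))
# User: Hi there!
# Bot: Nice to meet you!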
Why templates?
Although Jinja can be confusing at first if you're unfamiliar with it, in practice we find that Python programmers pick it up quickly. During development of this feature, we considered other approaches, such as a limited system that would allow users to specify per-role prefixes and suffixes for messages. We found that this became confusing and unwieldy, and was so inflexible that hacky workarounds were needed for several models. Templating, on the other hand, is powerful enough to cleanly support all of the message formats that we're aware of.
Why bother doing this? Why not just pick a standard format?
That would be an excellent idea! Unfortunately, it's too late, because multiple important models have already been trained with very different chat formats.
However, we can still mitigate this problem a bit. We think the closest thing to a 'standard' for formatting is the ChatML format created by OpenAI. If you're training a new model for chat, and this format is suitable for you, we recommend using it and adding special <|im_start|> and <|im_end|> tokens to your tokenizer. It has the advantage of being very flexible with roles, as the role is just inserted as a string rather than being marked by specific role tokens. If you'd like to use this one, it's the third of the templates above, and you can set it with this simple one-liner:
tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}"
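If you do adopt ChatML for a new model, you'll also want the <|im_start|> and <|im_end|> tokens to actually exist in your vocabulary. Here's a minimal sketch of one way to set that up; the checkpoint name is a placeholder, and the exact details will depend on your training setup:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the base model you're fine-tuning for chat
checkpoint = "your-base-model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Register the ChatML markers as special tokens so the tokenizer never splits them
tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})

# Grow the model's embedding matrix to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))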
There's also a second reason not to hardcode a standard format, though, beyond the proliferation of existing formats – we expect that templates will be broadly useful in preprocessing for many types of models, including ones that may be doing very different things from standard chat. Hardcoding a standard format limits the ability of model developers to use this feature to do things we haven't even thought of yet, whereas templating gives users and developers maximum freedom. It's even possible to encode checks and logic in templates, which is a feature we don't use extensively in any of the default templates, but which we expect to have enormous power in the hands of adventurous users; a small sketch of what that can look like follows below. We strongly believe that the open-source ecosystem should enable you to do what you want, not dictate to you what you're permitted to do.
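As a rough illustration of that kind of logic, here's a hedged sketch of a ChatML-style template that inserts a made-up default system message when the conversation doesn't start with one, and appends an assistant header when add_generation_prompt is set (the '-' markers are Jinja whitespace control, so the template's own line breaks don't leak into the output):

{# Sketch only: assumes a non-empty list of messages #}
{%- if messages[0]['role'] != 'system' -%}
    {{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
{%- endif -%}
{%- for message in messages -%}
    {{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{ '<|im_start|>assistant\n' }}
{%- endif -%}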
How do templates work?
Chat templates are part of the tokenizer, because they fulfill the same role that tokenizers do: they store information about how data is preprocessed, to make sure that you feed data to the model in the same format that it saw during training. We have designed it to be very easy to add template information to an existing tokenizer and save it or upload it to the Hub.
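For instance, attaching a template to an existing tokenizer and sharing it might look something like the sketch below; the repo names are placeholders, and pushing to the Hub assumes you're logged in:

from transformers import AutoTokenizer

# Placeholder repo name; substitute your own checkpoint
tokenizer = AutoTokenizer.from_pretrained("your-org/your-chat-model")

# Attach a template, e.g. the ChatML one-liner shown earlier
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
)

# Saving writes the template into tokenizer_config.json alongside the other tokenizer files...
tokenizer.save_pretrained("my-chat-model")

# ...or push it straight to the Hub
tokenizer.push_to_hub("your-org/your-chat-model")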
Before chat templates, chat formatting information was stored at the class level – this meant that, for example, all LLaMA checkpoints would get the same chat formatting, using code that was hardcoded in transformers for the LLaMA model class. For backward compatibility, model classes that had custom chat format methods have been given default chat templates instead.
Default chat templates are also set at the class level, and tell classes like ConversationalPipeline how to format inputs when the model does not have a chat template. We're doing this purely for backwards compatibility – we highly recommend that you explicitly set a chat template on any chat model, even when the default chat template is appropriate. This ensures that any future changes or deprecations of the default chat template don't break your model. Although we will be keeping default chat templates for the foreseeable future, we hope to transition all models to explicit chat templates over time, at which point default chat templates may be removed entirely.
For information about how to set and apply chat templates, please see the technical documentation.
How do I get started with templates?
Easy! If a tokenizer has the chat_template attribute set, it's ready to go. You can use that model and tokenizer in ConversationalPipeline, or you can call tokenizer.apply_chat_template() to format chats for inference or training. Please see our developer guide or the apply_chat_template documentation for more!
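For example, a typical inference flow might look something like this sketch; the checkpoint name is a placeholder, and add_generation_prompt only has an effect if the template makes use of it:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any chat model whose tokenizer has chat_template set will do
checkpoint = "your-org/your-chat-model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

messages = [{"role": "user", "content": "Hi there!"}]

# add_generation_prompt asks the template to append whatever tokens signal
# the start of an assistant reply (if the template supports it)
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = model.generate(input_ids, max_new_tokens=128)
# Strip the prompt tokens and decode only the newly generated reply
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))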
If a tokenizer doesn't have a chat_template attribute, it might still work, but it will use the default chat template set for that model class. This is fragile, as we mentioned above, and it's also a source of silent bugs when the class default doesn't match what the model was actually trained with. If you want to use a checkpoint that doesn't have a chat_template, we recommend checking documentation like the model card to verify what the right format is, and then adding a correct chat_template for that format. We recommend doing this even if the default chat template happens to be correct – it future-proofs the model, and also makes it clear that the template is present and suitable.
You can add a chat_template even to checkpoints that you don't own, by opening a pull request. The only change you need to make is to set the tokenizer.chat_template attribute to a Jinja template string. Once that's done, push your changes and you're ready to go!
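Done from code, that contribution might look roughly like this; the repo name is a placeholder, and create_pr opens the change as a pull request on the Hub rather than committing directly:

from transformers import AutoTokenizer

# Placeholder: a checkpoint you don't own but want to contribute a template to
repo_id = "some-org/some-chat-model"
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# A template matching the format documented in the model card (ChatML is used here as an example)
tokenizer.chat_template = (
    "{% for message in messages %}"
    "{{ '<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n' }}"
    "{% endfor %}"
)

# Open the change as a pull request instead of pushing to the main branch
tokenizer.push_to_hub(repo_id, create_pr=True, commit_message="Add chat_template")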
If you'd like to use a checkpoint for chat but you can't find any documentation on the chat format it used, you should probably open an issue on the checkpoint or ping the owner! Once you figure out the format the model is using, please open a pull request to add a suitable chat_template. Other users will really appreciate it!
Conclusion: Template philosophy
We think templates are a very exciting change. As well as resolving a huge source of silent, performance-killing bugs, we think they open up completely new approaches and data modalities. Perhaps most importantly, they also represent a philosophical shift: they move a big piece of functionality out of the core transformers codebase and into individual model repos, where users have the freedom to do weird and wild and wonderful things. We're excited to see what uses you find for them!
