The Art of Prompt Design: Prompt Boundaries and Token Healing
An example of a prompt boundary problem
Fixing unintended bias with “token healing”
What about subword regularization?
Conclusion
This post (written jointly with Marco Tulio Ribeiro) is part 2 of a series (part 1 here) on controlling large language models (LLMs) with guidance.
In this post, we discuss how the greedy tokenization methods used by language models can introduce a subtle and powerful bias into your prompts, leading to puzzling generations.
Language models are not trained on raw text, but rather on tokens, which are chunks of text that frequently occur together, similar to words. This impacts how language models ‘see’ text, including prompts (since prompts are just sets of tokens). GPT-style models use tokenization methods like Byte Pair Encoding (BPE), which map all input bytes to token ids in a greedy manner. This is fine for training, but it can lead to subtle issues during inference, as shown in the example below.
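To make the greedy behavior concrete, here is a minimal sketch of longest-match-first tokenization over a tiny hypothetical vocabulary (real BPE vocabularies have tens of thousands of entries, and real BPE applies learned merges rather than a prefix scan, but the boundary effect is the same). Note how the colon is a standalone token at the end of `"http:"`, yet gets absorbed into the longer `"://"` token when more text follows:

```python
# Toy vocabulary -- hypothetical, for illustration only.
VOCAB = {"http", "://", ":", "/"}

def greedy_tokenize(text):
    """Repeatedly take the longest vocabulary entry that prefixes the text."""
    tokens = []
    while text:
        for length in range(len(text), 0, -1):
            if text[:length] in VOCAB:
                tokens.append(text[:length])
                text = text[length:]
                break
        else:
            raise ValueError(f"cannot tokenize: {text!r}")
    return tokens

print(greedy_tokenize("http:"))    # ['http', ':']   -- colon stands alone
print(greedy_tokenize("http://"))  # ['http', '://'] -- colon merges into '://'
```

Because training text containing `http://` is almost always encoded with the `://` token, a prompt that ends with a standalone `:` token implicitly signals to the model that the next characters are probably *not* `//` (otherwise the tokenizer would have merged them).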
Consider the following example, where we attempt to generate an HTTP URL string:
import guidance

# we use StableLM as an example, but these issues impact all models to varying degrees
guidance.llm = guidance.llms.Transformers("stabilityai/stablelm-base-alpha-3b", device=0)

# we turn token healing off so that guidance acts like a normal prompting library
program = guidance('The link is <a href="http:{{gen max_tokens=10 token_healing=False}}')
program()
Notebook output.
Note that the output generated by the LLM does not complete the URL with the obvious next characters (two forward slashes). It instead creates an invalid URL string with a space in the middle. This is surprising, because the // completion is extremely obvious after http:. To understand why this happens, let’s change our prompt boundary so that our prompt does not include the colon character:
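A sketch of the adjusted program is below. The exact prompt string is an assumption based on the surrounding text; the only change from the earlier snippet is that the prompt now ends with `http` rather than `http:`, leaving the model free to emit the merged `://` token itself:

```python
import guidance

guidance.llm = guidance.llms.Transformers("stabilityai/stablelm-base-alpha-3b", device=0)

# the prompt boundary now falls before the colon, so the model
# can choose the '://' token rather than being handed a bare ':'
program = guidance('The link is <a href="http{{gen max_tokens=10 token_healing=False}}')
program()
```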
When you write prompts, remember that greedy tokenization can have a significant impact on how language models interpret your prompts, particularly when the prompt ends with a token that could be extended into a longer token. This easy-to-miss source of bias can impact your results in surprising and unintended ways.
To address this, either end your prompt with a non-extendable token, or use something like guidance’s “token healing” feature so you can express your prompts however you wish, without worrying about token boundary artifacts.
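Conceptually, token healing backs the prompt up by one token and then constrains the first generated token to be one whose text extends the removed token. A minimal sketch with a toy vocabulary (the vocabulary and helper names here are hypothetical, not guidance’s actual implementation):

```python
# Toy vocabulary -- hypothetical, for illustration only.
VOCAB = ["http", "://", ":", "//"]

def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization, as greedy BPE-style encoders do."""
    tokens = []
    while text:
        match = max((v for v in vocab if text.startswith(v)), key=len, default=None)
        if match is None:
            raise ValueError(f"cannot tokenize: {text!r}")
        tokens.append(match)
        text = text[len(match):]
    return tokens

def heal_prompt(prompt, vocab):
    """Back up over the prompt's last token, and return the trimmed prompt
    plus the vocabulary tokens the first generated token may be drawn from
    (those whose text starts with the removed token's text)."""
    last = greedy_tokenize(prompt, vocab)[-1]
    trimmed = prompt[: len(prompt) - len(last)]
    allowed = [v for v in vocab if v.startswith(last)]
    return trimmed, allowed

trimmed, allowed = heal_prompt("http:", VOCAB)
print(trimmed)  # 'http'        -- the trailing ':' token was removed
print(allowed)  # [':', '://']  -- the model may now pick '://'
```

With the trailing colon healed away, the model is allowed to choose the `://` token it would have used in training, removing the boundary bias while still guaranteeing the generation is consistent with the prompt text you wrote.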
To reproduce the results in this article yourself, check out the notebook version.