The Art of Prompt Design: Prompt Boundaries and Token Healing
An example of a prompt boundary problem
Fixing unintended bias with “token healing”
What about subword regularization?
Conclusion
This post (written jointly with Marco Tulio Ribeiro) is part 2 of a series (part 1 here) in which we discuss how to control large language models (LLMs) with guidance.
In this post, we'll discuss how the greedy tokenization methods used by language models can introduce a subtle and powerful bias into your prompts, leading to puzzling generations.
Language models are not trained on raw text, but rather on tokens, which are chunks of text that frequently occur together, similar to words. This impacts how language models 'see' text, including prompts (since prompts are just sets of tokens). GPT-style models use tokenization methods like Byte Pair Encoding (BPE), which map all input bytes to token ids in a greedy manner. This is fine for training, but it can lead to subtle issues during inference, as shown in the example below.
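To make the greedy behavior concrete, here is a minimal sketch of longest-match tokenization over a small hypothetical vocabulary (real BPE vocabularies are learned from data and merged pair by pair, so this is an illustration of the idea, not an actual BPE implementation):

```python
# Toy greedy (longest-match) tokenizer illustrating how BPE-style
# tokenization commits to the longest token it can at each position.
# The vocabulary below is hypothetical; real vocabularies are learned.
def greedy_tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # pick the longest vocabulary entry that matches at position i
        match = max(
            (tok for tok in vocab if text.startswith(tok, i)),
            key=len,
            default=text[i],  # fall back to a single character
        )
        tokens.append(match)
        i += len(match)
    return tokens

vocab = {"http", ":", "/", "//", "://", "The", " link", " is"}

print(greedy_tokenize("http://", vocab))  # ['http', '://']
print(greedy_tokenize("http:", vocab))    # ['http', ':']
```

Because "://" is a single token in this toy vocabulary, the token ":" followed by the token "//" almost never appears in tokenized training data, even though the character sequence "://" is extremely common. This is the mismatch that causes the bias discussed next.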
Consider the following example, where we are trying to generate an HTTP URL string:
import guidance

# we use StableLM as an example, but these issues affect all models to varying degrees
guidance.llm = guidance.llms.Transformers("stabilityai/stablelm-base-alpha-3b", device=0)

# we turn token healing off so that guidance acts like a normal prompting library
program = guidance('The link is <a href="http:{{gen max_tokens=10 token_healing=False}}')
program()
Notebook output.
Note that the output generated by the LLM does not complete the URL with the obvious next characters (two forward slashes). It instead creates an invalid URL string with a space in the middle. This is surprising, because the // completion is extremely obvious after http:. To understand why this happens, let's change our prompt boundary so that our prompt does not include the colon character:
As you write prompts, remember that greedy tokenization can have a significant impact on how language models interpret your prompts, particularly when the prompt ends with a token that could be extended into a longer token. This easy-to-miss source of bias can impact your results in surprising and unintended ways.
To address this, either end your prompt with a non-extendable token, or use something like guidance's "token healing" feature so that you can express your prompts however you wish, without worrying about token boundary artifacts.
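The core idea behind token healing can be sketched as follows: back up the prompt by one token, then constrain the first generated token to those whose text starts with the removed token's text. This is a minimal illustration over a hypothetical toy vocabulary, not guidance's actual implementation:

```python
# Minimal sketch of the token healing idea: drop the last prompt token
# and restrict the first generated token to extensions of its text.
# The vocabulary here is a hypothetical toy example.
def heal_prompt(prompt_tokens, vocab):
    """Back up one token; return (shortened prompt, allowed first tokens)."""
    *head, last = prompt_tokens
    allowed = [tok for tok in vocab if tok.startswith(last)]
    return head, allowed

vocab = {"http", ":", "//", "://"}
head, allowed = heal_prompt(["http", ":"], vocab)
print(head)             # ['http']
print(sorted(allowed))  # [':', '://']
```

With the prompt healed back to the "http" boundary, the model is free to pick the natural "://" token, recovering the completion that the colon boundary was biasing against, while still guaranteeing the generated text starts with the colon the user wrote.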
To reproduce the results in this article yourself, check out the notebook version.