The Art of Prompt Design: Prompt Boundaries and Token Healing
An example of a prompt boundary problem
Fixing unintended bias with “token healing”
What about subword regularization?
Conclusion

All images are original creations.

This post (written jointly with Marco Tulio Ribeiro) is part 2 of a series on controlling large language models (LLMs) with guidance (part 1 here).

In this post, we'll discuss how the greedy tokenization methods used by language models can introduce a subtle and powerful bias into your prompts, leading to puzzling generations.

Language models are not trained on raw text, but rather on tokens, which are chunks of text that often occur together, similar to words. This impacts how language models 'see' text, including prompts (since prompts are just sequences of tokens). GPT-style models use tokenization methods like Byte Pair Encoding (BPE), which map all input bytes to token ids in a greedy manner. This is fine for training, but it can lead to subtle issues during inference, as shown in the example below.
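As a quick illustration of what the model actually 'sees' (this is a small check of our own, using the Hugging Face tokenizer for the StableLM model we load later; the exact token strings depend on the vocabulary, and "Ġ" marks a leading space):

from transformers import AutoTokenizer

# the tokenizer for the StableLM model used throughout this post
tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

# the model never sees raw characters, only chunks like these
print(tok.tokenize('Prompt boundaries and token healing'))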

Consider the following example, where we are trying to generate an HTTP URL string:

import guidance

# we use StableLM as an example, but these issues affect all models to varying degrees
guidance.llm = guidance.llms.Transformers("stabilityai/stablelm-base-alpha-3b", device=0)

# we turn token healing off so that guidance acts like a normal prompting library
program = guidance('The link is <a href="http:{{gen max_tokens=10 token_healing=False}}')
program()

Notebook output.

Note that the output generated by the LLM does not complete the URL with the obvious next characters (two forward slashes). Instead it creates an invalid URL string with a space in the middle. This is surprising, because the // completion is extremely obvious after http:. To understand why this happens, let's change our prompt boundary so that our prompt does not include the colon character:

guidance('The link is <a href="http{{gen max_tokens=10 token_healing=False}}')()

Now the language model generates a valid URL string, as we expect. To understand why the : matters, we need to look at the tokenized representation of the prompts. Below is the tokenization of the prompt that ends in a colon (the prompt without the colon has the same tokenization, except for the last token):

print_tokens(guidance.llm.encode('The link is <a href="http:'))

Now note what the tokenization of a valid URL looks like, paying careful attention to token 1358, which comes right after http:

print_tokens(guidance.llm.encode('The link is <a href="http://www.google.com/search?q'))

Most LLMs (including this one) use a greedy tokenization method, always preferring the longest possible token, i.e. :// will always be preferred over : when tokenizing full text (e.g. during training).
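You can check this greedy behavior directly (again a check of our own with the Hugging Face tokenizer, so the token strings shown may differ slightly from the guidance output above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

# with the full text available, "://" is merged into a single token...
print(tok.tokenize('http://www.google.com'))
# ...but a prompt that stops at the colon ends with the shorter ":" token
print(tok.tokenize('http:'))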

While URLs in the training data are encoded with token 1358 (://), our prompt makes the LLM see token 27 (:) instead, which throws off the completion by artificially splitting :// apart.

In fact, the model can be pretty confident that seeing token 27 (:) means whatever comes next is unlikely to be anything that could have been encoded together with the colon into a "longer token" like ://, since in the model's training data those characters would have been encoded together with the colon (an exception to this, which we'll discuss later, is subword regularization during training). The fact that seeing a token means both seeing the embedding of that token and that whatever comes next was not compressed by the greedy tokenizer is easy to forget, but it is important for prompt boundaries.
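We can probe this bias directly. The snippet below is our own quick check (not part of the original post): with a prompt that ends in the ":" token, the model puts almost no probability on a "//" token coming next.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")

ids = tok('The link is <a href="http:', return_tensors="pt").input_ids
with torch.no_grad():
    next_token_probs = torch.softmax(model(ids).logits[0, -1], dim=-1)

# probability of the "//" token coming next (expected to be tiny)
slash_id = tok.convert_tokens_to_ids(tok.tokenize('//')[0])
print(next_token_probs[slash_id].item())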

Let's search over the string representations of all the tokens in the model's vocabulary to see which of them start with a colon:

print_tokens(guidance.llm.prefix_matches(":"))

Note that there are 34 different tokens starting with a colon, so ending a prompt with a colon means the model will likely not generate completions that begin with any of those 34 token strings. This subtle and powerful bias can have all kinds of unintended consequences. And it applies to any string that could potentially be extended into a longer single token (not just :). Even our "fixed" prompt ending with "http" has a built-in bias, because it communicates to the model that what comes after "http" is probably not "s" (otherwise "http" would not have been encoded as a separate token):

print_tokens(guidance.llm.prefix_matches("http"))
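If you want to reproduce this vocabulary scan without guidance's helper, here is a rough stand-in of our own (prefix_matches_hf is a name we made up, and counts may differ slightly depending on how individual tokens are decoded):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

def prefix_matches_hf(prefix):
    # return every vocabulary string that strictly extends `prefix` into a longer token
    return [s for i in range(len(tok))
            if (s := tok.decode([i])).startswith(prefix) and s != prefix]

print(len(prefix_matches_hf(':')))     # tokens like "://", "::", ...
print(len(prefix_matches_hf('http')))  # tokens like "https", ...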

Lest you think this is an arcane problem that only affects URLs, remember that most tokenizers treat tokens differently depending on whether they start with a space, punctuation, quotes, etc., so ending a prompt with any of these can shift the token boundary and break things:

# Accidentally adding a trailing space leads to weird generation
guidance('I read a book about {{gen max_tokens=5 token_healing=False temperature=0}}')()
# No space, works as expected
guidance('I read a book about{{gen max_tokens=5 token_healing=False temperature=0}}')()
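You can see the mechanism in the tokenization itself (again a small check of our own; "Ġ" marks a leading space in this vocabulary):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")

print(tok.tokenize('I read a book about'))   # ends with the "Ġabout" token
print(tok.tokenize('I read a book about '))  # the trailing space ends up in its own token,
                                             # biasing the model against starting the next word with a space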

Another example of this is the "[" character. Consider the following prompt and completion:

guidance('An example ["like this"] and one other example [{{gen max_tokens=10 token_healing=False}}')()

Why is the second string not quoted? Because by ending our prompt with the “ [” token, we are telling the model that it should not generate completions that match the following 27 longer tokens (one of which adds the quote character, 15640):

print_tokens(guidance.llm.prefix_matches(" ["))

Token boundary bias happens everywhere. Over 70% of the 10k most-common tokens for the StableLM model used above are prefixes of longer possible tokens, and so cause token boundary bias when they are the last token in a prompt.
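That figure is easy to approximate yourself. The sketch below is our own rough estimate, not the authors' exact script: it treats the lowest 10k token ids as a proxy for the most common tokens (BPE ids roughly follow merge order) and checks which of them are strict prefixes of some other vocabulary string.

import bisect
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")
vocab_strings = tok.convert_ids_to_tokens(list(range(len(tok))))
vocab_sorted = sorted(vocab_strings)

def is_extendable(t):
    # in lexicographic order, any strict extension of t sorts immediately
    # after the entries equal to t
    i = bisect.bisect_right(vocab_sorted, t)
    return i < len(vocab_sorted) and vocab_sorted[i].startswith(t)

common = vocab_strings[:10_000]  # proxy for the 10k most common tokens
share = sum(is_extendable(t) for t in common) / len(common)
print(f"{share:.0%} of these tokens can be extended into a longer token")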

What can we do to avoid these unintended biases? One option is to always end our prompts with tokens that cannot be extended into longer tokens (for example a role tag for chat-based models), but this is a severe limitation.

Instead, guidance has a feature called “token healing”, which automatically backs up the generation process by one token before the end of the prompt, then constrains the first token generated to have a prefix that matches the last token in the prompt. In our URL example, this would mean removing the :, and forcing generation of the first token to have a : prefix. Token healing allows users to express prompts however they wish, without worrying about token boundaries.
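To make the idea concrete, here is a minimal sketch of our own of the healing step (not guidance's actual implementation), using Hugging Face's prefix_allowed_tokens_fn to constrain only the first generated token:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("stabilityai/stablelm-base-alpha-3b")
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")

prompt_ids = tok.encode('The link is <a href="http:')
healed_ids = prompt_ids[:-1]            # back up the prompt by one token
removed = tok.decode(prompt_ids[-1:])   # the ":" string we just removed

# the first generated token must start with the removed text; after that, anything goes
allowed_first = [i for i in range(len(tok)) if tok.decode([i]).startswith(removed)]
all_tokens = list(range(len(tok)))

def allowed(batch_id, input_ids):
    return allowed_first if len(input_ids) == len(healed_ids) else all_tokens

out = model.generate(torch.tensor([healed_ids]), max_new_tokens=10,
                     do_sample=False, prefix_allowed_tokens_fn=allowed)
print(tok.decode(out[0]))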

For example, let’s re-run some of the URL examples above with token healing turned on (it’s on by default for Transformer models, so we remove token_healing=False):

# With token healing we generate valid URLs,
# even when the prompt ends with a colon:
guidance('The link is <a href="http:{{gen max_tokens=10}}')()

# With token healing, we can even generate https URLs,
# even when the prompt ends with "http":
program = guidance('''The link is <a href="http{{gen max_tokens=10}}''')
program()

Similarly, we no longer have to worry about extra spaces:

# Accidentally adding a space will not impact generation
program = guidance('''I read a book about {{gen max_tokens=5 temperature=0}}''')
program()
# This will generate the same text as above
program = guidance('''I read a book about{{gen max_tokens=6 temperature=0}}''')
program()

And we now get quoted strings even when the prompt ends with a “ [” token:

guidance('An example ["like this"] and one other example [{{gen max_tokens=10}}')()

If you are familiar with how language models are trained, you may be wondering how subword regularization fits into all this. Subword regularization is a technique where, during training, sub-optimal tokenizations are randomly introduced to increase the model's robustness, which means the model does not always see the best greedy tokenization. Subword regularization is great at helping the model become more robust to token boundaries, but it does not entirely remove the bias the model has towards the standard greedy tokenization. This means that while models may exhibit more or less token boundary bias depending on the amount of subword regularization used during training, all models still have this bias, and as shown above it can still have a strong and unexpected impact on the model output.
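For intuition, here is a tiny illustration of our own of subword regularization using a sentencepiece-based tokenizer (we use t5-small purely because it exposes sentencepiece's sampling options via sp_model_kwargs; it is not the model discussed above):

from transformers import T5Tokenizer

# enable_sampling makes sentencepiece draw a random (possibly sub-optimal) segmentation
tok = T5Tokenizer.from_pretrained(
    "t5-small",
    sp_model_kwargs={"enable_sampling": True, "nbest_size": -1, "alpha": 0.1},
)

for _ in range(3):
    print(tok.tokenize('The link is http://www.google.com'))  # segmentation varies run to run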

When you write prompts, remember that greedy tokenization can have a significant impact on how language models interpret them, particularly when the prompt ends with a token that could be extended into a longer token. This easy-to-miss source of bias can impact your results in surprising and unintended ways.

To address this, either end your prompt with a non-extendable token, or use something like guidance's "token healing" feature so you can express your prompts however you want, without worrying about token boundary artifacts.

To reproduce the results in this article yourself, check out the notebook version.
