If you use LLMs to annotate or process larger datasets, chances are that you don't even realize you are wasting a lot of input tokens. As you repeatedly call an LLM to process text snippets or entire documents, your task instructions and static few-shot examples are repeated for every input example. Just like neatly stacking dishes saves space, batching inputs together can result in substantial savings.
Assume you want to tag a small document corpus of 1,000 single-page documents with instructions and few-shot examples that are about half a page long. Annotating each document individually would cost you about 1M input tokens. However, if you annotated ten documents in the same call, you'd save about 300K input tokens (or 30%) because the instructions don't have to be repeated! As we'll show in the example below, this often comes with minimal performance loss (or even a performance gain), especially when you optimize your prompt alongside.
Below I have plotted the savings, assuming that our average document length is D tokens and our instructions and few-shot examples have r*D tokens. The example scenario from the previous paragraph, where the instructions are half the length of the document (r = 0.5), appears in blue below. For longer shared instructions, our savings can be even higher:
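To make the numbers behind the plot concrete, here is a minimal sketch of the arithmetic (my own illustration, not code from the experiments below): with minibatch size B, each call spends r*D tokens on the shared instructions plus B*D tokens on the documents, so the amortized cost per document drops from D*(1 + r) to D*(1 + r/B).

def relative_savings(r: float, b: int) -> float:
    """Fraction of input tokens saved by putting b documents into one call.

    Without batching, each document costs D * (1 + r) input tokens
    (the document itself plus the shared instructions). With minibatch
    size b, the instructions are paid only once per call, so the
    amortized cost per document is D * (1 + r / b); D cancels out.
    """
    return 1 - (1 + r / b) / (1 + r)

# The scenario from above: instructions half a document long (r = 0.5),
# ten documents per call -> roughly 30% fewer input tokens.
print(f"{relative_savings(0.5, 10):.0%}")  # 30%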
The important takeaways are:
- Even with relatively short instructions (blue line), there is value in minibatching.
- There is no need to use very large minibatch sizes; most of the savings can be obtained with moderate sizes (B ≤ 10).
Let's get practical with a task where we want to categorize pieces of text for further analysis. We'll use a fun task from the Natural-Instructions benchmark where we need to annotate sentences in debates with one of four categories (value, fact, testimony, or policy).
Looking at an example, we see that we get the debate topic for context and then need to categorize the sentence in question:
{
  "input": {
    "topic": "the fight for justice,equality,peaceand love is futile",
    "sentence": "What matters is what I'm personally doing to be certain that I'm filling the cup!"
  },
  "output": "Value"
}
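To make the batching format concrete, here is an illustrative sketch (my own, not SAMMO's exact serialization) of how several inputs can be packed into one call as a JSON list, with the model asked to return one label per item:

import json

# Several documents that share one copy of the instructions and few-shot
# examples in a single LLM call (data abbreviated / made up for illustration).
minibatch = [
    {"id": 0, "topic": "the fight for justice...", "sentence": "What matters is what I'm personally doing ..."},
    {"id": 1, "topic": "school uniforms", "sentence": "Uniforms cost parents extra money every year."},
]

payload = json.dumps(minibatch, indent=2)
# The model is instructed to answer in the same format, e.g.
# [{"id": 0, "output": "Value"}, {"id": 1, "output": "Fact"}]  (labels illustrative)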
One question we haven't answered yet:
How do we pick the right minibatch size?
Previous work has shown that the best minibatch size depends on the task as well as the model. We essentially have two options:
- We pick a reasonable minibatch size, let's say 5, and hope that we don't see any drops.
- We optimize the minibatch size together with other choices, e.g., the number of few-shot examples.
As you might have guessed, we'll pursue option 2 here. To run our experiments, we'll use SAMMO, an open-source framework for LLM calling and prompt optimization.
Prompts are written in SAMMO as prompt programs (which are simply nested Python classes that are called with input data). We'll structure our task into three sections and format our minibatches as JSON.
# Imports assume SAMMO's standard module layout.
from sammo.components import Output
from sammo.dataformatters import JSONDataFormatter
from sammo.instructions import FewshotExamples, InputData, MetaPrompt, Section


def prompt_program(fewshot_data, n_fewshot_examples=5, minibatch_size=1):
    # `task` is the loaded Natural-Instructions task file; its "Definition"
    # field contains the annotation instructions.
    return Output(
        MetaPrompt(
            [
                Section("Instructions", task["Definition"]),
                Section(
                    "Examples",
                    FewshotExamples(fewshot_data, n_fewshot_examples),
                ),
                Section("Output in same format as above", InputData()),
            ],
            data_formatter=JSONDataFormatter(),
            render_as="markdown",
        ).with_extractor(on_error="empty_result"),
        minibatch_size=minibatch_size,
        on_error="empty_result",
    )
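One detail worth spelling out: with minibatching, the model's answer has to be parsed and matched back to the individual inputs, which is why the extractor above falls back to empty results on errors. The following is a rough plain-Python sketch of that bookkeeping, not SAMMO's internal logic; call_llm is a hypothetical placeholder for your LLM client:

import json

def annotate_minibatch(call_llm, rendered_prompt, batch):
    """Send one minibatch and map the answers back to its items.

    `call_llm` is a hypothetical function that takes a prompt string and
    returns the raw model response. If the response cannot be parsed as
    JSON, every item in the batch gets an empty result (mirroring the
    on_error="empty_result" behavior above) and is counted as wrong.
    """
    response = call_llm(rendered_prompt + "\n" + json.dumps(batch))
    try:
        parsed = json.loads(response)
        return [item.get("output", "") for item in parsed]
    except (json.JSONDecodeError, AttributeError, TypeError):
        return [""] * len(batch)  # parse error -> empty results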
Running this without minibatching and with five few-shot examples, we get an accuracy of 0.76 and pay 58,255 input tokens.
Let's now explore how minibatching affects costs and performance. Since minibatching reduces the total input costs, we can now use some of those savings to add more few-shot examples! We can study these trade-offs by setting up a search space in SAMMO:
from sammo import search_op  # SAMMO's search-space operators (one_of etc.)

def search_space(fewshot_data):
    minibatch_size = search_op.one_of([1, 5, 10], name="minibatch_size")
    n_fewshot_examples = search_op.one_of([5, 20], name="n_fewshot")
    return prompt_program(fewshot_data, n_fewshot_examples, minibatch_size)
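Conceptually, the search simply enumerates the full grid spanned by these two operators; a plain-Python illustration (not how SAMMO executes the search) of the six configurations evaluated below:

import itertools

# The 3 x 2 grid spanned by the search space above.
for minibatch_size, n_fewshot in itertools.product([1, 5, 10], [5, 20]):
    print({"minibatch_size": minibatch_size, "n_fewshot": n_fewshot})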
Running this shows us the full gamut of trade-offs:
setting objective costs parse_errors
--------------------------------------- ----------- --------------------------------- --------------
* {'minibatch_size': 1, 'n_fewshot': 5} 0.76 {'input': 58255, 'output': 5817} 0.0
{'minibatch_size': 1, 'n_fewshot': 20} 0.76 {'input': 133355, 'output': 6234} 0.0
{'minibatch_size': 5, 'n_fewshot': 5} 0.75 {'input': 15297, 'output': 5695} 0.0
{'minibatch_size': 5, 'n_fewshot': 20} 0.77 {'input': 30317, 'output': 5524} 0.0
{'minibatch_size': 10, 'n_fewshot': 5} 0.73 {'input': 9928, 'output': 5633} 0.0
* {'minibatch_size': 10, 'n_fewshot': 20} 0.77 {'input': 17438, 'output': 5432} 0.0
So, even with 20 few-shot examples, we save nearly 70% in input costs ((58255 - 17438) / 58255), all while maintaining overall accuracy! As an exercise, you can implement your own objective to automatically factor in costs, or include different ways of formatting the data in the search space.
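As a starting point for that exercise, here is a minimal sketch of a cost-aware objective in plain Python (the function and its trade-off parameter are my own assumptions, not SAMMO's objective interface):

def cost_aware_objective(accuracy, input_tokens, tokens_per_point=100_000):
    """Score a configuration as accuracy minus a token penalty.

    `tokens_per_point` is an assumed exchange rate: how many input tokens
    we are willing to spend for one extra accuracy point (0.01).
    """
    return accuracy - 0.01 * input_tokens / tokens_per_point

# Comparing two rows of the table above under this objective:
baseline = cost_aware_objective(0.76, 58255)   # minibatch_size=1,  n_fewshot=5
batched = cost_aware_objective(0.77, 17438)    # minibatch_size=10, n_fewshot=20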
Implicit in all of this is that (i) we have enough input examples that share the same instructions and (ii) we have some flexibility regarding latency. The first assumption is met in many annotation scenarios, but obviously doesn't hold for one-off queries. In annotation or other offline processing tasks, latency is also not critical, since throughput matters most. However, if your task is to give a user an answer as quickly as possible, it might make more sense to issue B parallel calls than one call with B input examples.
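For that latency-sensitive case, a minimal sketch of the parallel alternative (with a hypothetical annotate_one coroutine standing in for your actual LLM client) could look like this:

import asyncio

async def annotate_one(document):
    """Hypothetical single-document LLM call; replace with your client."""
    await asyncio.sleep(0)  # placeholder for the network round-trip
    return "Value"

async def annotate_parallel(documents):
    # B parallel calls: lowest latency per document, but the shared
    # instructions are sent (and billed) once per call.
    return await asyncio.gather(*(annotate_one(d) for d in documents))

# labels = asyncio.run(annotate_parallel(batch_of_documents))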