Benchmarking Text Generation Inference

By Derek Thomas

In this blog we will explore Text Generation Inference's (TGI) little brother, the TGI Benchmarking tool. It will help us understand how to profile TGI beyond simple throughput, so we can better understand the tradeoffs and make decisions on how to tune your deployment for your needs. If you have ever felt like LLM deployments cost too much, or if you want to tune your deployment to improve performance, this blog is for you!

I'll show you how to do this in a convenient Hugging Face Space. You can take the results and apply them to an Inference Endpoint or another copy of the same hardware.



Motivation

To get a better understanding of the need to profile, let's discuss some background information first.

Large Language Models (LLMs) are fundamentally inefficient. Because of the way decoders work, generation requires a new forward pass for each decoded token. As LLMs increase in size and adoption rates surge across enterprises, the AI industry has done a great job of creating new optimizations and performance-enhancing techniques.

There have been dozens of improvements in many aspects of serving LLMs. We have seen Flash Attention, Paged Attention, streaming responses, improvements in batching, speculation, quantization of many kinds, improvements in web servers, adoption of faster languages (sorry Python 🐍), and many more. There are also use-case improvements like structured generation and watermarking that now have a place in the LLM inference world. The problem is that fast and efficient implementations require increasingly niche skills to implement [1].

Text Generation Inference is a high-performance LLM inference server from Hugging Face designed to embrace and develop the latest techniques for improving the deployment and consumption of LLMs. Thanks to Hugging Face's open-source partnerships, most (if not all) major open-source LLMs are available in TGI on release day.

Users often have very different needs depending on their use-case requirements. Consider the prompt and generation in a RAG use-case:

  • Instructions/formatting
    • usually short, <200 tokens
  • The user query
    • usually short, <200 tokens
  • Multiple documents
    • medium-sized, 500-1000 tokens per document
    • N documents, where N<10
  • An answer in the output
    • medium-sized, ~500-1000 tokens

In RAG it is important to have the right document in order to get a high-quality response; you increase this likelihood by increasing N to include more documents. This means RAG will often try to max out an LLM's context window to increase task performance. In contrast, think about basic chat. Typical chat scenarios have significantly fewer tokens than RAG (a rough token-budget comparison is sketched after the list):

  • Multiple turns
    • 2 x T x 50-200 tokens, for T turns
    • The 2x accounts for both the User and the Assistant message in each turn
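
To make the contrast concrete, here is a minimal back-of-the-envelope sketch comparing the two prompt budgets, using midpoints of the ranges above (all numbers are illustrative assumptions, not measurements):

```python
# Rough token-budget comparison for the two scenarios above.
# All numbers are illustrative midpoints of the ranges in the lists, not measurements.

def rag_prompt_tokens(n_docs: int = 5) -> int:
    instructions = 150        # usually short, <200 tokens
    user_query = 150          # usually short, <200 tokens
    documents = n_docs * 750  # ~500-1000 tokens per document
    return instructions + user_query + documents

def chat_prompt_tokens(turns: int = 5) -> int:
    # 2 x T x ~125 tokens: one User and one Assistant message per turn
    return 2 * turns * 125

print(f"RAG prompt (5 docs):   ~{rag_prompt_tokens()} tokens")   # ~4050 tokens
print(f"Chat prompt (5 turns): ~{chat_prompt_tokens()} tokens")  # ~1250 tokens
```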

Given that we have such different scenarios, we need to make sure we configure our LLM server accordingly, depending on which one is more relevant. Hugging Face has a benchmarking tool that can help us explore which configurations make the most sense, and I'll explain how you can do this on a Hugging Face Space.



Prerequisites

Let's make sure we have a common understanding of a few key concepts before we dive into the tool.



Latency vs Throughput

Figure 1: Latency vs Throughput Visualization
  • Token Latency – The amount of time it takes for 1 token to be processed and sent to the user
  • Request Latency – The amount of time it takes to fully respond to a request
  • Time to First Token – The amount of time from the initial request until the first token reaches the user. This is a combination of the time to process the prefill input plus a single generated token
  • Throughput – The number of tokens the server can return in a set amount of time (4 tokens per second in the figure)

Latency is a tricky measurement because on its own it doesn't tell you the whole picture. You might have a long generation or a short one, which by itself won't tell you much about your actual server performance.

It's important to understand that throughput and latency are orthogonal measurements, and depending on how we configure our server, we can optimize for one or the other. Our benchmarking tool will help us understand the trade-off via a data visualization.
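
As a quick illustration of how these definitions fit together, here is a minimal sketch that derives each metric from the per-token arrival times of a single streamed request (the timestamps are made up for the example):

```python
# Compute the metrics above from per-token arrival times of one streamed request.
# Timestamps (seconds since the request was sent) are made up for illustration.
token_arrival_times = [0.25, 0.30, 0.35, 0.40, 0.45, 0.50]  # 6 generated tokens

time_to_first_token = token_arrival_times[0]          # prefill + 1 generated token
request_latency = token_arrival_times[-1]             # time to fully respond
token_latencies = [
    t1 - t0 for t0, t1 in zip(token_arrival_times, token_arrival_times[1:])
]
throughput = len(token_arrival_times) / request_latency  # tokens per second

print(f"TTFT:               {time_to_first_token:.2f} s")
print(f"Request latency:    {request_latency:.2f} s")
print(f"Mean token latency: {sum(token_latencies) / len(token_latencies):.3f} s")
print(f"Throughput:         {throughput:.1f} tokens/s")
```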



Pre-filling and Decoding

Figure 2: Prefilling vs Decoding, inspired by [2]

Here's a simplified view of how an LLM generates text. The model (typically) generates a single token per forward pass. In the pre-filling stage, shown in orange, the entire prompt ("What is ... of the US?") is sent to the model and one token ("Washington") is generated. In the decoding stage, shown in blue, the generated token is appended to the previous input, and this ("... the capital of the US? Washington") is sent through the model for another forward pass. This process continues until the model generates the end-of-sequence token (EOS): send the input through the model, generate a token, append the token to the input.

Thinking Question: Why does pre-filling only take 1 forward pass when we are submitting multiple unseen tokens as input?

Click to reveal the answer

We don't need to generate what comes after "What is the". We already know it's "capital" from the user's prompt.

I only included a short example for illustration purposes, but keep in mind that pre-filling only needs 1 forward pass through the model, while decoding can take hundreds or more. Even in our short example we can see more blue arrows than orange ones. We can now see why it takes so much time to get output from an LLM! Decoding is usually where we spend more time due to the many passes.
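
In code, the whole process looks roughly like the sketch below. The `forward` function is a hypothetical stand-in for a single model forward pass (not a real TGI or transformers API); the point is simply that pre-filling is one call, while decoding is a loop of calls:

```python
# Minimal sketch of greedy generation: one prefill pass, then a decode loop.
# `forward` is a hypothetical stand-in for a single model forward pass that
# returns the id of the next token given all tokens seen so far.

def generate(forward, prompt_ids: list[int], eos_id: int, max_new_tokens: int = 512) -> list[int]:
    # Pre-filling: the whole prompt goes through the model in ONE forward pass,
    # producing the first generated token (and, in a real server, the KV cache).
    tokens = list(prompt_ids)
    next_token = forward(tokens)
    tokens.append(next_token)

    # Decoding: one forward pass PER generated token until EOS (or a limit).
    while next_token != eos_id and len(tokens) - len(prompt_ids) < max_new_tokens:
        next_token = forward(tokens)
        tokens.append(next_token)

    return tokens[len(prompt_ids):]  # only the generated tokens
```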



Benchmarking Tool



Motivation

We have all seen comparisons of tools, new algorithms, or models that show throughput. While that is an important part of the LLM inference story, it's missing some key information. At a minimum (you can of course go more in-depth), we need to know both the throughput AND the latency to make good decisions. One of the primary advantages of the TGI benchmarking tool is that it has this capability.

Another important line of thought is considering what experience you want the user to have. Do you care more about serving many users, or do you want each user, once engaged with your system, to have a fast response? Do you want a better Time To First Token (TTFT), or do you want blazing-fast tokens to appear once users get their first token, even if that first one is delayed?

Here are some ideas on how that might play out. Remember, there is no free lunch. But with enough GPUs and the right configuration, you can have almost any meal you want.

I care about… → I should focus on…

  • Handling more users → Maximizing throughput
  • People not navigating away from my page/app → Minimizing TTFT
  • User experience for a moderate number of users → Minimizing latency
  • A well-rounded experience → Capping latency and maximizing throughput



Setup

The benchmarking tool is installed with TGI, but you need access to the server to run it. With that in mind, I've provided the space derek-thomas/tgi-benchmark-space, which combines a TGI docker image (pinned to latest) with a JupyterLab working space. It's designed to be duplicated, so don't be alarmed if it's sleeping. It will let us deploy a model of our choosing and easily run the benchmarking tool via a CLI. I've added some notebooks that will allow you to easily follow along. Feel free to dive into the Dockerfile to get a feel for how it's built, especially if you want to tweak it.



Getting Started

Please note that it's much better to run the benchmarking tool in a JupyterLab terminal rather than a notebook due to its interactive nature, but I'll put the commands in a notebook so I can annotate them and they are easy to follow along.

  1. Click: Duplicate Space
    • Set your default password in the JUPYTER_TOKEN space secret (it should prompt you upon duplication)
    • Select your HW; note that it should mirror the HW you want to deploy on
  2. Go to your space and log in with your password
  3. Launch 01_1_TGI-launcher.ipynb
    • This will launch TGI with default settings from the jupyter notebook
  4. Launch 01_2_TGI-benchmark.ipynb
    • This will launch the TGI benchmarking tool with some demo settings (a rough sketch of what these notebooks run is shown below)
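
Under the hood, the two notebooks essentially start the TGI server and then run the benchmark CLI against it. The sketch below is a hedged approximation of those two commands wrapped in Python; the model id, the flag values, and even the exact set of flags used by the notebooks are assumptions for illustration, so check the notebooks in the space for the real settings:

```python
# Hedged approximation of what the two notebooks run, wrapped in Python.
# The model id, flag values, and exact flag set are assumptions for illustration;
# the notebooks in the space contain the real commands.
import subprocess

MODEL = "<model-id-of-your-choice>"  # placeholder, set to the model you deploy

# 01_1_TGI-launcher.ipynb: start the TGI server and keep it running.
launcher = subprocess.Popen([
    "text-generation-launcher",
    "--model-id", MODEL,
    "--port", "1337",
])
# (In practice, wait until the server reports it is ready before benchmarking.)

# 01_2_TGI-benchmark.ipynb: run the interactive benchmark against that server.
subprocess.run([
    "text-generation-benchmark",
    "--tokenizer-name", MODEL,
    "--sequence-length", "3000",   # prompt length to test (e.g. RAG-like prefill)
    "--decode-length", "300",      # tokens to generate per request
    "--batch-size", "1",
    "--batch-size", "32",          # batch sizes to sweep over
    "--runs", "10",                # runs per batch size for the stats/histograms
])
```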



Essential Components

Benchmarking Tool Numbered
Figure 3: Benchmarking Tool Components
  • Component 1: Batch selector and other information
    • Use your arrow keys to select different batch sizes
  • Component 2 and Component 4: Pre-fill and Decode stats and histograms
    • The calculated stats/histograms are based on the number of --runs
  • Component 3 and Component 5: Pre-fill and Decode Throughput vs Latency scatter plots
    • The X-axis is latency (smaller is better)
    • The Y-axis is throughput (larger is better)
    • The legend shows the batch size
    • An “ideal” point would be in the top left corner (low latency and high throughput)



Understanding the Benchmarking tool

Figure 4: Benchmarking Tool Charts

If you used the same HW and settings I did, you should get a chart quite similar to Figure 4. The benchmarking tool shows us the throughput and latency for different batch sizes (numbers of user requests, slightly different from the terminology used when launching TGI), given the current settings and HW used when we launched TGI. This is important to understand, as we should update the settings for how we launch TGI based on our findings with the benchmarking tool.

The chart in Component 3 tends to get more interesting as we get longer pre-fills, like in RAG. Pre-fill does impact TTFT (shown on the X-axis), which is a big part of the user experience. Remember that we get to push our input tokens through in a single forward pass, even though we do have to build the KV cache from scratch. So per token it tends to be faster than decoding in many cases.

The chart in Component 5 covers decoding. Let's take a look at the shape the data points make. We can see that for batch sizes of 1-32 the shape is mostly vertical at ~5.3s. This is really good. It means that with no degradation in latency we can improve throughput significantly! What happens at 64 and 128? We can see that while our throughput is increasing, we are starting to trade off latency.

For these same values, let's look at what is happening on the chart in Component 3. For batch size 32 we can see that our TTFT is still about 1 second. But we start to see linear growth from 32 -> 64 -> 128: 2x the batch size gives 2x the latency. Furthermore, there is no throughput gain! This means we don't really get much benefit from the tradeoff.

Thinking Questions:

  • What kinds of shapes would you expect these curves to take if we add more points?
  • How would you expect these curves to change if you have more tokens (pre-fill or decoding)?

If your batch size is in a vertical region, that is great: you can get more throughput and handle more users for free. If your batch size is in a horizontal region, this means you are compute bound and increasing users just delays everyone with no benefit in throughput. You should improve your TGI configuration or scale your hardware. A small sketch of this vertical-vs-horizontal check is shown below.
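
To make the vertical-vs-horizontal reading concrete, here is a minimal sketch that classifies each step up in batch size using hypothetical (batch size, decode latency, throughput) points, loosely shaped like the ones discussed above; the numbers are invented for illustration, not taken from the tool:

```python
# Classify each step up in batch size as "free throughput" (vertical region)
# or "latency tradeoff" (horizontal region). Numbers are invented for illustration.
points = [
    # (batch_size, decode_latency_s, throughput_tok_per_s)
    (1,   5.3,   30),
    (8,   5.3,  240),
    (32,  5.4,  950),
    (64,  7.5, 1300),
    (128, 14.8, 1350),
]

for (b0, lat0, thr0), (b1, lat1, thr1) in zip(points, points[1:]):
    lat_growth = lat1 / lat0
    thr_growth = thr1 / thr0
    if lat_growth < 1.1 and thr_growth > 1.5:
        verdict = "vertical region: ~free throughput, scale the batch up"
    elif thr_growth < 1.1:
        verdict = "horizontal region: compute bound, latency only gets worse"
    else:
        verdict = "in between: throughput still grows but latency is traded off"
    print(f"batch {b0} -> {b1}: latency x{lat_growth:.2f}, "
          f"throughput x{thr_growth:.2f} ({verdict})")
```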

Now that we have learned a bit about TGI's behavior in various scenarios, we can try different settings for TGI and benchmark again. It's good to go through this cycle a few times before settling on a good configuration. If there's enough interest, maybe we will have a part 2 which dives into optimizing for a use-case like chat or RAG.



Winding Down

It is important to keep track of actual user behavior. When we estimate user behavior we have to start somewhere and make educated guesses. These number choices have a big effect on how we are able to profile. Luckily, TGI can tell us this information in its logs, so be sure to check that out as well.

Once you're done with your exploration, be sure to stop running everything so you won't incur further charges.

  • Kill the running cell in the TGI-launcher.ipynb jupyter notebook
  • Hit q in the terminal to stop the benchmarking tool
  • Hit pause in the settings of the space



Conclusion

LLMs are bulky and expensive, but there are many ways to reduce that cost. LLM inference servers like TGI have done most of the work for us, as long as we leverage their capabilities properly. The first step is to understand what is happening and what trade-offs you can make. We have seen how to do this with the TGI Benchmarking tool. We can take these results and use them on any equivalent HW in AWS, GCP, or Inference Endpoints.

Thanks to Nicolas Patry and Olivier Dehaene for creating TGI and its benchmarking tool. Also special thanks to Nicolas Patry, Moritz Laurer, Nicholas Broad, Diego Maniloff, and Erik Rignér for their very helpful proofreading.



References

[1] Sara Hooker, The Hardware Lottery, 2020

[2] Pierre Lienhart, LLM Inference Series: 2. The two-phase process behind LLMs' responses, 2023


