Load-Testing LLMs Using LLMPerf


Deploying a Large Language Model (LLM) isn’t necessarily the final step in productionizing your Generative AI application. An often forgotten, yet crucial part of the MLOps lifecycle is properly load testing your LLM and ensuring it is ready to withstand your expected production traffic. Load testing, at a high level, is the practice of testing your application, or in this case your model, with the traffic it would expect in a production environment to ensure that it is performant.

In the past we’ve discussed load testing traditional ML models using open source Python tools such as Locust. Locust helps capture general performance metrics such as requests per second (RPS) and latency percentiles on a per-request basis. While this is effective with more traditional APIs and ML models, it doesn’t capture the full story for LLMs.

LLMs traditionally have a much lower RPS and higher latency than traditional ML models due to their size and larger compute requirements. In general, the RPS metric doesn’t really provide the most accurate picture either, as requests can vary greatly depending on the input to the LLM. For example, you might have one query asking to summarize a large chunk of text and another query that only requires a one-word response.

This is why tokens are seen as a much more accurate representation of an LLM’s performance. At a high level, a token is a chunk of text: whenever an LLM processes your input, it “tokenizes” the input. What exactly constitutes a token differs depending on the LLM you are using, but you can think of it, for instance, as a word, a sequence of words, or a set of characters.
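As a rough illustration, here is a minimal sketch of tokenization using the open-source tiktoken package (an OpenAI tokenizer chosen purely for demonstration; Claude and other models use their own tokenizers, so token counts will differ per model):

# Minimal tokenization illustration, assuming tiktoken is installed
# (pip install tiktoken). Other LLMs use their own tokenizers.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
text = "Who is Roger Federer?"

token_ids = encoding.encode(text)
print(f"Token count: {len(token_ids)}")
print([encoding.decode([token_id]) for token_id in token_ids])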

Image by Author

In this article we will explore how we can generate token-based metrics so we can understand how your LLM is performing from a serving/deployment perspective. After this article you’ll have an idea of how you can set up a load-testing tool specifically to benchmark different LLMs, whether you are evaluating many models, different deployment configurations, or a combination of both.

Let’s get hands-on! If you are more of a video-based learner, feel free to follow my corresponding YouTube video down below:

NOTE: This article assumes a basic understanding of Python, LLMs, and Amazon Bedrock/SageMaker. If you are new to Amazon Bedrock, please refer to my starter guide here. If you want to learn more about SageMaker JumpStart LLM deployments, refer to the video here.

DISCLAIMER: I’m a Machine Learning Architect at AWS and my opinions are my own.

Table of Contents

  1. LLM Specific Metrics
  2. LLMPerf Intro
  3. Applying LLMPerf to Amazon Bedrock
  4. Additional Resources & Conclusion

LLM-Specific Metrics

As we briefly discussed in the introduction in regards to LLM hosting, token-based metrics generally provide a much better representation of how your LLM is responding to different payload sizes or types of queries (summarization vs QnA).

Traditionally we have always tracked RPS and latency, which we will still see here, but more so at a token level. Here are some of the metrics to be aware of before we get started with load testing:

  1. Time to First Token: This is the duration it takes for the first token to be generated. This is especially handy when streaming. For instance, when using ChatGPT we start processing information as soon as the first piece of text (token) appears.
  2. Total Output Tokens Per Second: This is the total number of tokens generated per second; you can think of this as a more granular alternative to the requests per second we traditionally track.

These are the main metrics that we’ll focus on, and there are a few others, such as inter-token latency, that will also be displayed as part of the load tests. Keep in mind that the parameters that also influence these metrics include the expected input and output token size. We specifically play with these parameters to get an accurate understanding of how our LLM performs in response to different generation tasks.
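To make these two metrics concrete, below is a minimal sketch of how you could measure time to first token and output tokens per second for a single streaming request with LiteLLM (assuming LiteLLM is installed and AWS credentials for Bedrock are configured, as shown later in the setup). LLMPerf automates exactly this kind of measurement across many concurrent requests.

# Sketch: measure TTFT and approximate output tokens/sec for one streaming request.
# Assumes litellm is installed and Bedrock credentials are configured (see Setup).
import time
from litellm import completion

start = time.time()
first_token_time = None
chunk_count = 0

response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": "Summarize what load testing is."}],
    stream=True,
)

for chunk in response:
    text = chunk.choices[0].delta.content
    if text:
        if first_token_time is None:
            first_token_time = time.time() - start  # time to first token
        chunk_count += 1  # roughly one chunk per generated token (approximation)

total_time = time.time() - start
print(f"TTFT: {first_token_time:.2f}s")
print(f"Approx. output tokens/sec: {chunk_count / total_time:.2f}")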

Now let’s take a look at a tool that lets us toggle these parameters and display the relevant metrics we need.

LLMPerf Intro

LLMPerf is built on top of Ray, a popular distributed computing Python framework. LLMPerf specifically leverages Ray to create distributed load tests where we can simulate real-time production-level traffic.

Note that any load-testing tool is also only going to be able to generate your expected amount of traffic if the client machine it runs on has enough compute power to match your expected load. For instance, as you scale the concurrency or throughput expected for your model, you would also want to scale the client machine(s) where you are running your load test.

Now specifically within LLMPerf, there are a few exposed parameters that are tailored for LLM load testing, as we’ve discussed:

  • Model: This is the model provider and the hosted model that you’re working with. For our use case it will be Amazon Bedrock and Claude 3 Sonnet specifically.
  • LLM API: This is the API format in which the payload should be structured. We use LiteLLM, which provides a standardized payload structure across different model providers, thus simplifying the setup process for us, especially if we want to test different models hosted on different platforms.
  • Input Tokens: The mean input token length; you can also specify a standard deviation for this number.
  • Output Tokens: The mean output token length; you can also specify a standard deviation for this number.
  • Concurrent Requests: The number of concurrent requests for the load test to simulate.
  • Test Duration: You can control the duration of the test; this parameter is configured in seconds.

LLMPerf specifically exposes all these parameters through its token_benchmark_ray.py script, which we configure with our specific values. Let’s take a look now at how we can configure this specifically for Amazon Bedrock.

Applying LLMPerf to Amazon Bedrock

Setup

For this example we’ll be working in a SageMaker Classic Notebook Instance with a conda_python3 kernel on an ml.g5.12xlarge instance. Note that you want to select an instance that has enough compute to generate the traffic load that you want to simulate. Also make sure that you have your AWS credentials set up for LLMPerf to access the hosted model, be it on Bedrock or SageMaker.
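As a quick sanity check before running anything, you can verify that the notebook’s credentials resolve correctly; a minimal sketch with boto3 (which comes pre-installed on SageMaker Notebook Instances):

# Quick sanity check that AWS credentials are available to the notebook.
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(f"Running as: {identity['Arn']}")

# Confirm the region matches where your Bedrock model access is enabled
print(f"Region: {boto3.session.Session().region_name}")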

LiteLLM Configuration

We first configure our LLM API structure of choice, which is LiteLLM in this case. LiteLLM supports various model providers; here we configure the completion API to work with Amazon Bedrock:

import os
from litellm import completion

os.environ["AWS_ACCESS_KEY_ID"] = "Enter your access key ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "Enter your secret access key"
os.environ["AWS_REGION_NAME"] = "us-east-1"

# Call the Bedrock-hosted Claude 3 Sonnet model via LiteLLM's completion API.
# The "bedrock/" prefix tells LiteLLM to route the request to Amazon Bedrock.
response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"content": "Who is Roger Federer?", "role": "user"}]
)
output = response.choices[0].message.content
print(output)

To work with Bedrock, we configure the model ID to point towards Claude 3 Sonnet and pass in our prompt. The neat part with LiteLLM is that the messages key has a consistent format across model providers.
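For example, if you also had access set up on another provider, only the model string would change while the messages payload stays identical; a minimal sketch (the second model ID is just an illustrative example and assumes you have that provider's credentials configured):

# The same request shape works across providers with LiteLLM; only the
# model string changes. The OpenAI model below is an illustrative example.
shared_messages = [{"role": "user", "content": "Who is Roger Federer?"}]

bedrock_response = completion(
    model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0",
    messages=shared_messages,
)

# e.g. an OpenAI-hosted model, assuming OPENAI_API_KEY is set
# openai_response = completion(model="gpt-4o", messages=shared_messages)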

After executing this, we can focus on configuring LLMPerf for Bedrock specifically.

LLMPerf Bedrock Integration

To execute a load test with LLMPerf, we can simply use the provided token_benchmark_ray.py script and pass in the following parameters that we touched on earlier:

  • Input Tokens Mean & Standard Deviation
  • Output Tokens Mean & Standard Deviation
  • Max number of completed requests for the test
  • Duration of the test
  • Concurrent requests

In this case we also specify our API format to be LiteLLM, and we can execute the load test with a simple shell script like the following:

%%sh
python llmperf/token_benchmark_ray.py \
    --model bedrock/anthropic.claude-3-sonnet-20240229-v1:0 \
    --mean-input-tokens 1024 \
    --stddev-input-tokens 200 \
    --mean-output-tokens 1024 \
    --stddev-output-tokens 200 \
    --max-num-completed-requests 30 \
    --num-concurrent-requests 1 \
    --timeout 300 \
    --llm-api litellm \
    --results-dir bedrock-outputs

In this case we keep the concurrency low, but feel free to toggle this number depending on what you’re expecting in production. Our test will run for 300 seconds, and after that duration you should see an output directory with two files: one with statistics for each individual inference, and one with the mean metrics across all requests in the duration of the test.
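You can quickly confirm both result files were written before parsing them, for example:

# List the result files LLMPerf wrote to the results directory:
# an *_individual_responses.json file and a *_summary.json file.
from pathlib import Path

for path in sorted(Path("bedrock-outputs").glob("*.json")):
    print(path.name)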

We can make this look a bit neater by parsing the summary file with pandas:

import json
from pathlib import Path
import pandas as pd

# Load JSON files
individual_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_individual_responses.json")
summary_path = Path("bedrock-outputs/bedrock-anthropic-claude-3-sonnet-20240229-v1-0_1024_1024_summary.json")

with open(individual_path, "r") as f:
    individual_data = json.load(f)

with open(summary_path, "r") as f:
    summary_data = json.load(f)

# Load per-request results into a dataframe for any further analysis
df = pd.DataFrame(individual_data)

# Collect and print summary metrics
summary_metrics = {
    "Model": summary_data.get("model"),
    "Mean Input Tokens": summary_data.get("mean_input_tokens"),
    "Stddev Input Tokens": summary_data.get("stddev_input_tokens"),
    "Mean Output Tokens": summary_data.get("mean_output_tokens"),
    "Stddev Output Tokens": summary_data.get("stddev_output_tokens"),
    "Mean TTFT (s)": summary_data.get("results_ttft_s_mean"),
    "Mean Inter-token Latency (s)": summary_data.get("results_inter_token_latency_s_mean"),
    "Mean Output Throughput (tokens/s)": summary_data.get("results_mean_output_throughput_token_per_s"),
    "Accomplished Requests": summary_data.get("results_num_completed_requests"),
    "Error Rate": summary_data.get("results_error_rate")
}
print("Claude 3 Sonnet - Performance Summary:n")
for k, v in summary_metrics.items():
    print(f"{k}: {v}")

The final load test results will look something like the following:

Screenshot by Author

As we can see, the output shows the input parameters that we configured, and then the corresponding results: time to first token (in seconds) and throughput in terms of mean output tokens per second.
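Since we also loaded the individual responses into a pandas dataframe, you can go a step further and compute per-request latency percentiles. A minimal sketch is below; it assumes the per-request records include fields such as "ttft_s" and "end_to_end_latency_s", so check the keys in your own output file, as field names can vary across LLMPerf versions.

# Per-request percentiles from the individual responses dataframe.
# Field names are assumptions; inspect your own JSON output to confirm them.
latency_columns = [c for c in ("ttft_s", "end_to_end_latency_s") if c in df.columns]
if latency_columns:
    print(df[latency_columns].describe(percentiles=[0.5, 0.9, 0.99]))
else:
    print("Available fields:", list(df.columns))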

In a real-world use case, you might use LLMPerf across many different model providers and run tests across these platforms. This tool lets you holistically identify the right model and deployment stack for your use case at scale.

Additional Resources & Conclusion

The entire code for this sample can be found at the associated GitHub repository. If you also want to work with SageMaker endpoints, you can find a Llama JumpStart deployment load-testing sample here.

All in all, load testing and evaluation are both crucial to ensuring that your LLM is performant against your expected traffic before pushing to production. In future articles we’ll cover not only the evaluation portion, but also how we can create a holistic test with both components.

As always, thank you for reading, and feel free to leave any feedback and connect with me on LinkedIn and X.
