Building applications with LLMs requires considering more than just quality: for many use cases, speed and price are equally or more important.
For consumer applications and chat experiences, speed and responsiveness are critical to user engagement. Users expect near-instant responses, and delays can directly lead to reduced engagement. When building more complex applications involving tool use or agentic systems, speed and cost become even more important, and can become the limiting factor on overall system capability. The time taken by sequential requests to LLMs can quickly stack up for each user request, adding to the cost.
This is why Artificial Analysis (@ArtificialAnlys) developed a leaderboard evaluating price, speed and quality across >100 serverless LLM API endpoints, now coming to Hugging Face.
Find the leaderboard here!
The LLM Performance Leaderboard
The LLM Performance Leaderboard aims to provide comprehensive metrics to help AI engineers make decisions on which LLMs (both open & proprietary) and API providers to use in AI-enabled applications.
When making decisions regarding which AI technologies to use, engineers need to consider quality, price and speed (latency & throughput). The LLM Performance Leaderboard brings all three together to enable decision making in one place, across both proprietary & open models.
Source: LLM Performance Leaderboard
Metric coverage
The metrics reported are:
- Quality: a simplified index for comparing model quality and accuracy, calculated based on metrics such as MMLU, MT-Bench and HumanEval scores, as reported by the model authors, and Chatbot Arena rating.
- Context window: the maximum number of tokens an LLM can work with at any one time (including both input and output tokens).
- Pricing: the prices charged by a provider to query the model for inference. We report input/output per-token pricing, as well as “blended” pricing to compare hosting providers with a single metric. We blend input and output pricing at a 3:1 ratio (i.e., an assumption that the input is 3x longer than the output); a small worked example follows below.
- Throughput: how fast an endpoint outputs tokens during inference, measured in tokens per second (also known as tokens/s or “TPS”). We report the median, P5, P25, P75 and P95 values measured over the prior 14 days.
- Latency: how long the endpoint takes to respond after the request has been sent, known as Time to First Token (“TTFT”) and measured in seconds. We report the median, P5, P25, P75 and P95 values measured over the prior 14 days.
For further definitions, see our full methodology page.
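As an illustration of the blended pricing metric, here is a minimal sketch of the 3:1 weighting described above; the function name and example prices are hypothetical, not taken from the leaderboard.

```python
def blended_price(input_price: float, output_price: float) -> float:
    """Blend input and output per-token prices at a 3:1 ratio,
    i.e. assuming the input is 3x longer than the output."""
    return (3 * input_price + output_price) / 4

# Hypothetical prices in USD per 1M tokens, for illustration only.
print(blended_price(input_price=0.50, output_price=1.50))  # -> 0.75
```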
Test Workloads
The leaderboard allows exploration of performance under several different workloads (6 combinations in total):
- varying the prompt length: ~100 tokens, ~1k tokens, ~10k tokens.
- running parallel queries: 1 query, 10 parallel queries (a minimal sketch of this workload grid follows below).
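This is a rough sketch of how such a workload grid could be exercised; it is not the leaderboard's actual test harness, `query_endpoint` is a hypothetical stand-in for a real API call, and the word-based prompt is only an approximation of token counts.

```python
import asyncio

# Workload grid mirroring the list above: 3 prompt lengths x 2 concurrency levels.
PROMPT_TOKENS = [100, 1_000, 10_000]   # approximate prompt lengths in tokens
PARALLEL_QUERIES = [1, 10]             # single query vs. 10 parallel queries

async def query_endpoint(prompt: str) -> str:
    """Hypothetical stand-in for a real request to the endpoint under test."""
    await asyncio.sleep(0.1)  # simulate network + generation time
    return "response"

async def run_workload(prompt: str, parallel: int) -> list[str]:
    # Fire `parallel` identical requests concurrently and wait for all of them.
    return await asyncio.gather(*(query_endpoint(prompt) for _ in range(parallel)))

async def main() -> None:
    for n_tokens in PROMPT_TOKENS:
        prompt = "word " * n_tokens  # crude stand-in for a prompt of ~n_tokens tokens
        for parallel in PARALLEL_QUERIES:
            await run_workload(prompt, parallel)

asyncio.run(main())
```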
Methodology
We test every API endpoint on the leaderboard 8 times per day, and leaderboard figures represent the median measurement of the last 14 days. We also have percentile breakdowns within the collapsed tabs.
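For illustration, a single measurement of TTFT and output speed could be taken roughly as follows. This is a simplified sketch, not the actual benchmarking harness: it assumes an OpenAI-compatible streaming endpoint via the openai Python client, and it counts streamed chunks as a rough proxy for tokens.

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key in the environment

def measure_once(model: str, prompt: str) -> tuple[float, float]:
    """Return (ttft_seconds, approx_tokens_per_second) for one streamed request."""
    start = time.perf_counter()
    ttft = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token (approximate)
        if chunk.choices and chunk.choices[0].delta.content:
            chunks += 1  # counting content chunks as a rough proxy for output tokens
    total = time.perf_counter() - start
    return ttft, chunks / max(total - (ttft or 0.0), 1e-9)

# Repeated runs (e.g. 8 per day over 14 days) can then be aggregated into medians and percentiles.
```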
Quality metrics are currently collected on a per-model basis and show results as reported by model creators, but watch this space as we start to share results from our independent quality evaluations across each endpoint.
For further definitions, see our full methodology page.
Highlights (May 2024, see the leaderboard for the most recent)
- The language models market has exploded in complexity over the past year. Launches that have shaken up the market just within the last two months include proprietary models like Anthropic’s Claude 3 series and open models such as Databricks’ DBRX, Cohere’s Command R Plus, Google’s Gemma, Microsoft’s Phi-3, Mistral’s Mixtral 8x22B and Meta’s Llama 3.
- Price and speed vary considerably between models and providers. From Claude 3 Opus to Llama 3 8B, there is a 300x pricing spread – that is more than two orders of magnitude!
- API providers have increased the speed of launching models. Within 48 hours, 7 providers were offering the Llama 3 models, speaking to the demand for new, open-source models and the competitive dynamics between API providers.
- Key models to focus on across quality segments:
- High quality, typically higher price & slower: GPT-4 Turbo and Claude 3 Opus
- Moderate quality, price & speed: Llama 3 70B, Mixtral 8x22B, Command R+, Gemini 1.5 Pro, DBRX
- Lower quality, but with much faster speed and lower pricing available: Llama 3 8B, Claude 3 Haiku, Mixtral 8x7B
Our chart of Quality vs. Throughput (tokens/s) shows the range of options with different quality and performance characteristics.
Source: artificialanalysis.ai/models
Use Case Example: Speed and Price can be as important as Quality
In some cases, design patterns involving multiple requests with faster and cheaper models can result in not only lower cost but better overall system quality compared with using a single larger model.
For example, consider a chatbot that needs to browse the web to find relevant information from recent news articles. One approach would be to use a large, high-quality model like GPT-4 Turbo to run a search, then read and process the top handful of articles. Another would be to use a smaller, faster model like Llama 3 8B to read and extract highlights from dozens of web pages in parallel, and then use GPT-4 Turbo to assess and summarize the most relevant results. The second approach will be more economical, even after accounting for reading 10x more content, and may result in higher quality results.
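A rough sketch of the second pattern follows, assuming an OpenAI-compatible API that serves both models; the model identifiers, prompts and the assumption that article text has already been fetched are all illustrative, not a prescribed implementation.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes an OpenAI-compatible endpoint serving both models

async def extract_highlights(article_text: str) -> str:
    # Stage 1: a small, fast, cheap model reads one article and pulls out the key points.
    response = await client.chat.completions.create(
        model="llama-3-8b-instruct",  # illustrative model identifier
        messages=[{"role": "user", "content": f"Extract the key points from this article:\n\n{article_text}"}],
    )
    return response.choices[0].message.content

async def answer_from_articles(question: str, articles: list[str]) -> str:
    # Read dozens of articles in parallel with the small model.
    highlights = await asyncio.gather(*(extract_highlights(a) for a in articles))
    # Stage 2: a larger model assesses and summarizes only the extracted highlights.
    response = await client.chat.completions.create(
        model="gpt-4-turbo",  # illustrative model identifier
        messages=[{
            "role": "user",
            "content": f"Question: {question}\n\nNotes from articles:\n\n" + "\n\n".join(highlights),
        }],
    )
    return response.choices[0].message.content
```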
Get in touch
Please follow us on Twitter and LinkedIn for updates. We’re available via message on either, as well as on our website and via email.


