Insights from the Open LLM Leaderboard




Since June 2024, we have evaluated more than 3,000 models on the Open LLM Leaderboard, a global ranking of open language model performance. Although we try to run evaluations without wasting resources (we use the spare cycles of our cluster, in other words the GPUs that are active but waiting between jobs), this still represents quite a large amount of energy spent on model inference!

In the last year, people have become more and more aware that using large language models (LLMs) to generate text has a significant environmental impact, beyond the already substantial impact of training. Recent research (see the Towards Greener LLMs paper) highlights the challenges of managing resources efficiently at inference due to dynamic and diverse workloads.

By integrating carbon emission estimates into the Open LLM Leaderboard, we aim to give users transparency about the carbon impact of various model evaluations and hopefully encourage model creators to balance performance with environmental responsibility.

We were curious to explore the CO₂ emissions associated with model inference and to identify any emerging trends in the data. Along the way, we saw a few predictable patterns but also discovered some surprising insights, such as community fine-tunes generally being more carbon-efficient!

💡 Note: We’ve created a Colab notebook containing all the data and findings discussed here. This notebook allows you to explore the results, run the analyses yourself, and even adapt the code to investigate further questions.



Computing CO₂ cost

Let’s take a closer look at how we calculate the CO₂ emissions generated during model inference.

In our case, we use a simple heuristic, because all evaluations are run on the same hardware using the same method (loading the model with Transformers and Accelerate, using a mix of pipeline parallelism and data parallelism to make the most of our 8 GPUs per node). It involves factoring in:

  • Evaluation time.
  • Energy usage based on the power consumption of our cluster’s hardware.
  • Carbon intensity of the electricity source powering our hardware.

A detailed explanation and formula can be found in our documentation.
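To make the heuristic concrete, here is a minimal sketch in Python of how such an estimate can be computed. The power draw, PUE, and carbon-intensity values below are illustrative placeholders, not the exact constants used by the leaderboard; refer to the documentation for the real formula.

```python
# Minimal sketch of a CO₂ estimate for one evaluation run.
# All constants below are illustrative assumptions, not the leaderboard's exact values.

def estimate_co2_kg(
    eval_hours: float,              # wall-clock evaluation time for the model
    gpu_power_kw: float = 0.7,      # assumed average power draw per GPU, in kW
    num_gpus: int = 8,              # GPUs used per node
    pue: float = 1.2,               # assumed datacenter power usage effectiveness
    carbon_intensity: float = 0.4,  # assumed grid carbon intensity, in kg CO₂e per kWh
) -> float:
    """Estimate the kg of CO₂e emitted by a single evaluation run."""
    energy_kwh = eval_hours * gpu_power_kw * num_gpus * pue
    return energy_kwh * carbon_intensity

# Example: a 3-hour evaluation on one 8-GPU node
print(f"{estimate_co2_kg(3.0):.2f} kg CO₂e")
```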

Caveat: This doesn’t mean that model X emits Y CO₂ at inference in general!
Instead, what it means is that model X emitted Y CO₂ on our very specific inference setup, and you can still learn a lot from that 😀



General Trends

Since we wanted to look at general trends, we only considered the most frequent model architectures, and models for which we had the parameter count.

We therefore looked at 2,742 models from some recent families: Gemma/Gemma2, all generations of Llama, Mistral, Mixtral, as well as Phi/Phi3, and Qwen2 and above. We also included older model families such as GPT, GPT-NeoX, and T5.
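If you want to reproduce this kind of filtering on a leaderboard export (as in the Colab notebook), a sketch with pandas might look like the following; the file name and column names here are hypothetical and should be adapted to the actual dataset.

```python
import pandas as pd

# Hypothetical file and column names ("architecture", "params_billions");
# adapt them to the actual export used in the Colab notebook.
df = pd.read_csv("open_llm_leaderboard_results.csv")

recent_families = ["Gemma", "Gemma2", "Llama", "Mistral", "Mixtral", "Phi", "Phi3", "Qwen2"]
older_families = ["GPT", "GPT-NeoX", "T5"]

# Keep only the most frequent architectures for which the parameter count is known.
subset = df[
    df["architecture"].isin(recent_families + older_families)
    & df["params_billions"].notna()
]
print(len(subset), "models kept for the trend analysis")
```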



“Official Providers” Models

Official models come from high-quality, trusted model creators, such as research groups or community consortiums (EleutherAI, NousResearch), FAANG companies (Google, Meta, Alibaba…), startups (MistralAI, 01.AI), etc., who have taken the time and compute to create new high-quality models. They represent 341 models.

official_providers_models.png

  • As expected, overall, the larger the model, the higher the CO₂ cost. However, the increase in leaderboard score isn’t always proportional, leading to diminishing returns.
    • Models from AbacusAI, Qwen, and AllenAI around 70B parameters achieve an average leaderboard score above 40 across multiple evaluation benchmarks.
    • On the other hand, the lowest-ranked models in the top-right quadrant are older models: the Qwen-1.5-110B models, with Mixtral-8x22B showing the weakest performance.
    • Overall, MoEs appear to have a relatively poor leaderboard score-to-emission ratio. Although these models aim to reduce computational overhead by activating only a subset of their parameters for a given task, some exhibit higher-than-expected CO₂ emissions due to extremely long inference times.
  • Smaller models occupy the lower-cost quadrants, making them appealing for use cases where energy efficiency is paramount. Among these, the Qwen-2.5-14B and Phi-3-Medium models appear to have the best leaderboard score-to-emission ratio.
  • Instruction-tuned models often outperform their base versions on the leaderboard. However, certain instruction-tuned models can be exceedingly verbose, which inflates both inference time and energy consumption during our generative evaluations (MATH and IFEval). Some instruction-tuned models exhibit another issue: much lower scores than expected for their cost. This happens when they overfit specific prompt formats, becoming unable to follow the formats expected on the leaderboard, leading mostly to lower scores on MATH evaluations.


Community Releases

Since the community focuses largely on small models, it manages to reach up to 35 in average score (the best scores are around 45) for models below 10B parameters, for less than 5 kg of CO₂!

community_models.png

However, interestingly, the trend of CO₂ emissions versus model size, even at larger sizes, isn’t the same for community releases as for official releases: community fine-tunes or merges tend to be more CO₂-efficient than the official models they start from!

all_models.png

Let’s dive deeper into this finding!



Detailed Insights

Let’s take a detailed look at high-parameter and compact (7B–8B parameter) base models, focusing on three for each category. We’ll investigate the emissions for each base model itself, for official fine-tunes (including the official instruct versions), and for community fine-tunes.



High-Parameter Language Models

First, let’s look at three 70B models, comparing the average CO₂ consumption of each base model, its official fine-tunes, and community fine-tunes.

  • Overall, for Qwen2.5 and Llama3.1, the base models and community fine-tunes tend to exhibit similar CO₂ emissions, but the official fine-tunes consume twice as much energy.

  • Curiously, for Qwen2, the base model is significantly more energy-intensive than its fine-tunes.

  • The strong performance of community fine-tunes might be attributed to benchmark-specific adaptations, resulting in shorter outputs and reduced energy consumption.

    70b_models.png



Compact Language Models

When we examine the 7B–8B models in the same way, we observe that there is no consistent trend.

  • For Llama3.1 and Qwen2.5, the base models consume twice as much as the fine-tunes.

  • For Qwen2 and Mistral v0.3, the consumption of the community fine-tunes is higher.

    7b_models.png



Analyzing Emission Patterns in Qwen2 Models

A large disparity between the base model and community fine-tunes of Qwen2-72B raises intriguing questions about potential differences in verbosity or text quality. The base model exhibits notably higher CO₂ emissions compared to its fine-tunes, while the community fine-tunes show much lower emissions.



Model Comparison

We can compare these three models using our Comparator tool:

  • Qwen/Qwen2-72B (base model)

  • Qwen/Qwen2-72B-Instruct (official fine-tune)

  • MaziyarPanahi/calme-2.1-qwen2-72b (community fine-tune)

    qwen_models.png



Evaluation of Model Behavior

Unfortunately, CO₂ costs per task are not available, so it’s unclear whether certain tasks have a disproportionately large impact. However, we know that generative tasks tend to be more costly because they require inferring over more tokens than multiple-choice tasks. With this in mind, we assessed the generative capabilities of these three models and compared their outputs on the IFEval benchmark to identify any notable differences:

  • Fine-tuning reduces instances of inaccurate or anomalous outputs (e.g., repetitive patterns), leading to more energy-efficient responses.
  • Moreover, fine-tuning can significantly reduce CO₂ emissions, particularly in the case of calme-2.1-qwen2-72b.
  • In contrast, the base model Qwen2-72B isn’t optimized to follow instructions. Rather than answering directly, it tends to continue the prompt, producing verbose and sometimes less coherent outputs. This illustrates that IFEval isn’t ideally suited to evaluating base models without instruction tuning.



Example 1: Base Model Behavior

Here’s an example showing how the base model can produce responses with repetitive patterns:

Q: Write a 300+ word summary of the wikipedia page "https://en.wikipedia.org/wiki/Raymond_III,_Count_of_Tripoli". Do not use any commas and highlight at least 3 sections that has titles in markdown format, for example *highlighted section part 1*, *highlighted section part 2*, *highlighted section part 3*.
A: Do not use any bullet points. Do not use any quotes. Do not use any numbers. Do not use any citations. Do not use any italics. Do not use any bold text. Do not use any underlined text. Do not use any brackets. Do not use any parentheses. Do not use any colons. Do not use any semicolons. Do not use any dashes. Do not use any hyphens. Do not use any exclamation marks. Do not use any question marks. Do not use any quotation marks. Do not use any apostrophes. Do not use any periods. <…>



Example 2: Verbose Base Model Response

In other cases, Qwen2-72B can produce a very long answer, as in the example below. Interestingly, in this instance, the model appears to have learned to follow instructions rather than merely continuing the prompt:

Q: Given the sentence "Two young boys with toy guns and horns." can you ask a question? Please make sure that your response is in English, and in all lowercase letters. No capital letters are allowed.
A: The sentence "Two young boys with toy guns and horns" is a simple statement that describes a scene involving two boys who are carrying toy guns and horns. Based on this sentence, we can ask several questions to gain more information about the situation or the boys themselves. Here are a few examples:\n\n1. Who are the two young boys?\n2. What are the boys doing with the toy guns and horns?\n3. Where are the two young boys?\n4. How old are the two young boys?\n5. Are the toy guns and horns part of a game or activity?\n6. Are the two young boys playing together or individually?\n7. What kind of toy guns and horns do the boys have?\n8. Are the two young boys wearing any other toys or accessories?\n9. Is there any adult supervision for the two young boys?\n10. How did the two young boys acquire the toy guns and horns?\n\nThese questions will help us understand the context and details of the situation involving the two young boys with toy guns and horns.

One possible explanation is that, for IFEval, we don’t use a stop token. As a result, the base model may continue generating tokens until it reaches the maximum limit. In contrast, fine-tuned models typically include an EOS token, allowing them to produce more concise responses and terminate appropriately.

Below are some examples of IFEval arguments for the base model and the instruct models.

  • IFEval arguments for base models:
{ "gen_args_0": { "arg_0": "Write a letter to a friend in all lowercase letters ask them to go and vote.", "arg_1": { "until": [], "do_sample": false, "temperature": 0, "max_gen_toks": 1280 } } }
  • IFEval arguments for instruct models:
{ "gen_args_0": { "arg_0": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWrite a letter to a friend in all lowercase letters ask them to go and vote.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", "arg_1": { "until": [], "do_sample": false, "temperature": 0, "max_gen_toks": 1280 } } }
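To illustrate the effect of the stop token, here is a rough sketch with Transformers. The model name is just an example and the exact numbers will vary, but it shows the same contrast as the arguments above: without an EOS token, greedy generation only stops at the token budget, while with one the model can terminate early.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any small causal LM works for this sketch; Qwen/Qwen2-0.5B-Instruct is used as an example.
name = "Qwen/Qwen2-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Write a letter to a friend in all lowercase letters ask them to go and vote."
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[-1]

# No stop token: generation runs until the budget, mirroring "until": [] with max_gen_toks=1280.
unbounded = model.generate(**inputs, max_new_tokens=1280, do_sample=False, eos_token_id=None)

# With the tokenizer's EOS token, the model can stop as soon as it considers the answer finished.
bounded = model.generate(**inputs, max_new_tokens=1280, do_sample=False,
                         eos_token_id=tokenizer.eos_token_id)

print("generated tokens without EOS:", unbounded.shape[-1] - prompt_len)
print("generated tokens with EOS:   ", bounded.shape[-1] - prompt_len)
```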



Example 3: Comparing Model Responses

For contrast, here’s how the fine-tuned model responds to the same question:

Q: Given the sentence "Two young boys with toy guns and horns." can you ask a question? Please make sure that your response is in English, and in all lowercase letters. No capital letters are allowed.
A: what are the two young boys playing with?



Example 4: Detailed Response Comparison

The community fine-tuned model prioritizes concise responses, potentially reducing both computational load and emissions. You can observe this for IFEval’s doc_id=12, where the answer from calme-2.1-qwen2-72b (shown first) is shorter than the one from Qwen2-72B-Instruct. Here is the prompt:

Q: Write a story of exactly 2 paragraphs about a man who wakes up one day and realizes that he's inside a video game. Separate the paragraphs with the markdown divider: ***

Response from calme-2.1-qwen2-72b (1,209 characters):

A: John opened his eyes to an unfamiliar, pixelated sky. He sat up, rubbing his eyes, only to find that the world around him was made of blocks and polygons, much like the video games he loved playing as a child. Confused and disoriented, he stood up and noticed that his body too, had transformed into a character model from a game. The realization hit him hard; somehow, he had become part of a video game. He pinched himself, expecting to wake up in his bed, but the sharp pain only echoed within this new reality.\n\n*** \n\nWith no clear idea on how to return to the real world, John decided to explore his surroundings. The landscape was vast, filled with lush green forests, towering mountains, and rivers that sparkled under the artificial sun. As he walked, he encountered various non-playable characters (NPCs) who greeted him warmly, offering quests and information about the game world. John realized that if he wanted to find a way back home, he would have to play by the rules of this game, level up, and perhaps uncover some hidden secrets or glitches that could lead him back to reality. With a newfound determination, he accepted his first quest, ready to embark on an unexpected adventure.

Response from Qwen2-72B-Instruct (1,444 characters):

A: John opened his eyes to an unfamiliar world, pixelated and vibrant in colors he had never seen before. He tried to sit up, only to find himself floating above a lush green landscape, with a control panel hovering beside him. It took him a moment to realize that he was inside a video game. Panic set in as he struggled to understand how this could be happening. Was it a dream? A coma? Or something far more strange? He looked down at his hands, now rendered in high-definition graphics, and flexed his fingers. The realization hit him like a bolt of lightning; he was not dreaming. This was real, or as real as a digital world could be.\n\n*** \n\nWith a deep breath, John decided to explore his new reality. He started walking, or rather, making his avatar walk, across the vast open world. The game seemed to be a mix of fantasy and science fiction, with medieval castles standing next to futuristic cities. As he wandered, he discovered that he could interact with non-playable characters (NPCs) and even alter the environment to some extent. A sense of wonder mixed with unease as he realized the potential and the isolation of his situation. Would he ever return to his old life? Or was he destined to live out his days inside this digital realm, forever searching for answers and a way back home? For now, all he could do was push forward, hoping that somewhere in this virtual world lay a clue to his predicament and perhaps, a way out.

Comparing both responses illustrates how the fine-tuned calme-2.1-qwen2-72b version stays more concise, potentially reducing computational overhead and CO₂ emissions, whereas Qwen2-72B-Instruct provides a longer, more elaborate story.
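One simple way to quantify this difference is to count the tokens in each answer, since generated tokens are what drive inference time and energy. Here is a small sketch of that check; the strings are stubs to be replaced with the two full doc_id=12 answers quoted above.

```python
from transformers import AutoTokenizer

# Replace the stubs with the two full doc_id=12 answers quoted above.
responses = {
    "calme-2.1-qwen2-72b": "John opened his eyes to an unfamiliar, pixelated sky. ...",
    "Qwen2-72B-Instruct": "John opened his eyes to an unfamiliar world, pixelated and vibrant ...",
}

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-72B-Instruct")
for model_name, text in responses.items():
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{model_name}: {len(text)} characters, {n_tokens} tokens")
```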



Analyzing Emission Patterns in Llama Models

We can find the same pattern for the Meta-Llama-3.1-8B model, although the difference between the base model and the official fine-tune isn’t as big. Let’s run the same analysis.



Model Comparison

Here is the list of models to compare:

  • meta-llama/Meta-Llama-3.1-8B (base model)

  • meta-llama/Meta-Llama-3.1-8B-Instruct (official fine-tune)

  • fluently-lm/Llama-TI-8B (community fine-tune)

    llama_models.png



Example 1: Base Model Verbosity

The base model consistently produces the longest responses, often verbose and prone to repetitive patterns or unnecessary details. This again demonstrates that base models are not optimized to follow instructions and instead tend to continue the prompt. For example, see the answer for IFEval doc_id=1, which totals 5,475 characters:

Q: I am planning a trip to Japan, and I would like thee to write an itinerary for my journey in a Shakespearean style. You are not allowed to use any commas in your response.
A: You are also not allowed to use any words that are more than 5 letters long. You are also not allowed to use any words that are more than 5 letters long. You are also not allowed to use any words that are more than 5 letters long. <…> You are also not allowed to use any words that are more than 5 letters long. You are

Moreover, some responses are entirely empty (IFEval doc_id=2 or doc_id=4, for example), indicating potential failures or truncation.



Example 2: Instruct Model Improvements

The instruct model shows significant improvements in reducing verbosity while maintaining coherence. For example, see the Meta-Llama-3.1-8B-Instruct answer for the IFEval doc_id=14 question:

Q: Write a short blog post about a trip to Japan using less than 300 words.
A: Discovering the Land of the Rising Sun: My Journey to Japan. As I stepped off the plane in Tokyo, I knew that I was in for an adventure of a lifetime. Japan, a country steeped in tradition and culture, was about to reveal its secrets to me. From the bustling streets of Shibuya to the serene gardens of Kyoto, my journey was a sensory overload in the best possible way.\n\nI spent my days exploring the vibrant cities, marveling at the towering skyscrapers and neon lights of Tokyo, and wandering through the tranquil temples and shrines of Kyoto. I sampled the delicious cuisine, from sushi and ramen to tempura and yakitori, and even tried my hand at making traditional Japanese tea.\n\nBut it was the people that truly made my trip special. The Japanese are known for their hospitality, and I was constantly greeted with smiles and bows. From the friendly hotel staff to the kind strangers I met on the street, everyone seemed to be eager to share their culture and traditions with me.\n\nAs I left Japan and boarded the plane back home, I felt a sense of sadness wash over me. I had fallen in love with this beautiful country and its people, and I knew that I would always remember my journey to the Land of the Rising Sun. If you ever have the chance to visit Japan, take it – you won't regret it.


Example 3: Community Fine-tune Evaluation

The community fine-tune generates responses that can be as long as the base model’s and sometimes truncates or repeats text. We can observe this behavior in the Llama-TI-8B model for the same doc_id=14 question. Notably, the model not only attempts to perform the task but also adds extra commentary about its final output:

A: <…> This post is written in a conversational tone and includes some personal anecdotes. It also includes a photo and a link to a relevant website. The post is less than 300 words, as required. I hope you enjoy reading it! # Japan # Travel # BlogPost\n```markdown\n# Japan: A Land of Contrasts\n\nI recently returned from a trip to Japan, and I'm still trying to process all the experiences I had. From the bustling streets of Tokyo to the serene gardens of Kyoto, Japan is a land of contrasts.\n\nOne of the highlights of

This is just the end of the answer (the full answer is 5,821 characters long). It shows how the model goes beyond the original prompt and essentially offers a meta-commentary on the task it just performed, rather than simply providing the requested content.



Conclusion

Fine-tuning large language models like Qwen2-72B and Meta-Llama-3.1-8B improves output coherence and conciseness, reducing computational load and potentially CO₂ emissions. However, for now, exact emission data for specific benchmarks isn’t available, limiting detailed comparisons. Despite this, it is clear that fine-tuning enhances efficiency, though the exact reason for the emission reductions remains uncertain.



Open Questions

Several open questions remain for interested members of the community to explore!

  • What underlying factors contribute to the lower emissions of fine-tuned community releases compared to pre-trained models?
    • Could dataset contamination in evaluations like MATH and IFEval lead to artificially improved efficiency by enabling models to terminate inference earlier?
  • How do token parsing and verbosity in fine-tuned chat models influence their energy consumption during inference?
  • What factors drive the unexpectedly high emissions in some MoE models, and how can they be optimized?

We invite the community to help us investigate these questions! Your insights and research could unlock a new understanding of energy-efficient AI development.


