Hugging Face’s Open LLM Leaderboard (originally created by Ed Beeching and Lewis Tunstall, and maintained by Nathan Habib and Clémentine Fourrier) is well known for tracking the performance of open-source LLMs, comparing their performance across a wide range of tasks such as TruthfulQA or HellaSwag.
This has been of tremendous value to the open-source community, as it provides a way for practitioners to keep track of the best open-source models.
In late 2023, at Vectara we introduced the Hughes Hallucination Evaluation Model (HHEM), an open-source model for measuring the extent to which an LLM hallucinates (generates text that’s nonsensical or unfaithful to the provided source content). Covering both open-source models like Llama 2 or Mistral 7B and commercial models like OpenAI’s GPT-4, Anthropic’s Claude, or Google’s Gemini, this model highlighted the stark differences that currently exist between models in their likelihood to hallucinate.
As we continued to add new models to HHEM, we looked for an open-source solution to manage and update the HHEM leaderboard.
Recently, the Hugging Face leaderboard team released leaderboard templates (here and here). These are lightweight versions of the Open LLM Leaderboard itself that are both open-source and simpler to use than the original code.
Today we’re happy to announce the release of the new HHEM leaderboard, powered by the HF leaderboard template.
Vectara’s Hughes Hallucination Evaluation Model (HHEM)
The Hughes Hallucination Evaluation Model (HHEM) Leaderboard is dedicated to assessing the frequency of hallucinations in document summaries generated by Large Language Models (LLMs) such as GPT-4, Google Gemini, or Meta’s Llama 2. To use it, you can follow the instructions here.
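As an illustration, the original HHEM model card showed the model being used as a sentence-transformers cross-encoder that scores (source, summary) pairs. Here is a minimal sketch along those lines; the exact loading code may differ for newer versions of the model, so treat the model card as the source of truth:

```python
# A minimal sketch of scoring source/summary pairs with HHEM, assuming the
# sentence-transformers CrossEncoder interface from the original model card;
# newer HHEM releases may expose a different loading API.
from sentence_transformers import CrossEncoder

model = CrossEncoder("vectara/hallucination_evaluation_model")

# Each pair is (source document, generated summary). The model returns a
# factual consistency score in [0, 1]; higher means more faithful to the source.
pairs = [
    ["A man walks into a bar and buys a drink.", "A bloke swigs alcohol at a pub."],
    ["The capital of France is Paris.", "The capital of France is Berlin."],
]
scores = model.predict(pairs)
print(scores)
```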
By releasing this model as open source, we at Vectara aim to democratize the evaluation of LLM hallucinations, raising awareness of the differences that exist between LLMs in their propensity to hallucinate.
Our initial release of HHEM was a Hugging Face model alongside a GitHub repository, but we quickly realized that we needed a mechanism to allow new types of models to be evaluated. Using the HF leaderboard code template, we were able to quickly put together a new leaderboard that allows for dynamic updates, and we encourage the LLM community to submit new relevant models for HHEM evaluation.
Setting up HHEM with the LLM leaderboard template
To set up the Vectara HHEM leaderboard, we had to follow a few steps, adjusting the HF leaderboard template code to our needs:
- After cloning the space repository to our own organization, we created two associated datasets, “requests” and “results”; these datasets hold the requests submitted by users for new LLMs to evaluate, and the results of those evaluations, respectively.
- We populated the results dataset with existing results from the initial launch, and updated the “About” and “Citations” sections.
For a simple leaderboard, where evaluation results are pushed by your backend to the results dataset, that’s all you need!
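For example, a backend can push a results JSON file straight to that dataset with `huggingface_hub`. The repository id and file layout below are illustrative placeholders, not the actual HHEM repositories, and the script assumes you are already logged in (e.g. via `huggingface-cli login`):

```python
# A minimal sketch of pushing one evaluation result to the "results" dataset.
# repo_id and the metric values are placeholders for illustration only.
import json
from huggingface_hub import HfApi

result = {
    "model": "mistralai/Mistral-7B-Instruct-v0.1",
    "hallucination_rate": 0.0,        # placeholder values
    "factual_consistency_rate": 0.0,
    "answer_rate": 0.0,
    "average_summary_length": 0.0,
}

local_path = "mistralai_Mistral-7B-Instruct-v0.1.json"
with open(local_path, "w") as f:
    json.dump(result, f, indent=2)

HfApi().upload_file(
    path_or_fileobj=local_path,
    path_in_repo=f"results/{local_path}",
    repo_id="your-org/results",       # the results dataset created above
    repo_type="dataset",
)
```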
As our evaluation is more complex, we then customized the source code to fit the needs of the HHEM leaderboard. Here are the details:
- `leaderboard/src/backend/model_operations.py`: This file contains two primary classes, `SummaryGenerator` and `EvaluationModel` (a simplified sketch of how they fit together follows this list).
  a. The `SummaryGenerator` generates summaries based on the HHEM private evaluation dataset and calculates metrics like Answer Rate and Average Summary Length.
  b. The `EvaluationModel` loads our proprietary Hughes Hallucination Evaluation Model (HHEM) to evaluate these summaries, yielding metrics such as Factual Consistency Rate and Hallucination Rate.
- `leaderboard/src/backend/evaluate_model.py`: defines the `Evaluator` class, which uses both `SummaryGenerator` and `EvaluationModel` to compute and return results in JSON format.
- `leaderboard/src/backend/run_eval_suite.py`: contains a function `run_evaluation` that leverages `Evaluator` to obtain and upload evaluation results to the `results` dataset mentioned above, causing them to appear in the leaderboard.
- `leaderboard/main_backend.py`: manages pending evaluation requests and executes automatic evaluations using the aforementioned classes and functions. It also includes an option for users to reproduce our evaluation results.
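To make the division of labor concrete, here is a simplified, hypothetical sketch of how these pieces fit together. The real classes in the backend take different parameters and handle many more cases; this is only meant to show the shape of the pipeline:

```python
# A hypothetical sketch of the backend pipeline, not the actual HHEM leaderboard code.
import json


class SummaryGenerator:
    """Generates summaries with the candidate LLM over the evaluation documents."""
    def __init__(self, model_id: str):
        self.model_id = model_id

    def generate(self, documents: list[str]) -> list[str]:
        # The real backend calls the candidate model here and also tracks
        # Answer Rate and Average Summary Length.
        raise NotImplementedError


class EvaluationModel:
    """Scores (document, summary) pairs with HHEM."""
    def score(self, documents: list[str], summaries: list[str]) -> list[float]:
        # The real backend loads HHEM and returns factual consistency scores.
        raise NotImplementedError


class Evaluator:
    """Combines summary generation and HHEM scoring into leaderboard metrics."""
    def __init__(self, model_id: str, generator: SummaryGenerator, judge: EvaluationModel):
        self.model_id = model_id
        self.generator = generator
        self.judge = judge

    def evaluate(self, documents: list[str], threshold: float = 0.5) -> str:
        summaries = self.generator.generate(documents)
        scores = self.judge.score(documents, summaries)
        consistent = sum(s >= threshold for s in scores) / len(scores)
        # Returned as JSON so it can be uploaded to the results dataset.
        return json.dumps({
            "model": self.model_id,
            "factual_consistency_rate": consistent,
            "hallucination_rate": 1.0 - consistent,
        })
```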
The full source code is available in the Files tab of our HHEM leaderboard repository.
With all these changes, the evaluation pipeline is ready to go and easily deployable as a Hugging Face Space.
Summary
HHEM is a novel classification model that can be used to evaluate the extent to which LLMs hallucinate. Our use of the Hugging Face leaderboard template provided much-needed support for a common need of any leaderboard: the ability to manage the submission of new model evaluation requests, and to update the leaderboard as new results emerge.
Big kudos to the Hugging Face team for making this valuable framework open-source and for supporting the Vectara team in the implementation. We expect this code to be reused by other community members who aim to publish other kinds of LLM leaderboards.
If you would like to contribute new models to HHEM, please submit them on the leaderboard; we very much appreciate any suggestions for new models to evaluate.
And if you have any questions about the Hugging Face LLM front-end or Vectara, please feel free to reach out in the Vectara or Hugging Face forums.
