Large language models (LLMs) are increasingly becoming a primary source for information delivery across diverse use cases, so it’s essential that their responses are factually accurate.
To continue improving their performance on this industry-wide challenge, we need to better understand the types of use cases where models struggle to provide an accurate response and better measure factuality performance in those areas.
The FACTS Benchmark Suite
Today, we’re teaming up with Kaggle to introduce the FACTS Benchmark Suite. It extends our previous work developing the FACTS Grounding Benchmark, with three additional factuality benchmarks, including:
- A Parametric Benchmark that measures the model’s ability to access its internal knowledge accurately in factoid query use cases.
- A Search Benchmark that tests a model’s ability to make use of Search as a tool to retrieve information and synthesize it appropriately.
- A Multimodal Benchmark that tests a model’s ability to respond to prompts related to input images in a factually correct manner.
We’re also updating the original FACTS Grounding benchmark with Grounding Benchmark – v2, an extended benchmark that tests a model’s ability to provide answers grounded in the context of a given prompt.
Each benchmark was carefully curated to provide a total of 3,513 examples, which we’re making publicly available today. Much like our previous release, we’re following standard industry practice and keeping an evaluation set held out as a private set. The FACTS Benchmark Suite score (or FACTS score) is calculated as the average accuracy of both the public and private sets across the four benchmarks. Kaggle will oversee the management of the FACTS Benchmark Suite. This includes owning the private held-out sets, testing the leading LLMs on the benchmarks, and hosting the results on a public leaderboard. More details about the FACTS evaluation methodology can be found in our tech report.
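To make the aggregation concrete, below is a minimal sketch (in Python) of how a score defined this way could be computed. The benchmark keys, helper function, and accuracy values are illustrative placeholders, not the official evaluation code.

```python
# Hypothetical benchmark names standing in for the four FACTS benchmarks.
BENCHMARKS = ("parametric", "search", "multimodal", "grounding_v2")


def facts_score(accuracies: dict[str, dict[str, float]]) -> float:
    """Average accuracy over the public and private sets of all four benchmarks.

    `accuracies` maps each benchmark name to its {"public": ..., "private": ...}
    accuracy, each expressed as a fraction in [0, 1].
    """
    per_set = [
        accuracies[name][split]
        for name in BENCHMARKS
        for split in ("public", "private")
    ]
    return sum(per_set) / len(per_set)


# Example with made-up accuracy numbers:
example = {
    "parametric":   {"public": 0.62, "private": 0.60},
    "search":       {"public": 0.71, "private": 0.69},
    "multimodal":   {"public": 0.55, "private": 0.57},
    "grounding_v2": {"public": 0.80, "private": 0.78},
}
print(f"FACTS score: {facts_score(example):.3f}")  # -> 0.665
```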
Benchmark overview
Parametric Benchmark
The FACTS Parametric benchmark assesses the ability of models to accurately answer factual questions without the help of external tools like web search. All of the questions in the benchmark are “trivia style” questions driven by user interest that can be answered via Wikipedia (a standard source for LLM pretraining). The resulting benchmark consists of a 1,052-item public set and a 1,052-item private set.
