Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates




We’re happy to introduce the NPHardEval leaderboard, built on NPHardEval, a benchmark developed by researchers from the University of Michigan and Rutgers University.

NPHardEval introduces a dynamic, complexity-based framework for assessing the reasoning abilities of Large Language Models (LLMs). It poses 900 algorithmic questions spanning complexity classes up to and including NP-hard, designed to rigorously test LLMs, and is updated monthly to prevent overfitting!



A Unique Approach to LLM Evaluation

NPHardEval stands apart by employing computational complexity classes, offering a quantifiable and robust measure of LLM reasoning skills. The benchmark’s tasks mirror real-world decision-making challenges, enhancing its relevance and applicability. Regular monthly updates of the benchmark data points mitigate the risk of model overfitting, ensuring a reliable evaluation.

The key contributions of NPHardEval are its benchmarking strategy (an automated and dynamic benchmark) and its new approach to evaluating LLM reasoning.

Regarding the benchmarking strategy, NPHardEval uses an automated mechanism both to generate and to check the questions in the benchmark. Since the questions are based on algorithmically computable problems, human intervention is not required to determine the correctness of LLM responses. This also allows NPHardEval to be a dynamic benchmark: because questions can be generated automatically, the benchmark can be updated on a monthly basis. This monthly refresh helps prevent model overfitting, as we can always generate novel questions with varying difficulty levels for evaluation.
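To illustrate why algorithmically computable problems remove the need for human grading, here is a minimal Python sketch (not the benchmark’s actual code; `edit_distance` and `check_llm_answer` are illustrative names) that grades an LLM’s answer to an edit distance question against a computed ground truth:

```python
# Illustrative sketch: automatic grading of an LLM answer for a P-class task
# (edit distance). Not NPHardEval's actual implementation.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution (or match)
            )
    return dp[-1][-1]

def check_llm_answer(question: dict, llm_answer: int) -> bool:
    """Grade the LLM's answer against the algorithmically computed ground truth."""
    return llm_answer == edit_distance(question["string_a"], question["string_b"])

# Example with a hypothetical question instance:
question = {"string_a": "kitten", "string_b": "sitting"}
print(check_llm_answer(question, 3))  # True: the edit distance is 3
```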

The questions themselves use a new system to evaluate LLM reasoning. They are grounded in the computational complexity hierarchy, a well-established concept extensively studied in theoretical computer science. This foundation lets us leverage existing research to rigorously and quantitatively measure an LLM’s logical reasoning ability, by defining reasoning via complexity classes. The benchmark also deliberately excludes numerical computation from the questions, since it is a notoriously difficult task for LLMs. Focusing on logical questions allows for a more accurate evaluation of an LLM’s pure logical reasoning ability, as numerical questions can obscure this assessment.



Data Synthesis

NPHardEval uses 100 questions for each of 9 different algorithms, with 10 difficulty levels, resulting in 900 questions spanning complexity and difficulty. The 9 algorithms, comprising 3 P, 3 NP-complete, and 3 NP-hard problems, are characterized according to computational complexity theory. All 900 questions are synthesized and refreshed monthly.

Figure: Tasks in NPHardEval

More background and insights can be found in these slides.
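As a rough illustration of how questions can be synthesized at increasing difficulty, the sketch below (an assumption about the process, not the repository’s actual generator; `generate_edp_instances` and the length-based difficulty scaling are illustrative) produces 10 edit-distance questions per level:

```python
# Illustrative sketch of monthly question synthesis for one task (edit distance).
# Difficulty is assumed to scale with string length; other tasks would scale
# other instance parameters (e.g. graph size for NP-hard problems).
import random
import string

def generate_edp_instances(level: int, n_per_level: int = 10, seed: int = 0) -> list[dict]:
    """Generate `n_per_level` edit-distance questions whose size grows with `level`."""
    rng = random.Random(seed + level)
    length = 4 + 2 * level  # assumed difficulty scaling: longer strings at higher levels
    instances = []
    for _ in range(n_per_level):
        a = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        b = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        instances.append({"level": level, "string_a": a, "string_b": b})
    return instances

# 10 levels x 10 questions = 100 questions for this task; 9 tasks give 900 in total.
benchmark = [q for level in range(1, 11) for q in generate_edp_instances(level)]
print(len(benchmark))  # 100
```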



Evaluation Metrics

We use two metrics to evaluate the reasoning ability of LLMs: Weighted Accuracy and Failure Rate.



Weighted Accuracy (WA)

Weighted Accuracy (WA) is used to evaluate problem-solving accuracy. It is applied to each problem, either through comparison with a correct answer or via step-by-step result checking for problems without a unique answer. To reflect comparative accuracy more effectively, we assign weights to the different difficulty levels. Each level’s weight corresponds to its relative importance or challenge, with higher difficulty levels receiving more weight in a linear progression (for instance, level 1 has weight 1, level 2 has weight 2, and so on).

The formula for Weighted Accuracy is as follows:

$$ WA = \frac{\sum\limits_{i=1}^{10} (w_i \times A_i)}{\sum\limits_{i=1}^{10} w_i} $$

In this equation, $w_i$ is the weight assigned to difficulty level $i$, and $A_i$ is the accuracy at that level.
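A minimal Python sketch of this formula (illustrative, not the benchmark’s evaluation code) could look like:

```python
# Illustrative sketch of Weighted Accuracy with linear weights w_i = i.

def weighted_accuracy(accuracy_per_level: list[float]) -> float:
    """WA = sum(w_i * A_i) / sum(w_i) over the 10 difficulty levels."""
    weights = range(1, len(accuracy_per_level) + 1)  # linear weights: 1, 2, ..., 10
    return sum(w * a for w, a in zip(weights, accuracy_per_level)) / sum(weights)

# Example: a model that is perfect on easy levels but degrades on hard ones.
accuracies = [1.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1]
print(round(weighted_accuracy(accuracies), 3))  # 0.464
```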



Failure Rate (FR)

Another critical metric we consider is the Failure Rate (FR). It measures the frequency of unsuccessful outcomes across different problems and difficulty levels, and is particularly useful for identifying instances where an LLM’s output does not match the expected format.

The Failure Rate is calculated as the proportion of failed attempts relative to the total number of attempts at each difficulty level. An attempt counts as failed if the model generates results that cannot be successfully parsed in all endpoint calls. We set the maximum number of tries to 10. For each problem, the Failure Rate is then aggregated across all difficulty levels, considering the total of 10 attempts at each level.

The formal definition of Failure Rate is:

$$ FR = \frac{\sum\limits_{i=1}^{10} F_i}{100} $$

Here, $F_i$ denotes the number of failed attempts at difficulty level $i$.
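And a matching sketch for the Failure Rate (again illustrative, assuming 10 attempts at each of the 10 levels):

```python
# Illustrative sketch of the Failure Rate over 10 levels x 10 attempts = 100 tries.

def failure_rate(failed_per_level: list[int]) -> float:
    """FR = sum(F_i) / 100, where F_i is the number of failed attempts at level i."""
    return sum(failed_per_level) / 100

# Example: a model whose outputs become unparseable at the hardest levels.
failures = [0, 0, 0, 1, 1, 2, 3, 5, 8, 10]
print(failure_rate(failures))  # 0.3
```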



Experimentation and Insights

The benchmark includes comprehensive experiments analyzing LLMs across complexity classes and difficulty levels. It delves into the nuances of LLM performance, providing valuable insights into their reasoning strengths and limitations. In general:

  • Closed-source models generally perform better than open-source models, with GPT-4 Turbo performing the best overall.
  • Models generally perform better on less complex questions, i.e. easier complexity classes, though performance does not always decrease linearly with difficulty level. Models such as Claude 2 perform best on NP-complete (middle-complexity) questions.
  • Some open-source models can outperform closed-source models on specific questions. Leading open-source models include Yi-34b, Qwen-14b, Phi-2, and Mistral-7b.
Figure: Weighted Accuracy and Failure Rate
Figure: Zeroshot Heatmap



Reproducing the NPHardEval benchmark results on your machine

To set up the NPHardEval benchmark, you need to follow a few steps:

  1. Environment setup: after cloning the repository to your local machine, install the required Python libraries with conda.
    conda create --name llm_reason python==3.10
    conda activate llm_reason
    git clone https://github.com/casmlab/NPHardEval.git
    pip install -r requirements.txt
    
  2. Set up API keys: fetch your API keys and change the corresponding entries in secrets.txt.
  3. Example Commands: evaluate your model with the NPHardEval benchmark!

For example, to use the GPT-4 Turbo model (GPT-4-1106-preview) and the edit distance problem (EDP) for evaluation:

  • For its zeroshot experiment, we can use:
  cd Close/run
  python run_p_EDP.py gpt-4-1106-preview
  • For its fewshot experiment, we can use:
  cd Close/run
  python run_p_EDP_few.py gpt-4-1106-preview self

We currently support fewshot examples from the same question (self), and may support examples from other questions (other) in the future.



Join the Conversation

The NPHardEval leaderboard, dataset, and code are available on GitHub and Hugging Face for community access and contributions.

We would love to see community contributions and interest on the NPHardEval GitHub repository and Hugging Face leaderboard.


