Chain-of-thought prompting is emerging as a powerful and effective design pattern for LLM-based apps and agents. The basic idea of chain-of-thought prompting is to let a model generate a step-by-step solution (“reasoning trace”) before answering a question or making a decision. With the Open CoT Leaderboard we’re tracking LLMs’ ability to generate effective chain-of-thought traces for challenging reasoning tasks.
Unlike most performance-based leaderboards, we’re not scoring the absolute accuracy a model achieves on a given task, but the difference between its accuracy with and without chain-of-thought prompting:
accuracy gain Δ = accuracy with CoT – accuracy w/o CoT.
This allows us to really see the effect that chain-of-thought prompting has on model accuracy.
Note: in the case without CoT prompting, we use loglikelihood accuracy to score the model on the multiple-choice evaluation.
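To make the two scoring modes concrete, here is a minimal sketch of loglikelihood-based multiple-choice scoring and of the accuracy-gain metric. The model choice and helper functions are illustrative assumptions, not the leaderboard’s actual evaluation code:

```python
# Minimal sketch of loglikelihood-based multiple-choice scoring (illustration only,
# not the leaderboard's actual evaluation pipeline).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def choice_loglikelihood(prompt: str, choice: str) -> float:
    """Sum of log-probabilities the model assigns to `choice` as a continuation of `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    cont_len = full_ids.shape[1] - prompt_ids.shape[1]
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..n-1
    cont_ids = full_ids[0, -cont_len:]                     # token ids of the continuation
    return logprobs[-cont_len:].gather(1, cont_ids.unsqueeze(1)).sum().item()

def predict(prompt: str, options: list[str]) -> int:
    """'Without CoT' scoring mode: pick the answer option with the highest loglikelihood."""
    return max(range(len(options)), key=lambda i: choice_loglikelihood(prompt, " " + options[i]))

# The leaderboard metric is then simply:
#   accuracy gain Δ = accuracy with CoT − accuracy without CoT
```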
What’s the motivation behind such a leaderboard for chain-of-thought?
Chain-of-thought prompting is a universally applicable prompting strategy that can improve the explainability and accuracy of LLM-based apps and agents (see, e.g., this collection for recent research and implementations). With frameworks like Langchain or LMQL, it’s straightforward to insert sophisticated reasoning chains into your apps. But even if you’ve never heard of chain-of-thought before, you may have noticed, while using a ChatBot, that it tends to proceed step by step before answering your query. So a systematic, up-to-date comparison of LLMs’ ability to generate effective chain-of-thought traces may inform the decisions of builders and users when choosing a model.
Over time, static “accuracy-based” benchmarks risk becoming less informative: does a model score well because of its superior skill, because it has seen the correct answers during training, or because it has been developed in a competitive context that is governed by this very benchmark? These widely acknowledged issues are addressed by recent eval approaches such as ChatBot arenas, the use of LLMs as judges, or dynamic benchmarks with programmatically generated tasks. We hope the Open CoT Leaderboard contributes to these efforts, notably by being more robust to training data contamination: knowing the answer to a question doesn’t ensure that one can reason effectively about it.
Which tasks are used?
The Open CoT Leaderboard evaluates LLMs’ ability to generate effective chain-of-thought reasoning traces for the following tasks:
Apart from the original version of LogiQA, all these tasks are part of the AGIEval benchmark, and have been re-published as logikon-bench.
We’ve chosen these tasks because they
- are generic, i.e. can be solved through reasoning and merely require commonsense knowledge;
- are still relatively difficult even for the most powerful LLMs (leaving enough room for improvement through chain-of-thought);
- have been introduced as AI benchmarks before (in AGIEval) and are widely used (e.g., in the Nous benchmark suite).
All tasks are rendered as multiple-choice problems, with the answer options enumerated in the prompt.
We use the following prompt template for assessing baseline and CoT accuracies; the reasoning trace (beginning with “Reasoning:”) is only added in the “with CoT” case:
```
Answer the following question about the given passage. Base your answer on the reasoning below.

Passage: <passage>

Question: <question>
A. <choice_A>
B. <choice_B>
…

Reasoning: <reasoning>

Answer:
```
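For concreteness, a template like this could be rendered in code roughly as follows. This is a sketch only; the function and parameter names are our own, not the leaderboard’s actual implementation:

```python
# Sketch of rendering the prompt template above; the Reasoning block is only
# included in the "with CoT" case.
def render_prompt(passage: str, question: str, options: list[str],
                  reasoning: str | None = None) -> str:
    lines = ["Answer the following question about the given passage."]
    if reasoning is not None:
        lines[0] += " Base your answer on the reasoning below."
    lines += ["", f"Passage: {passage}", "", f"Question: {question}"]
    lines += [f"{label}. {option}" for label, option in zip("ABCDE", options)]
    if reasoning is not None:
        lines += ["", f"Reasoning: {reasoning}"]
    lines += ["", "Answer:"]
    return "\n".join(lines)
```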
How are chain-of-thought traces generated?
The literature on chain-of-thought prompting has exploded in the last year, and the wealth of prompting strategies for CoT (involving, e.g., decoding, prompt wording, prompt chains, decomposition, aggregation and revision of reasoning traces) has become sheerly mind-blowing.
To cope with this variety, the evaluation pipeline behind the Open CoT Leaderboard implements CoT generation through modular and easily extensible prompt chains. So far, we have implemented two fairly simple prompting strategies:
- Classic: Problem presentation followed by a “Let’s think step by step.” instruction.
- Reflect: Problem presentation followed by an instruction to reflect upon the problem in general terms before solving it step by step.
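Schematically, the two strategies can be thought of as simple prompt chains along the following lines. This is a rough sketch; the instruction wording and the `generate` callable are illustrative assumptions, not the leaderboard’s actual prompts:

```python
from typing import Callable

# Illustrative sketches of the two CoT prompt chains; wording is an assumption.
def classic_chain(problem: str, generate: Callable[[str], str]) -> str:
    """Classic: problem presentation followed by a 'Let's think step by step.' instruction."""
    return generate(f"{problem}\n\nLet's think step by step.")

def reflect_chain(problem: str, generate: Callable[[str], str]) -> str:
    """Reflect: reflect on the problem in general terms, then solve it step by step."""
    reflection = generate(
        f"{problem}\n\nReflect on this problem in general terms: what kind of decision problem "
        "is it, and what are common mistakes for this kind of problem?"
    )
    return generate(
        f"{problem}\n\n{reflection}\n\nNow make a plan and solve the problem step by step, "
        "following the plan."
    )
```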
Let’s have a look at an example. The following problem is taken from the LSAT logical reasoning benchmark.
Editorial: The structure of the present school calendar was established to satisfy the requirements of early-twentieth-century agricultural life. In those days, farmers needed their children to have long breaks during which they could remain at home and help with the harvest. The contemporary school year is thus made up of periods of study interspersed with long breaks. But agricultural life no longer occupies most of our citizens, so we can now make changes that serve the interests of children. Therefore, long breaks should be removed from the school calendar.
Which one of the following is an assumption on which the editorial's argument depends?
A) A change in the current school calendar that shortened breaks would serve the interests of agricultural life.
B) Children of farmers need to continue observing a school calendar made up of periods of study interspersed with long breaks.
C) Long breaks in the school calendar should be replaced with breaks that are no longer than workers' average vacations.
D) A school calendar made up of periods of study without long breaks would serve the interests of children more than a school calendar with long breaks.
E) During long breaks children tend to forget what they have learned.
Mixtral-8x7B-Instruct-v0.1 does fairly well in generating the following chain-of-thought traces:
| With prompting strategy Classic | With prompting strategy Reflect |
|---|---|
| The editorial argues that the current school calendar, with long breaks, was designed to accommodate agricultural life but should now be modified because agricultural life no longer occupies most of our citizens. It concludes that long breaks should be removed from the school calendar.<br><br>The assumption that the argument depends on is that removing long breaks from the school calendar would serve the interests of children more than the current calendar. This is not explicitly stated in the editorial, but it is the underlying assumption that justifies the proposed change. Therefore, the correct answer is: | The decision problem in abstract terms is to identify the assumption on which the editorial’s argument depends.<br><br>Common mistakes for this kind of problem include:<br><br>A plan for how to solve this problem could be:<br><br>Solving the problem step by step, following the plan:<br><br>Therefore, the assumption on which the editorial’s argument depends is D) A school calendar made up of periods of study without long breaks would serve the interests of children more than a school calendar with long breaks. |
Each of our two prompting strategies—Classic and Reflect—is combined and run with multiple decoding parameters:
- Greedy.
- Beam search (n=2).
- Sampling (T=.3).
This gives us, in combination, six “CoT generation regimes.” When we evaluate a model, we let it generate, for every example in the test datasets, one chain-of-thought trace per regime. The generated traces are then plugged into the prompt template shown above, which we use to score the models. We obtain, accordingly, six different numerical scores for accuracy with chain-of-thought (namely, one per regime). In the Open CoT Leaderboard, we report (for every model/task) the best marginal accuracy gain achieved under any regime.
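In terms of Hugging Face transformers generation parameters, the six regimes and the reported score can be sketched roughly as follows. The decoding values match the list above; everything else is an illustrative assumption rather than the leaderboard’s actual configuration:

```python
from itertools import product

# The three decoding configurations (values as listed above), expressed as
# transformers generate()-style keyword arguments.
decoding_configs = {
    "greedy":   dict(do_sample=False, num_beams=1),
    "beam":     dict(do_sample=False, num_beams=2),
    "sampling": dict(do_sample=True, temperature=0.3),
}
strategies = ["classic", "reflect"]

# Two prompting strategies x three decoding configs = six CoT generation regimes.
regimes = list(product(strategies, decoding_configs))

def best_accuracy_gain(acc_with_cot_per_regime: dict, acc_without_cot: float) -> float:
    """The reported leaderboard score: best marginal accuracy gain under any regime."""
    return max(acc - acc_without_cot for acc in acc_with_cot_per_regime.values())
```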
What are the main take-aways so far?
We’re gradually extending the Open CoT Leaderboard by evaluating more and more models, but the current results (model count: 30) already suggest some interesting insights.
- Mighty dwarfs: We have been pleased to see that relatively small (7B) open LLMs are capable of effective, i.e. accuracy-improving, chain-of-thought reasoning, in some cases more so than much bigger models. 🎉 For example, a small model like Phi-2 benefits more than the Mixtral model from added CoT traces.
- Instruction- and chat-finetuning helps: Finetuned models score much better than their corresponding base models. More specifically, finetuning may improve both the baseline accuracy without CoT and the marginal accuracy gains achieved through CoT.
- Variable and ambiguous effects of CoT: Digging a bit deeper, we see that there is no single preferred or superior CoT generation regime. What works best for one model and one task might not work for another model or another task. And sometimes CoT reduces accuracy rather than increasing it. We take this as a reminder that finding an implementation of CoT that is universally effective, reliable and robust remains a challenging problem.
What are the next steps? – And how to contribute.
We’re planning to move forward in several directions, and contributions to all these efforts are more than welcome.
First, we’d love to evaluate your models! You can 📬 submit any open LLMs for evaluation on the Open CoT Leaderboard space, using the Submission tab!
Then, we’d love some help with the following coding and data analysis tasks.

- Perform in-depth analysis of the full evaluation results. For example, a qualitative analysis of the generated CoT traces to check whether they actually point to the correct answer choice. We’ve created a notebook that shows how to access and explore the eval results and reasoning traces that back up the Open CoT Leaderboard. You can build on that and share your own analyses in the corresponding repo (or anywhere else, of course). Feel free to open an issue with suggestions or questions. If you plan to use the data for research projects and want feedback, just drop a note.
- Create an Open CoT Dashboard. The Open CoT Leaderboard is concerned with ranking models according to marginal accuracy gains. It doesn’t display the baseline accuracies, the variance, the scores for the different CoT generation regimes, properties of the generated reasoning traces (e.g., length), and so on. We think it would be super informative to complement the leaderboard with a dashboard (e.g., as an extra tab or a separate HF space) that presents all this information and can be interactively explored by users. If you’re interested in building such an Open CoT Dashboard (with or without us), just reach out.
- More CoT chains. We’re considering implementing further CoT generation regimes. Promising candidates are, for example, self-consistency, tree-of-thought, self-check, or debating. Want to help us with that? Get in touch! (🤫: Why not pick such a project for your master’s or bachelor’s thesis?)
- More tasks and test datasets. The Open CoT Leaderboard is arguably built on a rather narrow set of benchmarks. Once we have free compute resources, we’d like to include further challenging reasoning tasks. We’d be happy to learn which tasks you’d like to see included in the Open CoT Leaderboard.
Here’s where we can exchange ideas and collaborate:
- For non-technical suggestions and feedback, join the discussion on the leaderboard’s HF space.
- For technical feedback and questions, open an issue at our GitHub repo.
Looking forward to hearing from you!
