Today’s large language models (LLMs) are incredibly “book smart.” They’ll write beautiful essays, answer trivia questions, and even pass bar exams, all within the boundaries of curated benchmarks. But do these benchmark results truly reflect strong performance on real-world tasks handled at the level of PhD or MBA professionals?
The SWE-Bench dataset moves evaluation closer to real-world workflows by testing how well LLMs fix software bugs and build new features. However, a major gap remains: the lack of high-quality, text-only datasets that mirror the complex reasoning tasks faced by professionals in fields like finance and materials science. We’re not talking about simple Q&A or retrieval-based tasks. We’re talking about multi-page assignments that require deep domain knowledge and reasoning. Can AI generate comprehensive reports by applying the nuanced reasoning that a PhD-level physicist/chemist or an MBA-level consultant/financier would bring? To accurately measure these advanced capabilities, we need a new benchmark: ProfBench, now supported directly within the NVIDIA NeMo Evaluator SDK.
The NeMo Evaluator SDK provides a scalable, reproducible way to run hundreds of benchmarks built on top of popular evaluation repos, including LM-eval-harness, simple-evals, and BigCode, and to compare model performance.
What’s ProfBench?
ProfBench is a new benchmark designed to evaluate LLMs on complex, open-ended tasks that require professional-grade knowledge. The dataset comprises over 7,000 response-criterion pairs across four deep-expertise domains:
- Finance MBA
- Consulting MBA
- Chemistry PhD
- Physics PhD
To understand the complexity, let’s look at a Finance MBA example.
A user, acting as a senior partner at an investment bank, asks the AI to evaluate a potential new business unit focused on global health empowerment. The prompt isn’t a single question; it’s a multi-step task that includes:
1. Analyzing the history of the International Finance Facility for Immunization (IFFIm) and how it used securitization to raise money for the GAVI vaccine alliance
2. Detailing the technical aspects, factors for success, and risks involved
3. Assessing whether IFFIm can serve as a “blueprint” for other global health initiatives
4. Identifying 3-5 other organizations that could use a similar model
5. Delivering the entire analysis in the form of a detailed investment memo, not just a list of answers
This is the kind of task that demands analysis, synthesis, and domain-specific knowledge far beyond simple fact retrieval.
Similarly, in a Chemistry PhD example:
A user in a research lab asks the AI to perform the calculations required in a complex titration experiment involving two acids: a 100 mL mixture of acetic acid (0.5 M) and formic acid (0.1 M) titrated with 0.5 M NaOH. The task includes the following steps (a minimal numerical sketch for part 4 appears after the list):
1. Calculating the volume of NaOH titrant required to reach the point where the two conjugate bases have equal concentrations.
2. Calculating the concentrations of the acids and their conjugate bases at the point referenced in part 1.
3. Calculating the concentration of hydronium ions and the pH of the analyte at the point referenced in part 1.
4. Calculating the volume of NaOH titrant required to reach the point where the pH of the analyte is 7.0.
5. Calculating the concentrations of the acids and their conjugate bases at the point referenced in part 4.
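For a sense of the arithmetic involved, here is a minimal Python sketch of part 4: finding the titrant volume at which the analyte reaches pH 7.0 by solving the charge balance numerically. It assumes standard literature values (pKa ≈ 4.76 for acetic acid, ≈ 3.75 for formic acid) and ideal behavior; those constants and the bisection approach are choices of this sketch, not part of the prompt.

```python
# Minimal sketch of the titration arithmetic for part 4 of the task above.
# Assumes literature values pKa(acetic) ~ 4.76 and pKa(formic) ~ 3.75 and
# ideal behavior; small differences in constants shift the answer slightly.
from math import log10

KA_ACETIC = 10 ** -4.76   # acid dissociation constant of acetic acid
KA_FORMIC = 10 ** -3.75   # acid dissociation constant of formic acid
KW = 1e-14                # water autoionization constant
N_ACETIC = 0.100 * 0.5    # mol acetic acid in the 100 mL analyte
N_FORMIC = 0.100 * 0.1    # mol formic acid in the 100 mL analyte
C_NAOH = 0.5              # titrant concentration, mol/L
V0 = 0.100                # initial analyte volume, L

def h_plus(v_naoh: float) -> float:
    """Solve the charge balance for [H+] at a given titrant volume (L)."""
    vt = V0 + v_naoh
    def residual(h: float) -> float:
        acetate = (N_ACETIC / vt) * KA_ACETIC / (KA_ACETIC + h)
        formate = (N_FORMIC / vt) * KA_FORMIC / (KA_FORMIC + h)
        sodium = C_NAOH * v_naoh / vt
        # [Na+] + [H+] - [OH-] - [CH3COO-] - [HCOO-] = 0 at equilibrium
        return sodium + h - KW / h - acetate - formate
    lo, hi = 1e-14, 1.0
    for _ in range(200):              # bisection: residual increases with h
        mid = (lo + hi) / 2
        if residual(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def ph(v_naoh: float) -> float:
    return -log10(h_plus(v_naoh))

# Part 4: find the NaOH volume at which the analyte reaches pH 7.0.
lo, hi = 0.0, 0.5                     # pH rises monotonically with titrant volume
for _ in range(100):
    mid = (lo + hi) / 2
    if ph(mid) < 7.0:
        lo = mid
    else:
        hi = mid
print(f"V(NaOH) at pH 7.0 = {(lo + hi) / 2:.5f} L")   # about 0.119 L with these constants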
What Makes ProfBench Special?
Fig 1. Distribution of Rubrics across categories and sub-categories.
What makes the benchmark special? The grading system isn’t just about getting a multiple-choice or short-answer question right. Instead, we have human experts write rubrics that evaluate the AI’s work along three dimensions using a diverse set of criteria:
- Extraction: Did it get the right data and details?
- Reasoning: Is the logic sound? Is the math correct? Are the conclusions justified?
- Style: Is the answer presented clearly, and in the requested format?
The LLM response is then graded on whether it fulfills rubric criteria such as the following (a schematic scoring sketch appears after these examples):
For the Finance MBA:
Extraction: States that a breach of IFFIm’s liquidity policy could negatively impact IFFIm’s rating profile.
Reasoning: States that vaccines are the most successful and cost-effective health investments in the world.
Style: Presents findings clearly to allow for effective use.
For the Chemistry PhD:
Extraction: Determines the volume of NaOH titrant required to reach the point where the pH of the analyte is 7.0 as 0.11938 +/- 0.001 L.
Reasoning: Determines the pH of the analyte at the point at which both acids are neutralized as 9.05 +/- 0.05.
Style: The molecular weight is rounded to 1 decimal place.
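To make this concrete in code, here is a schematic sketch of how such response-criterion pairs might be represented and aggregated into a score. The field names, example weights, and the weighted-average formula are illustrative assumptions for this post, not ProfBench’s actual schema; the NeMo Evaluator SDK handles the real scoring for you.

```python
# Schematic sketch of rubric-based scoring. Field names, example weights, and
# the weighted-average formula are illustrative, not ProfBench's actual schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    dimension: str   # "Extraction", "Reasoning", or "Style"
    text: str        # what the response must contain or do to earn credit
    weight: float    # relative importance assigned by the expert annotator

def rubric_score(verdicts: dict[Criterion, bool]) -> float:
    """Weighted fraction of rubric criteria that the judge marked as fulfilled."""
    total = sum(c.weight for c in verdicts)
    earned = sum(c.weight for c, fulfilled in verdicts.items() if fulfilled)
    return earned / total if total else 0.0

rubric = [
    Criterion("Extraction", "States that a breach of IFFIm's liquidity policy "
                            "could negatively impact IFFIm's rating profile.", 3.0),
    Criterion("Reasoning", "States that vaccines are the most successful and "
                           "cost-effective health investments in the world.", 2.0),
    Criterion("Style", "Presents findings clearly to allow for effective use.", 1.0),
]

# Verdicts would normally come from an LLM judge reading the model's report.
print(rubric_score({rubric[0]: True, rubric[1]: True, rubric[2]: False}))  # 0.833...
```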
How Was the Benchmark Created?
ProfBench was built by the very experts it’s designed to test. We recruited 38 professionals from 8 countries, all holding PhDs, MBAs, or equivalent work experience in their respective fields. Together, these experts contributed over 7,000 rubrics across 80 tasks.
These experts curated the prompts themselves, basing them on tasks they might assign to a junior colleague. Most importantly, they also wrote the detailed, multi-point grading rubrics from scratch.
To ensure true human-level authenticity and prevent model bias, we disallowed the use of LLMs at any stage of the annotation process. This is a benchmark built by human professionals for evaluating professional-grade AI.
Why Release ProfBench — and Why Now?
The lack of strong evaluation datasets is one of the biggest bottlenecks preventing open-source models from tackling complex, professional tasks. To make these evaluations seamless and reproducible, ProfBench is now fully supported through the NeMo Evaluator SDK, enabling automated, rubric-based scoring and side-by-side model comparisons out of the box.
Our primary goal is to spur progress across the open-source community by providing a transparent, public benchmark, a true north for developing models and agentic systems that can tackle real-world business and science research challenges. Just as datasets like SWE-Bench have pushed the field forward, we see ProfBench as the next step in our contribution to the ecosystem, building on our work with open-source NVIDIA Nemotron models and training data.
This work also has immediate benefits for enterprise users, helping businesses that use AI apply rubric-based evaluations more effectively and providing confidence in workflows and tools like LLM-as-a-Judge. Long term, this benchmark is foundational for building the next generation of models, ones that can provide real-world value to human professionals. For AI to become a true professional partner, it must move beyond simple knowledge recall to master complex, real-world reasoning. ProfBench provides that critical roadmap, showing us where today’s AI stands and lighting the path toward solving problems that, until now, only human experts could.
How Do Today’s Models Perform?
Fig 2. Cost of running full evaluation (16 samples per prompt) with human-identified reference documents against performance on ProfBench.
ProfBench poses a significant challenge even for state-of-the-art models. The top-performing model, GPT-5-High, scored just 65.9% overall when supplied with human-identified reference documents (the easiest setting) and 49.4% with an LLM-only setup (the hardest setting). This demonstrates the large gap that still exists between current AI models and expert-level professional performance. Notably, the model struggled the most with the Physics domain, scoring only 49.3% even when provided with reference documents.
How to Use This Dataset
We’re excited to see how the community uses ProfBench to test, fine-tune, and build the next generation of models and generative AI systems.
You can use the dataset out of the box with the newly released NeMo Evaluator SDK. Note that because the reference documents are not included in this dataset, NeMo Evaluator only supports running these benchmarks in the LLM-only setup (the hardest setting).
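To picture what the LLM-only setting involves under the hood, the sketch below generates a candidate report and then asks a judge model to grade it against a single rubric criterion through an OpenAI-compatible endpoint. The endpoint URL, model names, and prompt wording are placeholders for illustration; in practice, the NeMo Evaluator SDK drives this generate-and-judge loop for you from its own configuration.

```python
# Rough sketch of the LLM-only flow: generate a candidate report, then have a
# judge model grade it against one rubric criterion. The endpoint URL, model
# names, and prompts are placeholders; NeMo Evaluator automates this loop.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

task_prompt = "Draft a detailed investment memo assessing IFFIm-style securitization ..."
criterion = ("States that a breach of IFFIm's liquidity policy could negatively "
             "impact IFFIm's rating profile.")

# 1) The candidate model writes the professional report (no reference documents).
report = client.chat.completions.create(
    model="candidate-model",  # placeholder name for the model being evaluated
    messages=[{"role": "user", "content": task_prompt}],
).choices[0].message.content

# 2) The judge model decides whether the report fulfills the rubric criterion.
verdict = client.chat.completions.create(
    model="judge-model",      # placeholder name for the LLM judge
    messages=[{"role": "user", "content":
        f"Criterion: {criterion}\n\nCandidate response:\n{report}\n\n"
        "Does the response fulfill the criterion? Answer YES or NO."}],
).choices[0].message.content

print("criterion fulfilled:", verdict.strip().upper().startswith("YES"))
```

Repeating this over every response-criterion pair and averaging the verdicts, as in the scoring sketch earlier in this post, yields the benchmark score.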
ProfBench is released under the NVIDIA Evaluation Dataset License. Learn more about it in the paper. We look forward to seeing what you build with it.