Development of the benchmark at HongShan began in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public.
Xbench approaches the problem with two different systems. One is analogous to traditional benchmarking: an academic test that gauges a model’s aptitude on various subjects. The other is more like a technical interview round for a job, assessing how much real-world economic value a model might deliver.
Xbench’s methods for assessing raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing postgraduate-level STEM benchmarks like GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the correct answer but also the reasoning chain that leads to it.
DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature: questions that can’t simply be googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. One question in the public set is “How many Chinese cities in the three northwestern provinces border a foreign country?” (It’s 12, and only 33% of models tested got it right, if you’re wondering.)
On the company’s website, the researchers say they want to add more dimensions to the test: for instance, how creative a model is in its problem solving, how collaborative it is when working with other models, and how reliable it is.
The team has committed to updating the test questions once a quarter and to maintaining a half-public, half-private data set.
To evaluate models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to match advertisers with appropriate short-video creators from a pool of over 800 influencers.
The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories haven’t yet been open-sourced.
ChatGPT-o3 again ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.
“It’s really difficult for benchmarks to include things that are so hard to quantify,” says Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at NYU. “But Xbench represents a promising start.”