Across the organizations where this approach has emerged and begun to be applied, the first step is shifting the unit of analysis.
For instance, in one UK hospital system between 2021 and 2024, the question expanded from whether a medical AI application improves diagnostic accuracy to how the presence of AI across the hospital’s multidisciplinary teams affects not only accuracy but also coordination and deliberation. The hospital specifically compared coordination and deliberation in human teams using and not using AI. Multiple stakeholders (inside and outside the hospital) selected metrics such as how AI influences collective reasoning, whether it surfaces neglected considerations, whether it strengthens or weakens coordination, and whether it changes established risk and compliance practices.
This shift is fundamental. It matters most in high-stakes contexts where system-level effects matter more than task-level accuracy. It also matters for the economy: it can help recalibrate inflated expectations of sweeping productivity gains that are, so far, predicated largely on the promise of improving individual task performance.
Once that foundation is in place, HAIC benchmarking can begin to tackle the element of time.
Today’s benchmarks resemble school exams: one-off, standardized tests of accuracy. But real professional competence is assessed differently. Junior doctors and lawyers are evaluated continuously within real workflows, under supervision, with feedback loops and accountability structures. Performance is judged over time and in a particular context, because competence is relational. If AI systems are meant to operate alongside professionals, their impact should be judged longitudinally, reflecting how performance unfolds over repeated interactions.
I saw this aspect of HAIC applied in one of my humanitarian-sector case studies. Over 18 months, an AI system was evaluated within real workflows, with particular attention to how detectable its errors were, that is, how easily human teams could identify and correct them. This long-term “record of error detectability” meant the organizations involved could design and test context-specific guardrails to build trust in the system, despite the inevitability of occasional AI mistakes.
A longer time horizon also makes visible the system-level consequences that short-term benchmarks miss. An AI application may outperform a single doctor on a narrow diagnostic task yet fail to improve multidisciplinary decision-making. Worse, it may introduce systemic distortions: anchoring teams too early in plausible but incomplete answers, adding to people’s cognitive workloads, or generating downstream inefficiencies that offset any speed gains at the point of the AI’s use. These knock-on effects, often invisible to current benchmarks, are central to understanding real impact.
The HAIC approach, admittedly, promises to make benchmarking more complex, resource-intensive, and harder to standardize. But continuing to evaluate AI in sanitized conditions detached from the world of work will leave us misunderstanding what it truly can and cannot do for us. To deploy AI responsibly in real-world settings, we must measure what actually matters: not only what a model can do alone, but what it enables, or undermines, when humans and teams in the real world work with it.
