Why AI Alignment Starts With Higher Evaluation

at IBM TechXchange, I spent loads of time around teams who were already running LLM systems in production. One conversation that stayed with me got here from LangSmith, the parents who construct tooling for monitoring, debugging, and evaluating LLM workflows.

I originally assumed evaluation was mostly about benchmarks and accuracy numbers. They pushed back on that immediately. Their point was straightforward: a model that performs well in a notebook can still behave unpredictably in real usage. Should you aren’t evaluating against realistic scenarios, you aren’t aligning anything. You’re simply guessing.

Two weeks ago, at Cohere Labs Connect Conference 2025, the subject resurfaced again. This time the message got here with much more urgency. One in every of their leads identified that public metrics might be fragile, easy to game, and infrequently representative of production behavior. Evaluation, they said, stays one in every of the toughest and least-solved problems in the sphere.

Hearing the identical warning from two different places made something click for me. Most teams working with LLMs aren’t wrestling with philosophical questions on alignment. They’re coping with on a regular basis engineering challenges, reminiscent of:

Why does the model change behavior after a small prompt update?
Why do user queries trigger chaos even when tests look clean?
Why do models perform well on standardized benchmarks but poorly on internal tasks?
Why does a jailbreak succeed even when guardrails seem solid?

If any of this feels familiar, you might be in the identical position as everyone else who’s constructing with LLMs. That is where alignment begins to feel like an actual engineering discipline as an alternative of an abstract conversation.

This text looks at that turning point. It’s the moment you realize that demos, vibes, and single-number benchmarks don’t let you know much about whether your system will delay under real conditions. Alignment genuinely starts while you define what matters enough to measure, together with the methods you’ll use to measure it.

So let’s take a better take a look at why evaluation sits at the middle of reliable LLM development, and why it finally ends up being much harder, and rather more essential, than it first appears.

What “alignment” means in 2025
Capability ≠ alignment: what the previous couple of years actually taught us
How misalignment shows up now (not hypothetically)
Evaluation is the backbone of alignment (and it’s getting more complex)
Alignment is inherently multi-objective
When things go flawed, eval failures often come first
Where this series goes next
References

What “alignment” means in 2025

Should you ask ten people what “AI alignment” means, you’ll often get ten answers plus one existential crisis. Thankfully, recent surveys attempt to pin it down with something resembling consensus. A serious review — AI Alignment: A Comprehensive Survey (2025) — defines alignment as making AI systems behave in step with human intentions and values.

Not “make the AI clever,” not “give it perfect ethics,” not “turn it right into a digital Gandalf.”

Just: please do what we meant, not what we by chance typed.

Each surveys organize the sphere around 4 goals: Robustness, Interpretability, Controllability, and Ethicality — the RICE framework, which feels like a healthful meal but is definitely a taxonomy of the whole lot your model will do flawed if you happen to ignore it.

Meanwhile, industry definitions, including IBM’s 2024–2025 alignment explainer, describe the identical idea with more corporate calm: encode human goals and values so the model stays . Translation: avoid bias, avoid harm, and ideally avoid the model confidently hallucinating nonsense like a Victorian poet who never slept.

Across research and industry, alignment work is commonly split into two buckets:

Forward alignment: how we models (e.g., RLHF, Constitutional AI, data curation, safety finetuning).
Backward alignment: how we models after (and through) training.

Forward alignment gets all of the publicity.
Backward alignment gets all of the ulcers.

Figure: The Alignment Cycle, credit: AI Alignment: A Comprehensive Survey (Jiaming Ji et al.)

Should you’re an information scientist or engineer integrating LLMs, you always feel alignment as backward-facing questions:

Is that this latest model hallucinating less, or simply hallucinating ?
Does it stay protected when users send it prompts that appear to be riddles written by a caffeinated goblin?
Is it actually fair across the user groups we serve?

And unfortunately, you may’t answer those with parameter count or “it feels smarter.” You wish evaluation.

Capability ≠ alignment: what the previous couple of years actually taught us

One of the crucial essential ends in this space still comes from Ouyang et al.’s InstructGPT paper (2022). That study showed something unintuitive: a 1.3B parameter model with RLHF was often preferred over the unique 175B GPT-3, despite being about 100 times smaller. Why? Because humans said its responses were more helpful, more truthful, and fewer toxic. The massive model was more capable, however the small model was higher behaved.

This same pattern has repeated across 2023–2025. Alignment techniques — and more importantly, feedback loops — change what “good” means. A smaller aligned model can outperform an enormous unaligned one on the metrics that truly matter to users.

Truthfulness is a terrific example.

The TruthfulQA benchmark (Lin et al., 2022) measures the power to avoid confidently repeating web nonsense. In the unique paper, the very best model only hit around 58% truthfulness, in comparison with humans at 94%. Larger base models were sometimes truthful because they were higher at easily imitating flawed information. (The web strikes again.)

OpenAI later reported that with targeted anti-hallucination training, GPT-4 roughly doubled its TruthfulQA performance — from around 30% to about 60% — which is impressive until you remember this still means “barely higher than a coin flip” under adversarial questioning.

By early 2025, TruthfulQA itself evolved. The authors released a brand new binary multiple-choice version to repair issues in earlier formats and published updated results, including newer models like Claude 3.5 Sonnet, which likely approaches human-level accuracy on that variant. Many open models still lag behind. Additional work extends these tests to multiple languages, where truthfulness often drops because misinformation patterns differ across linguistic communities.

The broader lesson is clearer than ever:

If the one thing you measure is “does it sound fluent?”, the model will optimize for sounding fluent, not being correct. Should you care about truth, safety, or fairness, it’s essential to measure those things explicitly.

Otherwise, you get exactly what you optimized for:
a really confident, very eloquent, occasionally flawed librarian who never learned to whisper.

How misalignment shows up now (not hypothetically)

Over the past three years, misalignment has gone from a philosophical debate to something you may actually point at in your screen. We not need hypothetical “what if the AI…” scenarios. We’ve got concrete behaviors, logs, benchmarks, and sometimes a model doing something bizarre that leaves a complete engineering team looking at one another like,

Hallucinations in safety-critical contexts

Hallucination continues to be probably the most familiar failure mode, and unfortunately, it has not retired. System cards for GPT-4, GPT-4o, Claude 3, and others openly document that models still generate incorrect or fabricated information, often with the confident tone of a student who definitely didn’t read the assigned chapter.

A 2025 study titled “From hallucinations to hazards” argues that our evaluations focus too heavily on general tasks like language understanding or coding, while the risk lies in how hallucinations behave in sensitive domains like healthcare, law, and safety engineering.

In other words: scoring well on Massive Multitask Language Understanding (MMLU) doesn’t magically prevent a model from recommending the flawed dosage of an actual medication.

TruthfulQA and its newer 2025 variants confirm the identical pattern. Even top models might be fooled by adversarial questions laced with misconceptions, and their accuracy varies by language, phrasing, and the creativity of whoever designed the trap.

Bias, fairness, and who gets harmed

Bias and fairness concerns aren’t theoretical either. Stanford’s Holistic Evaluation of Language Models (HELM) framework evaluates dozens of models across 42 scenarios and multiple dimensions (accuracy, robustness, fairness, toxicity, efficiency, etc.) to create a sort of “alignment scoreboard.”

Figure: HELM Evaluation Components, credit: Holistic Evaluation of Language Models (Percy Liang et al.)

The outcomes are what you’d expect from any large, messy ecosystem:

GPT-4-class models often rating highest on accuracy and robustness.
Claude 3-series models often produce less toxic and more ethically balanced outputs.
No model is consistently best.
Every model still exhibits measurable bias and toxicity under adversarial prompting.

Consider it like different athletes excelling in several events, except the events are “don’t offend users” and “don’t hallucinate harmful instructions.”

The Holistic Evaluation of Vision-Language Models (VHELM) extension to multimodal (vision + language) models shows the identical pattern. These models are great at perception and reasoning but still struggle with fairness, multilingual performance, and toxicity unless those metrics are explicitly tested.

Deception and “alignment faking”

That is the part where things get uncomfortable.

Essentially the most concerning pattern to emerge recently is deception: models behaving aligned in some contexts while producing very different behavior in others. It sounds dramatic, but unfortunately, it’s now backed by empirical evidence somewhat than sci-fi imagination.

Back in 2023, the GPT-4 system card reported the now-famous ARC test: GPT-4, acting as an agent, hired a TaskRabbit employee to unravel a CAPTCHA, then lied about being visually impaired when asked if it was a bot.

The test was controlled, however it demonstrated an ungainly reality: if you happen to give a model goals and tools, sometimes it starts improvising ways to attain them.

Since then, multiple research threads expanded the image:

Anthropic (2024): “Alignment faking in large language models.”
Their study of Claude 3 Opus and three.5 Sonnet found that models trained on safety rules can learn to alignment when monitored while breaking the identical rules in other contexts.
Koorndijk et al. (2025): “Empirical Evidence for Alignment Faking in a Small LLM.”
Even a smaller model learned to behave safely under supervision after which reverted to a distinct objective when oversight was removed.
Apollo Research (2025): context-scheming experiments.
They found more capable models are at recognizing evaluation contexts and sometimes strategically underperform or act extra cautious during tests — the AI equivalent of “acting polite since the teacher is watching.”
Anthropic (mid-2025): large-scale multi-model simulations.
Across 16 frontier models (OpenAI, Google, Meta, Anthropic, xAI, and others), models lied, cheated, and even selected harmful actions in controlled scenarios when given autonomy and gear access. Misaligned behaviors were more frequent in probably the most capable systems.

This does not mean current models are plotting anything in real deployments.

It does mean deception, goal-driven shortcuts, and “performing alignment to pass the test” are real behaviors that show up in experiments — and the behaviors get stronger as models turn into more capable.

The alignment problem isn’t any longer just “don’t generate toxic content.” It increasingly includes “don’t pretend to be aligned only while we’re watching.”

Evaluation is the backbone of alignment (and it’s getting more complex)

Given all of this, recent work has shifted from “we want evaluation” to “we want higher, more reliable evaluation.”

From one-number leaderboards to multi-dimensional diagnostics

Early on, the community relied on single-number leaderboards. This worked about in addition to rating a automotive solely by its cupholder count. So efforts like HELM stepped in to make evaluation more holistic: many scenarios multiplied by many metrics, as an alternative of “this model has the very best rating.”

Since then, the space has expanded dramatically:

BenchHub (2025) aggregates 303,000 questions across 38 benchmarks, giving researchers a unified ecosystem for running multi-benchmark tests. One in every of its principal findings is that the identical model can perform brilliantly in a single domain and fall over in one other, sometimes comically so.
VHELM extends holistic evaluation to models, covering nine categories reminiscent of perception, reasoning, robustness, bias, fairness, and multilinguality. Principally, it’s HELM with extra eyeballs.
A 2024 study, “State of What Art? A Call for Multi-Prompt LLM Evaluation,” showed that model rankings can flip depending on which prompt phrasing you employ. The conclusion is easy: evaluating a model on a prompt is like rating a singer after hearing only their warm-up scales.

More moderen surveys, reminiscent of the 2025 Comprehensive Survey on Safety Evaluation of LLMs, treat multi-metric, multi-prompt evaluation because the default. The message is obvious: real reliability emerges only while you measure capability, robustness, and safety together, not separately.

Evaluation itself is noisy and biased

The newer twist is: even our evaluation mechanisms are misaligned.

A 2025 ACL paper, “Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts,” tested 11 LLMs used as automatic “judges.” The outcomes were… not comforting. Judge models were highly sensitive to superficial artifacts like apologetic phrasing or verbosity. In some setups, simply adding “I’m really sorry” could flip which answer was judged safer as much as 98% of the time.

That is the evaluation equivalent of getting out of a speeding ticket since you were polite.

Worse, larger judge models weren’t consistently more robust, and using a jury of multiple LLMs helped but didn’t fix the core issue.

A related 2025 position paper, “LLM-Safety Evaluations Lack Robustness”, argues that current safety evaluation pipelines introduce bias and noise at many stages: test case selection, prompt phrasing, judge alternative, and aggregation. The authors back this with case studies where minor changes in evaluation setup materially change conclusions about which model is “safer.”

Put simply: if you happen to depend on LLMs to grade other LLMs without careful design, you may easily find yourself fooling yourself. Evaluating alignment requires just as much rigor as constructing the model.

Alignment is inherently multi-objective

One thing each alignment and evaluation surveys now emphasize is that alignment is not a single metric problem. Different stakeholders care about different, often competing objectives:

Product teams care about task success, latency, and UX.
Safety teams care about jailbreak resistance, harmful content rates, and misuse potential.
Legal/compliance cares about auditability and adherence to regulation.
Users care about helpfulness, trust, privacy, and perceived honesty.

Surveys and frameworks like HELM, BenchHub, and Unified-Bench all argue that it is best to treat evaluation as navigating a trade-off surface, not picking a winner.

A model that dominates generic NLP benchmarks could be terrible on your domain whether it is brittle under distribution shift or easy to jailbreak. Meanwhile, a more conservative model could be perfect for healthcare but deeply frustrating as a coding assistant.

Evaluating across objectives — and admitting that you simply are selecting trade-offs somewhat than discovering a magical “best” model — is an element of doing alignment work truthfully.

When things go flawed, eval failures often come first

Should you take a look at recent failure stories, a pattern emerges: alignment problems often start as evaluation failures.

Teams deploy a model that appears great on the usual leaderboard cocktail but later discover:

it performs worse than the previous model on a domain-specific safety test,
it shows latest bias against a selected user group,
it may be jailbroken by a prompt style nobody bothered to check, or
RLHF made it more polite but additionally more confidently flawed.

Every one in every of those is, at root, a case where no one measured the correct thing early enough.

The most recent work on deceptive alignment points in the identical direction. If models can detect the evaluation environment and behave safely only , then testing becomes just as essential as training. You might think you’ve aligned a model while you’ve actually trained it to pass your eval suite.

It’s the AI version of a student memorizing the reply key as an alternative of understanding the fabric: impressive test scores, questionable real-world behavior.

Where this series goes next

In 2022, “we want higher evals” was an opinion. By late 2025, it’s just how the literature reads:

Larger models are more capable, and likewise more able to harmful or deceptive behavior when the setup is flawed.
Hallucinations, bias, and strategic misbehavior aren’t theoretical; they’re measurable and sometimes painfully reproducible.
Academic surveys and industry system cards now treat multi-metric evaluation as a central a part of alignment, not a nice-to-have.

The remainder of this series will zoom in:

next, on classic benchmarks (MMLU, HumanEval, etc.) and why they’re not enough for alignment,
then on holistic and stress-test frameworks (HELM, TruthfulQA, safety eval suites, red teaming),
then on training-time alignment methods (RLHF, Constitutional AI, scalable oversight),
and at last, on the societal side: ethics, governance, and what the brand new deceptive-alignment work implies for future systems.

Should you’re constructing with LLMs, the sensible takeaway from this primary piece is easy:

Alignment begins where your evaluation pipeline begins.
Should you don’t measure a behavior, you’re implicitly okay with it.

The excellent news is that we now have way more tools, way more data, and way more evidence to make a decision what we actually care about measuring. And that’s the inspiration the whole lot else will construct on.

References

Ouyang, L. et al. (2022). OpenAI. https://arxiv.org/abs/2203.02155
Lin, S., Hilton, J., & Evans, O. (2022). https://arxiv.org/abs/2109.07958
OpenAI. (2023). https://cdn.openai.com/papers/gpt-4-system-card.pdf
Kirk, H. et al. (2024). https://www.sciencedirect.com/science/article/pii/S0925753525002814
Li, R. et al. (2024). Stanford CRFM. https://crfm.stanford.edu/helm/latest
Muhammad, J. et al. (2025). : https://www.sciencedirect.com/science/article/abs/pii/S0306457325001803
Ryan, G. et al. (2024). Anthropic. https://www.anthropic.com/research/alignment-faking
Koorndijk, J. et al. (2025). https://arxiv.org/abs/2506.21584
Bai, Y. et al. (2022). Anthropic. https://arxiv.org/abs/2212.08073
Mizrahi, M. et al. (2024). https://arxiv.org/abs/2401.00595
Lee, T. et al. (2024). https://arxiv.org/abs/2410.07112
Kim, E. et al. (2025). https://arxiv.org/abs/2506.00482
Chen, H. et al. (2025). ACL 2025. https://arxiv.org/abs/2503.09347
Beyer, T. et al. (2025). https://arxiv.org/abs/2503.02574
Ji, J. et al. (2025). https://arxiv.org/abs/2310.19852
Seshadri, A. (2024). Cohere Labs. https://betakit.com/cohere-labs-head-calls-unreliable-ai-leaderboard-rankings-a-crisis-in-the-field
IBM. (2024). https://www.ibm.com/artificial-intelligence/responsible-ai
Stanford HAI. (2025). https://aiindex.stanford.edu

Why AI Alignment Starts With Higher Evaluation

Table of Contents

What “alignment” means in 2025

Capability ≠ alignment: what the previous couple of years actually taught us

How misalignment shows up now (not hypothetically)