look quite the identical as before. As a software engineer within the AI space, my work has been a hybrid of software engineering, AI engineering, product intuition, and doses of user empathy.
With a lot occurring, I desired to take a step back and reflect on the larger picture, and the type of skills and mental models engineers must stay ahead. A recent read of gave me the nudge to also desired to deep dive into learn how to take into consideration evals — a core component in any AI system.
Outside of research labs like OpenAI or Anthropic, most of us aren’t training models from scratch. The actual work is about solving business problems with the tools we have already got — giving models enough relevant context, using APIs, constructing RAG pipelines, tool-calling — all on top of the standard SWE concerns like deployment, monitoring and scaling.
In other words, AI engineering isn’t replacing software engineering — it’s layering recent complexity on top of it.
This piece is me teasing out a few of those themes. If any of them resonates, I’d love to listen to your thoughts — be at liberty to achieve out here!
The three layers of an AI application stack
Consider an AI app as being built on three layers: 1) Application development 2) Model development 3) Infrastructure.
Most teams start from the highest. With powerful models available off the shelf, it often is sensible to start by specializing in constructing the product and only later dip into model development or infrastructure as needed.
As O’Reilly puts it,
Why evals matter and why they’re tough
In software, one in all the largest headaches for fast-moving teams is regressions. You ship a brand new feature, and in the method unknowingly break something else. Weeks later, a bug surfaces in a dusty corner of the codebase, and tracing it back becomes a nightmare.
Having a comprehensive test suite helps catch these regressions.
AI development faces an identical problem. Every change — whether it’s prompt tweaks, RAG pipeline updates, fine-tuning, or context engineering — can improve performance in a single area while quietly degrading one other.
But evaluating AI isn’t straightforward. Firstly, the more intelligent models change into, the harder evaluation gets. It’s easy to inform if a book summary is bad if it’s gibberish, but much harder if the summary is definitely coherent. o know whether it’s actually capturing the important thing points, not only sounding fluent or factually correct, you would possibly should read the book yourself.
Secondly, tasks are sometimes open-ended. There’s rarely a single “right” answer and not possible to curate a comprehensive list of correct outputs.
Thirdly, foundation models are treated as black boxes, where details of model architecture, training data and training process are sometimes scrutinised and even made public. These details reveal alot a couple of model’s strengths and weaknesses and without it, people only evaluate models based by observing it’s outputs.
Methods to take into consideration evals
I prefer to group evals into two broad realms: quantitative and qualitative.
Quantitative evals have clear, unambiguous answers. Did the maths problem get solved accurately? Did the code execute without errors? These can often be tested mechanically, which makes them scalable.
Qualitative evals, however, live within the grey areas. They’re about interpretation and judgment — like grading an essay, assessing the tone of a chatbot, or deciding whether a summary “sounds right.”
Most evals are a mixture of each. For instance, evaluating a generated website means not only testing whether it performs its intended functions (quantitative: can a user join, log in, etc.), but additionally judging whether the user experience feels intuitive (qualitative).
Functional correctness
At the guts of quantitative evals is functional correctness: does the model’s output actually do what it’s speculated to do?
Should you ask a model to generate an internet site, the core query is whether or not the location meets its requirements. Can a user complete key actions? Does it work reliably? This looks lots like traditional software testing, where you run a product against a collection of test cases to confirm behaviour. Often, this may be automated.
Similarity against reference data
Not all tasks have such clear, testable outputs. Translation is an excellent example: there’s no single “correct” English translation for a French sentence, but you’ll be able to compare outputs against reference data.
The downside: This relies heavily on the provision of reference datasets, that are expensive and time-consuming to create. Human-generated data is taken into account the gold standard, but increasingly, reference data is being bootstrapped by other AIs.
There are a number of ways to measure similarity:
- Human judgement
- Exact match: whether the generated response matches one in all the reference responses exactly. These produces boolean results.
- Lexical similarity: measuring how similar the outputs look (e.g., overlap in words or phrases).
- Semantic similarity: measuring whether the outputs mean the identical thing, even when the wording is different. This often involves turning data into embeddings (numerical vectors) and comparing them. Embeddings aren’t only for text — platforms like Pinterest use them for images, queries, and even user profiles.
Lexical similarity only checks surface-level resemblance, while semantic similarity digs deeper into meaning.
AI as a judge
Some tasks are nearly not possible to judge cleanly with rules or reference data. Assessing the tone of a chatbot, judging the coherence of a summary, or critiquing the persuasiveness of ad copy all fall into this category. Humans can do it, but human evals don’t scale.
Here’s learn how to structure the method:
- Define a structured and measurable evaluation criteria. Be explicit about what you care about — clarity, helpfulness, factual accuracy, tone, etc. Criteria can use a scale (1–5 rating) or binary checks (pass/fail).
- The unique input, the generated output, and any supporting context are given to the AI judge. A rating, label and even a proof for evaluation is then returned by the judge.
- Aggregate over many outputs. By running this process across large datasets, you’ll be able to uncover patterns — for instance, noticing that helpfulness dropped 10% after a model update.
Because this may be automated, it enables continuous evaluation, borrowing from CI/CD practices in software engineering. Evals may be run before and after pipeline changes (from prompt tweaks to model upgrades), or used for ongoing monitoring to catch drift and regressions.
In fact, AI judges aren’t perfect. Just as you wouldn’t fully trust a single person’s opinion, you shouldn’t fully trust a model’s either. But with careful design, multiple judge models, or running them over many outputs, they will provide
Eval driven development
O’Reilly talked in regards to the concept of eval-driven development, inspired by test-driven development in software engineering, something I felt is price sharing.
The concept is straightforward: Define your evals before you construct.
In AI engineering, this implies deciding what “success” looks like and the way it’ll be measured.
Impact still matters most — not hype. The proper evals make sure that AI apps reveal value in ways which can be relevant to users and the business.
When defining evals, listed below are some key considerations:
Domain knowledge
Public benchmarks exist across many domains — code debugging, legal knowledge, tool use — but they’re often generic. Probably the most meaningful evals often come from sitting down with stakeholders and defining what truly matters for the business, then translating that into measurable outcomes.
Correctness isn’t enough if the answer is impractical. For instance, a text-to-SQL model might generate an accurate query, but when it takes 10 minutes to run or consumes huge resources, it’s not useful at scale. Runtime and memory usage are necessary metrics too.
Generation capability
For generative tasks — whether text, image, or audio — evals may include fluency, coherence, and task-specific metrics like relevance.
A summary is perhaps factually accurate but miss a very powerful points — an eval should capture that. Increasingly, these qualities can themselves be scored by one other AI.
Factual consistency
Outputs should be checked against a source of truth. This may occur in two ways:
- Local consistency
This implies verifying outputs against a provided context. This is particularly useful for specific domains which can be unique to themselves and have limited scope. As an example, extracted insights must be consistent with the information. - Global consistency
This implies verifying outputs against open knowledge sources comparable to by fact checking via an internet search or a market research and so forth. - Self verification
This happens when a model generates multiple outputs, and measures how consistent these responses are with one another.
Safety
Beyond the standard concept of safety comparable to to not include profanity and explicit content, there are literally some ways by which safety may be defined. As an example, chatbots mustn’t reveal sensitive customer data and may find a way to protect against prompt injection attacks.
To sum up
As AI capabilities grow, robust evals will only change into more necessary. They’re the guardrails that permit engineers move quickly without sacrificing reliability.
I’ve seen how difficult reliability may be and the way costly regressions are. They damage an organization’s status, frustrate users, and create painful dev experiences, with engineers stuck chasing the identical bugs again and again.
Because the boundaries between engineering roles blur, especially in smaller teams, we’re facing a fundamental shift in how we take into consideration software quality. The necessity to keep up and measure reliability now extends beyond rule-based systems to those which can be inherently probabilistic and stochastic.