Evaluation-Driven Development for LLM-Powered Products: Lessons from Building in Healthcare



Progress in the field of large language models (LLMs) and their applications is very rapid. Costs are coming down and foundation models are increasingly capable, able to handle text, images and video. Open-source solutions have also exploded in diversity and capability, with many models lightweight enough to explore, fine-tune and iterate on without huge expense. Finally, cloud AI training and inference providers such as Databricks and Nebius are making it increasingly easy for organizations to scale their applied AI products from POCs to production-ready systems. These advances go hand in hand with a diversification of the business uses of LLMs and the rise of agentic applications, where models plan and execute complex multi-step workflows that may involve interaction with tools or other agents. These technologies are already making an impact in healthcare, and that impact is projected to grow rapidly [1].

All of this capability makes it exciting to get started, and building a baseline solution for a specific use case can be very fast. However, by their nature LLMs are non-deterministic and less predictable than traditional software or ML models. The real challenge therefore comes in iteration: How do we know that our development process is improving the system? If we fix an issue, how do we know the change won't break something else? Once in production, how do we check whether performance is on par with what we saw in development? Answering these questions for systems that make single LLM calls is difficult enough, but with agentic systems we also need to think about all the individual steps and the routing decisions made between them. To deal with these issues, and therefore gain trust and confidence in the systems we build, we need evaluation-driven development. This is a strategy that places iterative, actionable evaluation at the core of the product lifecycle, from development and deployment to monitoring.

As a data scientist at Nuna, Inc., a healthcare AI company, I've been spearheading our efforts to embed evaluation-driven development into our products. With the support of our leadership, we're sharing some of the key lessons we've learned so far. We hope these insights will be helpful not only to those building AI in healthcare but also to anyone developing AI products, especially those just starting their journey.

This article is broken into the following sections, which seek to explain our broad learnings from the literature along with tips and tricks gained from experience.

  • In Section 1 we'll briefly touch on Nuna's products and explain why AI evaluation is so critical for us and for healthcare-focused AI in general.
  • In Section 2, we'll explore how evaluation-driven development brings structure to the pre-deployment phase of our products. This involves developing metrics using both LLM-as-judge and programmatic approaches, heavily inspired by this excellent article. Once reliable judges and expert-aligned metrics have been established, we describe how to use them to iterate in the right direction using error analysis. In this section, we'll also touch on the unique challenges posed by chatbot applications.
  • In Section 3, we'll discuss using model-based classification and alerting to monitor applications in production and use this feedback for further improvements.
  • Section 4 summarizes everything we've learned.

Any organization's perspective on these subjects is influenced by the tools it uses. For example, we use MLflow and Databricks Mosaic AI Agent Evaluation to keep track of our pre-production experiments, and AWS Agent Evaluation to test our chatbot. However, we believe the ideas presented here should be applicable regardless of tech stack, and there are many excellent options available from the likes of Arize (Phoenix evaluation suite), LangChain (LangSmith) and Confident AI (DeepEval). Here we'll focus on projects where iterative development mainly involves prompt engineering, but a similar approach could be followed for fine-tuned models too.

1.0 AI and evaluation at Nuna

Nuna's goal is to reduce the total cost of care and improve the lives of people with chronic conditions such as high blood pressure (hypertension) and diabetes, which together affect more than 50% of the US adult population [2,3]. This is done through a patient-focused mobile app that encourages healthy habits such as medication adherence and blood pressure monitoring, along with a care-team-facing dashboard that organizes data from the app for providers*. In order for the system to succeed, both patients and care teams must find it easy to use, engaging and insightful. It must also produce measurable benefits to health. This is critical because it distinguishes healthcare technology from most other technology sectors, where business success is more closely tied to engagement alone.

AI plays a critical, patient- and care-team-facing role in the product: On the patient side we have an in-app care coach chatbot, as well as features such as medication container scanning and meal photo-scanning. On the care-team side we're developing summarization and data sorting capabilities to reduce time to action and tailor the experience for different users. The table below shows the four AI-powered product components whose development served as inspiration for this article, and which will be referred to in the following sections.

  • Scanning of medication containers (image to text). Unique characteristics: multimodal with clear ground truth labels (medication details extracted from the container). Main evaluation components: representative development dataset, iteration and tracking, monitoring in production.
  • Scanning of meals (ingredient extraction, dietary insights and scoring). Unique characteristics: multimodal, a mixture of clear ground truth (extracted ingredients) and subjective judgment of LLM-generated assessments with SME input. Main evaluation components: representative development dataset, appropriate metrics, iteration and tracking, monitoring in production.
  • In-app care coach chatbot (text to text). Unique characteristics: multi-turn transcripts, tool calling, wide range of personas and use cases, subjective judgement. Main evaluation components: representative development dataset, appropriate metrics, monitoring in production.
  • Medical record summarization (text & numerical data to text). Unique characteristics: complex input data, narrow use case, critical need for high accuracy and SME judgement. Main evaluation components: representative development dataset, expert-aligned LLM judge, iteration and tracking.
Figure 1: Table showing the AI use cases at Nuna that are referred to in this article. We believe that the evaluation-driven development framework presented here is sufficiently broad to apply to these and similar types of AI products.

Respect for patients and the sensitive data they entrust us with is at the core of our business. In addition to safeguarding data privacy, we must make sure that our AI products operate in ways that are safe, reliable, and aligned with users' needs. We need to anticipate how people might use the products and test both standard and edge-case uses. Where mistakes are possible, such as ingredient recognition from meal photos, we need to know where to invest in building ways for users to easily correct them. We also have to be on the lookout for more subtle failures: for example, recent research suggests that prolonged chatbot use can lead to increased feelings of loneliness, so we need to identify and monitor for concerning use cases to make sure that our AI is aligned with the goal of improving lives and reducing the cost of care. This aligns with recommendations from the NIST AI Risk Management Framework, which emphasizes preemptive identification of plausible misuse scenarios, including edge cases and unintended consequences, especially in high-impact domains such as healthcare.


2.0 Pre-deployment: Metrics, alignment and iteration 

In the development stage of an LLM-powered product, it is important to establish evaluation metrics that are aligned with the business/product goals, a testing dataset that is representative enough to simulate production behavior, and a robust way to actually calculate the evaluation metrics. With these things in place, we can enter the virtuous cycle of iteration and error analysis (see this short book for details). The faster we can iterate in the right direction, the higher our chances of success. It also goes without saying that whenever testing involves passing sensitive data through an LLM, it must be done from a secure environment with a trusted provider in accordance with data privacy regulations. For example, in the United States, the Health Insurance Portability and Accountability Act (HIPAA) sets strict standards for safeguarding patients' health information. Any handling of such data must meet HIPAA's requirements for security and confidentiality.

2.1 Development dataset 

At the outset of a project, it is important to identify and engage with subject matter experts (SMEs) who can help generate example input data and define what success looks like. At Nuna our SMEs are consultant healthcare professionals such as physicians and nutritionists. Depending on the problem context, we've found that opinions from healthcare experts can be nearly uniform, where the answer can be sourced from core principles of their training, or quite varied, drawing on their individual experiences. To mitigate this, we've found it useful to seek advice from a small panel of experts (typically 2-5) who are engaged from the start of the project and whose consensus view acts as our ultimate source of truth.

It's advisable to work with the SMEs to build a representative dataset of inputs to the system. To do this, we should consider the broad categories of personas who might be using it and the main functionalities. The broader the use case, the more of these there will be. For example, the Nuna chatbot is accessible to all users, helps answer any wellness-based query and also has access to the user's own data via tool calls. Some of the functionalities are therefore "emotional support", "hypertension support", "nutrition advice" and "app support", and we might consider splitting these further into "new user" vs. "existing user" or "skeptical user" vs. "power user" personas. This segmentation is useful for the data generation process and for error analysis later on, after these inputs have run through the system.

It's also important to think about specific scenarios, both typical and edge-case, that the system must handle. For our chatbot these include "user asks for a diagnosis based on symptoms" (we always refer them to a healthcare professional in such situations), "user request is truncated or incomplete", and "user attempts to jailbreak the system". Of course, it's unlikely that all critical scenarios will be accounted for, which is why later iteration (Section 2.5) and monitoring in production (Section 3.0) are required.

With the categories in place, the data itself might be generated by filtering existing proprietary or open-source datasets (e.g. Nutrition5k for food images, OpenAI's HealthBench for patient-clinician conversations). In some cases, both inputs and gold-standard outputs might be available, for example in the ingredient labels on each image in Nutrition5k. This makes metric design (Section 2.3) easier. More commonly though, expert labelling will be required for the gold-standard outputs. Indeed, even when pre-existing input examples are not available, these can be generated synthetically with an LLM and then curated by the team; Databricks has some tools for this, described here.

How big should this development set be? The more examples we have, the more likely it is to be representative of what the model will see in production, but the more expensive it will be to iterate. Our development sets typically start out on the order of a few hundred examples. For chatbots, where representative inputs may need to be multi-step conversations with sample patient data in context, we recommend using a testing framework like AWS Agent Evaluation, where the input example files can be generated manually or via LLM prompting and curation.

2.2 Baseline model pipeline

If starting from scratch, the process of thinking through the use cases and building the development set will likely give the team a sense for the difficulty of the problem and hence the architecture of the baseline system to be built. Unless made infeasible by security or cost concerns, it's advisable to keep the initial architecture simple and use powerful, API-based models for the baseline iteration. The main purpose of the iteration process described in subsequent sections is to improve the prompts in this baseline version, so we typically keep them simple while trying to adhere to general prompt engineering best practices such as those described in this guide by Anthropic.

Once the baseline system is up and running, it should be run on the development set to generate the first outputs. Running the development dataset through the system is a batch process that may need to be repeated many times, so it's worth parallelizing. At Nuna we use PySpark on Databricks for this. The most straightforward method for batch parallelism of this kind is the pandas user-defined function (UDF), which allows us to call the model API in a loop over rows of a pandas DataFrame, and then use PySpark to break up the input dataset into chunks to be processed in parallel over the nodes of a cluster. Another method, described here, first requires us to log a script that calls the model as an MLflow PythonModel object, load that as a pandas UDF and then run inference with it.
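As a rough sketch of what the pandas UDF approach can look like, the snippet below applies a hypothetical call_model helper to each row of a "prompt" column. The helper, the column name and the df_dev DataFrame are illustrative assumptions; the real model call would go through your provider's SDK from a secure environment.

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def call_model(prompt: str) -> str:
    # Placeholder for the real model API call; swap in your provider's SDK here
    return "model output for: " + prompt

@F.pandas_udf(StringType())
def generate_outputs(prompts: pd.Series) -> pd.Series:
    # Each worker receives a chunk of rows and calls the model once per prompt
    return prompts.apply(call_model)

# df_dev is assumed to be a Spark DataFrame holding the development set inputs
outputs_df = df_dev.withColumn("response", generate_outputs("prompt"))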

Figure 2: High-level workflow showing the process of building the development dataset and metrics, with input from subject matter experts (SMEs). Construction of the dataset is iterative. After the baseline model is run, SME critiques can be used to define optimizing and satisficing metrics and their associated thresholds for success. Image created by the author.

2.3 Metric design 

Designing evaluation metrics that are actionable and aligned with the feature's goals is a critical part of evaluation-driven development. Given the context of the feature you're developing, there may be some metrics that are minimum requirements to ship, e.g. a minimum rate of numerical accuracy for a text summary of a graph. Especially in a healthcare context, we have found that SMEs are again essential resources here in identifying additional supplementary metrics that will be important for stakeholder buy-in and end-user interpretation. Asynchronously, SMEs should be able to securely review the inputs and outputs from the development set and comment on them. Various purpose-built tools support this type of review and can be adapted to the project's sensitivity and maturity. For early-stage or low-volume work, lightweight methods such as a secure spreadsheet may suffice. If possible, the feedback should consist of a simple pass/fail decision for each input/output pair, together with a critique of the LLM-generated output explaining the decision. The idea is that we can then use these critiques to inform our choice of evaluation metrics and provide few-shot examples to any LLM judges that we build.

Why pass/fail rather than a Likert score or another numerical metric? This is a developer choice, but we found it is much easier to get alignment between SMEs and LLM judges in the binary case. It is also easy to aggregate results into a simple accuracy measure across the development set. For example, if 30% of the "90-day blood pressure time series summaries" get a zero for groundedness but none of the 30-day summaries do, then this points to the model struggling with long inputs.

At the initial review stage, it is often also useful to document a clear set of guidelines around what constitutes success in the outputs, which gives all annotators a source of truth. Disagreements between SME annotators can often be resolved with reference to these guidelines, and if disagreements persist this can be a sign that the guidelines, and hence the purpose of the AI system, are not defined clearly enough. It's also important to note that depending on your organization's resourcing, ship timelines, and the risk level of the feature, it may not be possible to get SME comments on the entire development set here, so it's important to choose representative examples.

As a concrete example, Nuna has developed a medication logging history AI summary, to be displayed in the care-team-facing portal. Early in the development of this summary, we curated a set of representative patient records, ran them through the summarization model, plotted the data and shared a secure spreadsheet containing the input graphs and output summaries with our SMEs for their comments. From this exercise we identified and documented the need for a variety of metrics including readability, style (i.e. objective and not alarmist), formatting and groundedness (i.e. accuracy of insights against the input time series).

Some metrics can be calculated programmatically with simple tests on the output. This includes formatting and length constraints, and readability as quantified by scores like the Flesch-Kincaid grade level. Other metrics require an LLM judge (see below for more detail) because the definition of success is more nuanced. This is where we prompt an LLM to act like a human expert, giving pass/fail decisions and critiques of the outputs. The idea is that if we can align the LLM judge's results with those of the experts, we can run it automatically on our development set and quickly compute our metrics when iterating.
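A minimal sketch of such programmatic checks, assuming the textstat package for readability scoring, might look like the following; the thresholds and the markdown check are illustrative choices rather than recommendations.

import textstat

MAX_WORDS = 120
MIN_READING_EASE = 60  # Flesch reading ease; higher means easier to read

def programmatic_checks(summary: str) -> dict:
    # Simple pass/fail checks that don't need an LLM judge
    n_words = len(summary.split())
    return {
        "within_length": n_words <= MAX_WORDS,
        "no_markdown_formatting": "##" not in summary and "**" not in summary,
        "readable": textstat.flesch_reading_ease(summary) >= MIN_READING_EASE,
    }

print(programmatic_checks("Over the last 90 days, blood pressure was logged on most days."))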

We found it useful to choose a single "optimizing metric" for each project, for example the proportion of the development set that is marked as accurately grounded in the input data, but back it up with several "satisficing metrics" such as percent within length constraints, percent with suitable style, percent with readability score > 60, etc. Factors like latency percentiles and mean token cost per request also make ideal satisficing metrics. If an update makes the optimizing metric value go up without lowering any of the satisficing metric values below pre-defined thresholds, then we know we're moving in the right direction.
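The decision rule itself is simple enough to capture in a few lines. The sketch below assumes the metrics have already been aggregated into dictionaries keyed by metric name; the names, values and thresholds are purely illustrative.

# Hypothetical aggregated results for the current and candidate prompt versions
baseline = {"groundedness": 0.86, "within_length": 0.97, "style": 0.94, "readable": 0.91}
candidate = {"groundedness": 0.91, "within_length": 0.95, "style": 0.93, "readable": 0.92}

OPTIMIZING_METRIC = "groundedness"
SATISFICING_THRESHOLDS = {"within_length": 0.95, "style": 0.90, "readable": 0.90}

def is_improvement(baseline: dict, candidate: dict) -> bool:
    # Accept the candidate only if the optimizing metric improves and no
    # satisficing metric drops below its pre-agreed threshold
    better = candidate[OPTIMIZING_METRIC] > baseline[OPTIMIZING_METRIC]
    satisficed = all(candidate[m] >= t for m, t in SATISFICING_THRESHOLDS.items())
    return better and satisficed

print(is_improvement(baseline, candidate))  # True for these illustrative numbers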

2.4 Building the LLM judge

The aim of LLM-judge development is to distill the advice of the SMEs into a prompt that enables an LLM to score the development set in a way that is aligned with their professional judgement. The judge is typically a larger/more powerful model than the one being judged, though this isn't a strict requirement. We found that while it's possible to have a single LLM judge prompt output the scores and critiques for several metrics, this can be confusing and incompatible with the tracking tools described in Section 2.5. We therefore make a single judge prompt per metric, which has the added benefit of forcing conservatism in the number of LLM-generated metrics.

An initial judge prompt, to be run on the development set, might look something like the block below. The instructions will be iterated on during the alignment step, so at this stage they should represent a best effort to capture the SMEs' thought process when writing their critiques. It's important to make sure that the LLM provides its reasoning, and that this is detailed enough to understand why it made the determination. We should also double-check the reasoning against its pass/fail judgement to make sure they are logically consistent. For more detail about LLM reasoning in cases like this, we recommend this excellent article.


You are an expert healthcare professional who has been asked to evaluate a summary of a patient's medical data that was made by an automated system.

Please follow these instructions for evaluating the summaries:

{detailed instructions}

Now carefully study the following input data and output response, giving your reasoning and a pass/fail judgement in the specified output format.

{data to be summarized}

{formatting instructions}

To keep the judge outputs as reliable as possible, its temperature setting should be as low as possible. To validate the judge, the SMEs need to see representative examples of input, output, judge decision and judge critique. This should preferably be a different set of examples than those they looked at during metric design, and given the human effort involved in this step it can be small.

The SMEs might first give their own pass/fail assessments for each example without seeing the judge's version. They should then be able to see everything and have the opportunity to modify the model's critique to become more aligned with their own thoughts. The results can be used to make modifications to the LLM judge prompt and the process repeated until the alignment between the SME assessments and model assessments stops improving, as time constraints allow. Alignment can be measured using simple accuracy or statistical measures such as Cohen's kappa. We have found that including relevant few-shot examples in the judge prompt typically helps with alignment, and there is also work suggesting that adding grading notes for each example to be judged can be helpful.
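As a small illustration of measuring alignment, assuming scikit-learn is available and that the SME and judge verdicts have been collected as parallel lists of pass/fail labels (the labels below are made up):

from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical pass/fail verdicts on the same review set (1 = pass, 0 = fail)
sme_verdicts = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
judge_verdicts = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]

print("Raw agreement:", accuracy_score(sme_verdicts, judge_verdicts))
print("Cohen's kappa:", cohen_kappa_score(sme_verdicts, judge_verdicts))
# Kappa corrects for chance agreement, which matters when most examples pass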

We have typically used spreadsheets for this type of iteration, but more sophisticated tools such as Databricks' review apps also exist and could be adapted for LLM judge prompt development. With expert time in short supply, LLM judges are very important in healthcare AI, and as foundation models become more sophisticated, their ability to stand in for human experts appears to be improving. OpenAI's HealthBench work, for example, found that physicians were generally unable to improve the responses generated by April 2025's models and that when GPT-4.1 is used as a grader on healthcare-related problems, its scores are very well aligned with those of human experts [4].

Figure 3: High-level workflow showing how the development set (Section 2.1) is used to build and align LLM judges. Experiment tracking is used for the evaluation loop, which involves calculating metrics, refining the model, regenerating the output and re-running the judges. Image created by the author.

2.5 Iteration and tracking

With our LLM judges in place, we're finally in a good position to start iterating on our main system. To do so, we'll systematically update the prompts, regenerate the development set outputs, run the judges, compute the metrics and compare the new and old results. This is an iterative process with potentially many cycles, which is why it benefits from tracing, prompt logging and experiment tracking. The process of regenerating the development dataset outputs is described in Section 2.2, and tools like MLflow make it possible to track and version the judge iterations too. We use Databricks Mosaic AI Agent Evaluation, which provides a framework for adding custom judges (both LLM and programmatic), along with several built-in ones with pre-defined prompts (we typically turn these off). In code, the core evaluation commands look like this:


with mlflow.start_run(
    run_name=run_name,
    log_system_metrics=True,
    description=run_description,
) as run:

    # run the programmatic tests

    results_programmatic = mlflow.evaluate(
        predictions="response",
        data=df,  # df contains the inputs, outputs and any relevant context, as a pandas DataFrame
        model_type="text",
        extra_metrics=programmatic_metrics,  # list of custom mlflow metrics, each with a function describing how the metric is calculated
    )

    # run the llm judge with the extra metrics we configured
    # note that here we also include a dataframe of few-shot examples to
    # help guide the LLM judge.

    results_llm = mlflow.evaluate(
        data=df,
        model_type="databricks-agent",
        extra_metrics=agent_metrics,  # agent_metrics is a list of custom MLflow metrics, each with its own judge prompt
        evaluator_config={
            "databricks-agent": {
                "metrics": ["safety"],  # only keep the “safety” default judge
                "examples_df": pd.DataFrame(agent_eval_examples),
            }
        },
    )

    # Also log the prompts (judge and main model) and any other useful artifacts, such as plots of the results, along with the run

Under the hood, MLflow will issue parallel calls to the judge models (packaged in the agent metrics list in the code above) and also call the programmatic metrics with the relevant functions (in the programmatic metrics list), saving the results and relevant artifacts to Unity Catalog and providing a nice user interface with which to compare metrics across experiments, view traces and read the LLM judge critiques. It should be noted that MLflow 3.0, released just after this was written, has some tooling that may simplify the code above.

To identify improvements with the highest ROI, we can revisit the development set segmentation into personas, functionalities and scenarios described in Section 2.1. We can compare the value of the optimizing metric between segments and choose to focus our prompt iterations on the one with the lowest scores, or with the most concerning edge cases. With our evaluation loop in place, we can catch any unintended consequences of model updates. Moreover, with tracking we can reproduce results and revert to previous prompt versions if needed.
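A minimal sketch of this kind of segment comparison, assuming the per-example judge results have been joined back onto the segment labels in a pandas DataFrame (column names and values are illustrative):

import pandas as pd

results = pd.DataFrame({
    "functionality": ["nutrition advice", "nutrition advice", "app support",
                      "hypertension support", "hypertension support", "app support"],
    "groundedness_pass": [1, 0, 1, 1, 0, 1],
    "style_pass": [1, 1, 1, 0, 1, 1],
})

# Pass rates per segment show where prompt iteration is likely to pay off most
segment_scores = results.groupby("functionality")[["groundedness_pass", "style_pass"]].mean()
print(segment_scores.sort_values("groundedness_pass"))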

2.6 When is it ready for production?

In AI applications, and healthcare specifically, some failures are more consequential than others. We never want our chatbot to claim that it is a healthcare professional, for example. But it is inevitable that our meal scanner will make mistakes identifying ingredients in uploaded images; humans are not particularly good at identifying ingredients from a photograph, and so even human-level accuracy can involve frequent mistakes. It's therefore important to work with the SMEs and product stakeholders to develop realistic thresholds for the optimizing metrics, above which the development work can be declared successful to enable migration into production. Some projects may fail at this stage because it's impossible to push the optimizing metrics above the agreed threshold without compromising the satisficing metrics, or because of resource constraints.

If the thresholds are very high then missing them slightly might be acceptable due to unavoidable error or ambiguity in the LLM judge. For example, we initially set a ship requirement of 100% of our development-set health record summaries being graded as "accurately grounded." We then found that the LLM judge would occasionally quibble over statements like "the patient has recorded their blood pressure on most days of the last week" when the actual number of days with recordings was four. In our judgement, a reasonable end-user wouldn't find this statement troubling, despite the LLM-as-judge classifying it as a failure. Thorough manual review of failure cases is vital to identify whether the performance is actually acceptable and/or whether further iteration is required.

These go/no-go decisions also align with the NIST AI Risk Management Framework, which encourages context-driven risk thresholds and emphasizes traceability, validity, and stakeholder-aligned governance throughout the AI lifecycle.

Even with a temperature of zero, LLM judges are non-deterministic. A reliable judge should give the same determination and roughly the same critique each time it is run on a given example. If this isn't happening, it suggests that the judge prompt needs to be improved. We found this issue to be particularly problematic in chatbot testing with the AWS Agent Evaluation framework, where each conversation to be graded has a custom rubric and the LLM generating the input conversations has some leeway on the exact wording of the "user messages". We therefore wrote a simple script to run each test multiple times and record the average failure rate. Tests with a failure rate that isn't 0 or 100% can be marked as unreliable and updated until they become consistent. This experience highlights the limitations of LLM judges and automated evaluation more broadly. It reinforces the importance of incorporating human review and feedback before declaring a system ready for production. Clear documentation of performance thresholds, test results, and review decisions supports transparency, accountability, and informed deployment.
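The consistency check itself can be very simple. The sketch below assumes a hypothetical run_test helper that wraps the evaluation framework, runs one named test case end-to-end and returns whether it passed; everything else is plain Python.

def run_test(test_name: str) -> bool:
    # Hypothetical wrapper around the evaluation framework; replace the
    # placeholder below with a real call into the test harness
    return True

def measure_consistency(test_names: list[str], n_repeats: int = 5) -> dict[str, float]:
    # Run each test several times and record its pass rate
    pass_rates = {}
    for name in test_names:
        passes = sum(run_test(name) for _ in range(n_repeats))
        pass_rates[name] = passes / n_repeats
    return pass_rates

# Tests with a pass rate strictly between 0 and 1 are flagged as unreliable
rates = measure_consistency(["diagnosis_refusal", "jailbreak_attempt"])
flaky = {name: rate for name, rate in rates.items() if 0.0 < rate < 1.0}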

In addition to performance thresholds, it's important to evaluate the system against known security vulnerabilities. The OWASP Top 10 for LLM Applications outlines common risks such as prompt injection, insecure output handling, and over-reliance on LLMs in high-stakes decisions, all of which are highly relevant for healthcare use cases. Evaluating the system against this guidance can help mitigate downstream risks as the product moves into production.

3.0 Post-deployment: Monitoring and classification

Moving an LLM application from development to deployment in a scalable, sustainable and reproducible way is a complex undertaking and the subject of excellent "LLMOps" articles like this one. Having a process like this, which operationalizes each stage of the data pipeline, is very useful for evaluation-driven development because it allows new iterations to be deployed quickly. However, in this section we'll focus mainly on how to actually use the logs generated by an LLM application running in production to understand how it's performing and inform further development.

One major goal of monitoring is to validate that the evaluation metrics defined in the development phase behave similarly on production data, which is essentially a test of the representativeness of the development dataset. This should ideally first be done internally in a dogfooding or "bug bashing" exercise, with involvement from unrelated teams and SMEs. We can re-use the metric definitions and LLM judges built in development here, running them on samples of production data at periodic intervals and maintaining a breakdown of the results. For data security at Nuna, all of this is done within Databricks, which allows us to take advantage of Unity Catalog for lineage tracking and dashboarding tools for easy visualization.

Monitoring of LLM-powered products is a broad topic, and our focus here is on how it can be used to close the evaluation-driven development loop so that the models can be improved and adjusted for drift. Monitoring should also be used to track broader "product success" metrics such as user-provided feedback, user engagement, token usage, and chatbot query resolution. This excellent article contains more details, and LLM judges can also be deployed in this capacity; they would go through the same development process described in Section 2.4.

This approach aligns with the NIST AI Risk Management Framework ("AI RMF"), which emphasizes continuous monitoring, measurement, and documentation to manage AI risk over time. In production, where ambiguity and edge cases are more common, automated evaluation alone is often insufficient. Incorporating structured human feedback, domain expertise, and transparent decision-making is essential for building trustworthy systems, especially in high-stakes domains like healthcare. These practices support the AI RMF's core principles of governability, validity, reliability, and transparency.

Figure 4: High-level workflow showing components of the post-deployment data pipeline that allows for monitoring, alerting, tagging and analysis of the model outputs in production. This is essential for evaluation-driven development, since insights can be fed back into the development stage. Image created by the author.

3.1 Additional LLM classification

The concept of the LLM judge can be extended to post-deployment classification, assigning tags to model outputs and giving insights about how applications are being used "in the wild", highlighting unexpected interactions and alerting on concerning behaviors. Tagging is the process of assigning simple labels to data so that it is easier to segment and analyze. This is particularly useful for chatbot applications: if users on a certain Nuna app version start asking our chatbot questions about our blood pressure cuff, for example, this may point to a cuff setup problem. Similarly, if certain types of medication container lead to higher-than-average failure rates from our medication scanning tool, this suggests the need to investigate and possibly update that tool.

In practice, LLM classification is itself a development project of the kind described in Section 2. We need to build a tag taxonomy (i.e. a description of every tag that could be assigned) and prompts with instructions about how to use it, and then we need a development set to validate tagging accuracy. Tagging often involves generating consistently formatted output to be ingested by a downstream process, for example a list of topic IDs for each chatbot conversation segment, which is why enforcing structured output on the LLM calls can be very helpful here; Databricks has an example of how this can be done at scale.
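As one way to make the expected structure explicit, the sketch below defines a tag schema with Pydantic (v2 assumed) and validates a raw JSON response against it. The taxonomy, field names and example response are illustrative, and the schema can be passed to whichever structured-output mechanism your model provider supports.

from typing import Literal
from pydantic import BaseModel, Field

# Illustrative taxonomy; a real one is developed and validated with SMEs
TopicTag = Literal["emotional support", "hypertension support", "nutrition advice",
                   "app support", "cuff setup", "other"]

class ConversationTags(BaseModel):
    # Schema the tagging LLM is instructed (or constrained) to follow
    topics: list[TopicTag] = Field(description="Topics discussed in the conversation segment")
    contains_concerning_content: bool
    rationale: str

# Hypothetical raw JSON returned by an LLM call with structured output enforced
raw_response = '{"topics": ["nutrition advice", "other"], "contains_concerning_content": false, "rationale": "User asked about sodium intake and a billing question."}'

tags = ConversationTags.model_validate_json(raw_response)
print(tags.topics)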

For long chatbot transcripts, LLM classification can be adapted for summarization to improve readability and protect privacy. Conversation summaries can then be vectorized, clustered and visualized to gain an understanding of groups that naturally emerge from the data. This is often the first step in designing a topic classification taxonomy such as the one Nuna uses to tag our chats. Anthropic has also built an internal tool for similar purposes, which reveals fascinating insights into usage patterns of Claude and is described in their Clio research article.
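A toy version of this vectorize-and-cluster step is sketched below; TF-IDF is used as a simple stand-in for an embedding model, the summaries are invented, and in practice the vectorizer and number of clusters would be chosen much more carefully.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical de-identified conversation summaries
summaries = [
    "User asked how to pair the blood pressure cuff with the app",
    "User wanted low-sodium dinner ideas for the week",
    "User felt anxious about a recent blood pressure reading",
    "User couldn't find where to log a missed medication dose",
]

vectors = TfidfVectorizer().fit_transform(summaries)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Inspecting clusters like these is a starting point for a topic taxonomy
print(pd.DataFrame({"summary": summaries, "cluster": labels}))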

Depending on the urgency of the information, tagging can occur in real time or as a batch process. Tagging that looks for concerning behavior, for example flagging chats for immediate review if they describe violence, illegal activities or severe health issues, might be best suited to a real-time system where notifications are sent as soon as conversations are tagged. More general summarization and classification can probably happen as a batch process that updates a dashboard, possibly only on a subset of the data to reduce costs. For chat classification, we found that including an "other" tag for the LLM to assign to examples that don't fit neatly into the taxonomy is very useful. Data tagged as "other" can then be examined in more detail for new topics to add to the taxonomy.

3.2 Updating the development set

Monitoring and tagging grant visibility into application performance, but they are also part of the feedback loop that drives evaluation-driven development. As new or unexpected examples come in and are tagged, they can be added to the development dataset, reviewed by the SMEs and run through the LLM judges. It's possible that the judge prompts or few-shot examples may have to evolve to accommodate this new information, but the tracking steps outlined in Section 2.5 should enable progress without the risk of confusing or unintended overwrites. This completes the feedback loop of evaluation-driven development and enables confidence in LLM products not only when they ship, but also as they evolve over time.

4.0 Summary 

The rapid evolution of large language models (LLMs) is transforming industries and offers great potential to benefit healthcare. However, the non-deterministic nature of AI presents unique challenges, particularly in ensuring reliability and safety in healthcare applications.

At Nuna, Inc., we're embracing evaluation-driven development to address these challenges and guide our approach to AI products. In short, the idea is to emphasize evaluation and iteration throughout the product lifecycle, from development to deployment and monitoring.

Our methodology involves close collaboration with subject matter experts to create representative datasets and define success criteria. We focus on iterative improvement through prompt engineering, supported by tools like MLflow and Databricks, to track and refine our models.

Post-deployment, continuous monitoring and LLM tagging provide insights into real-world application performance, enabling us to adapt and improve our systems over time. This feedback loop is essential for maintaining high standards and ensuring AI products continue to align with our goals of improving lives and decreasing the cost of care.

In summary, evaluation-driven development is essential for building reliable, impactful AI solutions in healthcare and elsewhere. By sharing our insights and experiences, we hope to help others navigate the complexities of LLM deployment and contribute to the broader goal of improving the efficiency of AI product development in healthcare.

References 

[1] Boston Consulting Group, Digital and AI Solutions to Reshape Health Care (2025), https://www.bcg.com/publications/2025/digital-ai-solutions-reshape-health-care-2025

[2] Centers for Disease Control and Prevention, High Blood Pressure Facts (2022), https://www.cdc.gov/high-blood-pressure/data-research/facts-stats/index.html

[3] Centers for Disease Control and Prevention, Diabetes Data and Research (2022), https://www.cdc.gov/diabetes/php/data-research/index.html

[4] R.K. Arora, et al. HealthBench: Evaluating Large Language Models Towards Improved Human Health (2025), OpenAI

Authorship

This article was written by Robert Martin-Short, with contributions from the Nuna team: Kate Niehaus, Michael Stephenson, Jacob Miller & Pat Alberts
