Validating AI systems requires benchmarks—datasets and evaluation workflows that mimic real-world conditions—to measure accuracy, reliability, and safety before deployment. Without them, you’re guessing.
But in regulated domains such as healthcare, finance, and government, data scarcity and privacy constraints make building benchmarks incredibly difficult. Real-world data is locked behind confidentiality agreements, fragmented across silos, or prohibitively expensive to annotate. The result? Innovation stalls, and evaluation becomes guesswork. For instance, government agencies deploying AI assistants for citizen services, like tax filing, benefits, or permit applications, need robust evaluation benchmarks without exposing personally identifiable information (PII) from real citizen records.
This blog introduces an AI-driven, privacy-preserving evaluation workflow that can be applied across industries to benchmark LLMs safely and efficiently. We’ll use a healthcare example to illustrate the method, but the same approach works for any domain where data privacy is critical. You’ll learn how to generate domain-specific synthetic datasets in minutes using NVIDIA NeMo Data Designer and build reproducible benchmarks with NVIDIA NeMo Evaluator, without exposing a single real record.
Quick links to the model and code
What you’ll end up with: a privacy-preserving data-evaluation pipeline
This blog demonstrates how to build a privacy-preserving evaluation workflow where sensitive data must be protected.
You’ll learn how to:
- Generate realistic, privacy-safe triage notes based on structured prompts and domain constraints.
- Score and filter synthetic data for quality.
- Evaluate large language model (LLM) predictions using automated benchmarks across multiple GPUs.
To illustrate the method, we’ll use one real-world example: predicting an Emergency Severity Index (ESI) level for ER triage notes, without exposing a single patient record.
Example: synthetic data for emergency room triage prediction
Emergency departments operate under intense pressure. Every second counts, and accurate triage determines whether a patient gets immediate care or waits. AI can assist by predicting ESI levels from clinical notes, enabling faster prioritization and reducing clinician workload (see Figure 1). But building such a system isn’t straightforward. Limitations include:
- Data access: Real triage notes are confidential and protected under the Health Insurance Portability and Accountability Act (HIPAA) and other privacy regulations. Hospitals cannot simply share patient records for model training or evaluation. Even when limited datasets are available, they’re often incomplete, inconsistent, or locked behind institutional agreements.
- Annotation cost: Labeling thousands of notes with ESI levels requires clinical expertise. Manual annotation is slow, expensive, and prone to variability. For many developers, this step alone can stall a project for months.
- Data scarcity: Rare conditions and edge cases are underrepresented in real-world datasets, making it hard to build models that generalize. Without enough examples, models risk bias and brittle performance—unacceptable in life-critical environments like emergency care.
These challenges create a paradox: AI could transform triage, but the very data needed to build and validate these systems is inaccessible. This is where synthetic data and automated evaluation workflows come in.


Why synthetic data matters
Synthetic data has rapidly matured into an essential resource for building reliable AI systems. Unlike real-world data, which is limited to what has happened, synthetic data allows you to generate what could happen, covering rare edge cases and diverse scenarios while strictly complying with privacy regulations. It offers high-quality, domain-specific examples created with guidance from subject matter experts and advanced compound AI systems. In industries where privacy, compliance, and data scarcity limit access to real-world data, synthetic data provides a breakthrough, enabling teams to train and validate models safely and efficiently.
Unlike traditional data collection, the process of generating synthetic data dramatically accelerates development timelines: developers can now create or update datasets and benchmarks in hours rather than months, supporting faster innovation and more responsive AI solutions.


How to get started
Step 1: Generate synthetic data with NeMo Data Designer
Instead of waiting for real-world notes, we used NeMo Data Designer to create thousands of synthetic nurse triage notes paired with ground-truth ESI labels. To ensure realism, we defined structured prompts and constraints that mimic authentic clinical language and edge cases. Before moving forward, we validated the generated data for consistent terminology and plausible vitals to avoid introducing bias.
Key features:
- Structured prompts and domain constraints for realism
- LLM-as-a-judge scoring to filter high-quality examples
- Hugging Face-compatible dataset upload for easy integration
First, we initialize the NeMo Data Designer client and model configurations. Here, we establish a connection to the NeMo Data Designer service and define the LLMs that will perform the work. By default, the microservice is configured to use build.nvidia.com as the model provider. This lets you choose from a wide selection of NVIDIA Nemotron open models, optimized and packaged as NVIDIA NIM.
from nemo_microservices.data_designer.essentials import *
# 1. Connect to the Client
data_designer_client = NeMoDataDesignerClient(base_url="http://localhost:8080")

# 2. Define Model Configs
# You can find available Model IDs at build.nvidia.com
# Configuration for the "Generator", e.g. nvidia/nvidia-nemotron-nano-9b-v2
generator_config = ModelConfig(
    provider="nvidiabuild",
    alias="content_generator",
    model="",
    inference_parameters=InferenceParameters(temperature=0.7, max_tokens=8000)
)

# The "Judge" evaluates quality, e.g. openai/gpt-oss-120b
judge_config = ModelConfig(
    provider="nvidiabuild",
    alias="judge",
    model="",
    inference_parameters=InferenceParameters(temperature=0.1, max_tokens=4096)
)

# 3. Initialize the Builder
config_builder = DataDesignerConfigBuilder(model_configs=[generator_config, judge_config])
Next, we define the “seed” data that will be used to generate the synthetic triage notes. We use samplers to create random attributes, such as the ESI level, specific clinical scenarios, patient details, and the writing style of the note, that will later be injected into the LLM prompt.
config_builder.add_column(
    SamplerColumnConfig(
        name="record_id",
        sampler_type=SamplerType.UUID,
        params={"short_form": True, "uppercase": True}
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="esi_level_description",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["ESI 1: Resuscitation", "ESI 2: Emergency", ...]),
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="clinical_scenario",
        sampler_type=SamplerType.SUBCATEGORY,
        params=SubcategorySamplerParams(
            category="esi_level_description",
            values={
                "ESI 1: Resuscitation": ["Cardiac arrest", "Severe respiratory distress", ...],
                "ESI 2: Emergency": ["Chest pain", "Stroke symptoms", ...],
                # ... define lists for ESI 3, 4, and 5
            },
        ),
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="patient",
        sampler_type=SamplerType.PERSON,
        params=PersonSamplerParams(age_range=[18, 70]),
    )
)

config_builder.add_column(
    SamplerColumnConfig(
        name="writing_style",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(values=["Draft", "Adequate", "Polished"]),
    )
)
We use an LLMTextColumn to generate the actual triage note. By using a structured prompt with Jinja templating, we not only inject sampled values (like age and scenario) but also enforce strict formatting constraints (such as “CC:” and “HPI:”). This ensures the model adopts the telegraphic style of a busy nurse rather than writing a generic description.
# Generate the realistic triage note
config_builder.add_column(
    LLMTextColumnConfig(
        name="content",
        model_alias="content_generator",
        prompt=(
            "You are an experienced triage nurse. Write a realistic triage note. "
            "The note is for a {{ patient.age }} y/o {{ patient.sex }}. "
            "Triage classification: '{{ esi_level_description }}'. "
            "Reason for visit: '{{ clinical_scenario }}'. "
            "Desired writing style: '{{ writing_style }}'. "
            "Structure the note with 'CC:' and 'HPI:'. "
            "Respond with ONLY the note text."
        ),
    )
)
To ensure quality, we immediately grade the generated data. We add an LLMJudgeColumn that evaluates the note for “clinical coherence” and “complexity.” This allows us to filter out hallucinations or overly simple examples later.
# Define a rubric for Clinical Coherence
clinical_coherence_rubric = Rating(
    name="Clinical Coherence",
    description="Evaluates if clinical details align with the ESI level.",
    options={
        "5": "Perfect alignment; clinically plausible.",
        "1": "Clinically incoherent.",
        # ... intermediate scores
    }
)

# Define a rubric for Complexity
esi_level_complexity_rubric = Rating(
    name="ESI Level Complexity",
    description="Evaluates difficulty to infer the ESI level from the note.",
    options={
        "Complex": "Note contains subtle or conflicting information.",
        "Moderate": "Note requires some clinical inference.",
        "Simple": "Note uses clear indicators that make the ESI level obvious."
    }
)

# Add the Judge Column
config_builder.add_column(
    LLMJudgeColumnConfig(
        name="triage_note_quality",
        model_alias="judge",
        prompt="You are an expert ER physician. Evaluate this triage note...",
        scores=[clinical_coherence_rubric, esi_level_complexity_rubric],
    )
)
Now that our SDG workflow is defined, we run a small preview first to check the synthetically generated data.
# Generate 10 examples to confirm configuration
preview = data_designer_client.preview(config_builder, num_records=10)
preview.display_sample_record()
Once satisfied, we can launch the full generation job (e.g., 100 or 1,000 records).
# Submit batch job
job_results = data_designer_client.create(config_builder, num_records=100)
job_results.wait_until_done()
dataset = job_results.load_dataset()
This approach allows developers to scale from hundreds to thousands of labeled examples in minutes, without exposing any real patient data.
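The judge scores are what make quality filtering practical: we can keep only notes the judge rated clinically coherent and split the rest by difficulty for the evaluation step. Below is a minimal sketch of that filtering, assuming the dataset loads as a pandas DataFrame and that the rubric scores surface as columns named after the rubrics; the column names and the score threshold are assumptions you would adjust to the actual generated schema. This corresponds to the filtering logic referenced in the upload code in Step 2.
# Hypothetical filtering step: 'dataset' (from job_results.load_dataset()) is assumed to be
# a pandas DataFrame, and the rubric scores are assumed to appear as the columns below.
# Adjust both to match the schema your generation job actually produces.
COHERENCE_COL = "Clinical Coherence"       # assumed judge-score column name
COMPLEXITY_COL = "ESI Level Complexity"    # assumed judge-score column name

# Keep only notes the judge rated clinically coherent (score of 4 or 5)
filtered = dataset[dataset[COHERENCE_COL].astype(int) >= 4].copy()

# Split the filtered data by complexity so each level can be benchmarked separately in Step 2
df_complexities = {
    level.lower(): group
    for level, group in filtered.groupby(COMPLEXITY_COL)
}

for level, df in df_complexities.items():
    print(f"{level}: {len(df)} records")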
Step 2: Evaluate model performance with NVIDIA NeMo Evaluator
Once we have our synthetic dataset, we use NeMo Evaluator to benchmark an LLM’s predictions against the ground truth. NeMo Evaluator provides a unified API that automates running standardized tests and custom benchmarks for speed and reproducibility. For this workflow, we apply a custom accuracy metric, implemented as a string check, to validate whether the model output contains the correct label. We integrate this evaluation into a CI/CD pipeline so every model update triggers automated checks, ensuring continuous validation rather than one-off testing.
Before running the evaluation, we need to upload our filtered synthetic dataset to a datastore (like Hugging Face) that the Evaluator service can access. We split the data by complexity level to see how models perform on harder tasks.
from huggingface_hub import HfApi

hf_api = HfApi()  # client used to upload the dataset files

# ... filtering logic to split the dataset by complexity (Simple, Moderate, Complex) ...

# Loop through complexity levels and upload to Hugging Face
for level, df in df_complexities.items():
    repo_id = f"triage-eval/nurse-triage-notes-{level}"
    file_name = f"dataset_{level}.jsonl"

    # Save to JSONL and upload
    df.to_json(file_name, orient="records", lines=True)
    hf_api.upload_file(
        path_or_fileobj=file_name,
        path_in_repo=file_name,
        repo_id=repo_id,
        repo_type="dataset",
        # ...
    )
    print(f"Uploaded dataset for complexity: {level}")
We define a configuration object that tells the Evaluator what to do. Here, we specify a custom evaluation type using a completion task. We provide a prompt template that asks the model to act as an expert triage nurse and output only the ESI level.
Crucially, we define the metric as a string-check, which verifies that the model’s output contains the correct ground-truth label (e.g., “ESI 2: Emergency”).
EVALUATOR_CONFIG = {
    "eval_config": {
        "type": "custom",
        "tasks": {
            "triage_classification": {
                "type": "completion",
                "params": {
                    "template": {
                        "messages": [
                            {"role": "system", "content": "You are an expert ER triage nurse..."},
                            {"role": "user", "content": "Triage Note: {{item.content}}..."}
                        ],
                    }
                },
                # Define success metric: Does the output contain the ground truth?
                "metrics": {
                    "accuracy": {
                        "type": "string-check",
                        "params": {
                            "check": ["{{sample.output_text}}", "contains", "{{item.esi_level_description}}"]
                        }
                    }
                },
                "dataset": {"files_url": None}  # Placeholder, filled dynamically later
            }
        }
    },
    # ... target_config for the model endpoint
}
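For reference, here is a minimal sketch of what the elided target_config section might contain. The only structure taken from the source is the config['target_config']['model']['api_endpoint']['url'] path that the evaluation loop below updates; the remaining keys are placeholders and assumptions rather than the documented NeMo Evaluator schema, so consult the service docs for the exact fields.
# Hypothetical target_config skeleton; only the nested 'url' path is taken from the loop
# below. Everything else is a placeholder to be replaced with your endpoint details.
EVALUATOR_CONFIG["target_config"] = {
    "model": {
        "api_endpoint": {
            "url": "",        # filled per model inside the evaluation loop
            "model_id": "",   # assumed field: the served model name
            # "api_key": "",  # assumed field: only if the endpoint requires authentication
        }
    },
}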
Finally, we iterate through our different models (e.g., Qwen and NVIDIA Nemotron) and our different complexity datasets. We submit a job for each combination to the NeMo Evaluator client and print the accuracy scores.
import copy

MODEL_SPECS = [
    {"name": "Qwen3-8B", "model_id": "Qwen/Qwen3-8B", ...},
    {"name": "Nemotron Nano 9B", "model_id": "nvidia/nvidia-nemotron-nano-9b-v2", ...}
]

# Run evaluation for every model on every complexity level
for complexity in ["simple", "moderate", "complex"]:
    for spec in MODEL_SPECS:
        # 1. Update config with specific model and dataset URL
        config = copy.deepcopy(EVALUATOR_CONFIG)
        config['eval_config']['tasks']['triage_classification']['dataset']['files_url'] = files_url_dict[complexity]
        config['target_config']['model']['api_endpoint']['url'] = spec['url']

        # 2. Submit Job
        job = client.evaluation.jobs.create(
            target=config['target_config'],
            config=config['eval_config']
        )

        # 3. Wait for results
        results = client.evaluation.jobs.results(job.id)
        accuracy = results.tasks['triage_classification'].metrics['accuracy'].value
        print(f"Model: {spec['name']} | Complexity: {complexity} | Accuracy: {accuracy:.2%}")
By structuring the evaluation this way, we move beyond a single, aggregate accuracy score and gain granular insights into model behavior. We can now pinpoint exactly where a model struggles; perhaps it handles “simple” triage notes perfectly but hallucinates details in “complex” scenarios.
This automated loop transforms evaluation from a manual, one-off event into a continuous validation engine. Whether you are swapping out model architectures or tweaking prompt templates, this pipeline ensures that every change is rigorously benchmarked against ground-truth data, providing the confidence needed to deploy clinical AI agents into production.
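To make the continuous-validation idea concrete, here is a minimal sketch of a regression gate you could run at the end of a CI job. It assumes the accuracies printed in the loop above are collected into a scores dictionary keyed by (model name, complexity); the threshold value and the helper function are illustrative choices, not part of the NeMo Evaluator API.
import sys

# Hypothetical regression gate: 'scores' maps (model_name, complexity) -> accuracy,
# collected from the evaluation loop above; the threshold is an illustrative choice.
ACCURACY_THRESHOLD = 0.80

def check_regressions(scores: dict, threshold: float = ACCURACY_THRESHOLD) -> bool:
    """Return True if every model/complexity combination meets the threshold."""
    failures = {combo: acc for combo, acc in scores.items() if acc < threshold}
    for (model_name, complexity), acc in failures.items():
        print(f"REGRESSION: {model_name} on '{complexity}' notes scored {acc:.2%} "
              f"(threshold {threshold:.0%})")
    return not failures

if __name__ == "__main__":
    scores = {}  # populate from the loop, e.g. scores[(spec['name'], complexity)] = accuracy
    if not check_regressions(scores):
        sys.exit(1)  # non-zero exit fails the CI job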
Takeaways
Data scarcity and privacy regulations no longer have to be a bottleneck for innovation. As we demonstrated, you can now build robust, domain-specific evaluation benchmarks without ever exposing a single real patient or customer record.
By combining NeMo Data Designer for generation and NVIDIA NeMo Evaluator for validation, you can turn the slow, manual process of model benchmarking into a rapid, automated workflow. Get started with the notebook on GitHub.
Ready to dive deeper?
Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube. And visit our Nemotron developer page for all the essentials you need to get started with the most open, smartest-per-compute reasoning model.
