How to Build License-Compliant Synthetic Data Pipelines for AI Model Distillation



Specialized AI models are built to perform specific tasks or solve particular problems. But if you’ve ever tried to fine-tune or distill a domain-specific model, you’ve probably hit a few blockers, such as:

  • Not enough high-quality domain data, especially for proprietary or regulated use cases
  • Unclear licensing rules around synthetic data and distillation
  • High compute costs when a large model is overkill for targeted tasks
  • Slow iteration cycles that make it difficult to achieve production-level ROI

These challenges often prevent promising AI projects from progressing beyond the experimental phase.

This post walks you through how to remove all four of these blockers using a production-ready, license-safe synthetic data distillation pipeline.

The open source tools used in this walkthrough include OpenRouter, which simplifies model access, and distillable endpoints, which remove uncertainty around distillation eligibility. In parallel, NVIDIA NeMo Data Designer lets you define data generation pipelines as code, making datasets reproducible, scalable, inspectable, and simple to evolve as requirements change.

Together, these tools make model specialization accessible to any developer, not just teams with massive datasets or long legal reviews. The result is production-ready specialized models without compliance risk or unnecessary cost.

What you’ll build in this tutorial

This tutorial walks you through a complete, repeatable workflow for building a compliant synthetic data and distillation pipeline, even when real data is scarce or sensitive.

Specifically, you’ll learn how to:

  • Generate realistic, domain-specific product data and Q&A pairs using NeMo Data Designer, seeded from a small catalog and structured prompts
  • Control data diversity and structure using schema definitions, samplers, and templated prompts
  • Automatically score and filter synthetic data for quality with an LLM-as-a-judge rubric that measures answer completeness and accuracy
  • Produce a clean, license-safe dataset ready for downstream distillation or fine-tuning workflows through OpenRouter distillable endpoints

While this walkthrough uses a product Q&A example, the same pattern applies to enterprise search, support bots, internal tools, and other domain workloads.

You’ll generate synthetic data and question-answer pairs from a small seed catalog. The output is a structured dataset containing product names, descriptions, prices, and Q&A pairs. To see the full NeMo Data Designer: Product Information Dataset Generator with Q&A example, visit the NVIDIA/GenerativeAIExamples GitHub repo.

To ensure data quality, you’ll also apply an LLM-as-a-judge approach to automatically score and filter generated outputs. In production, you might use a separate evaluation model, but for simplicity, this walkthrough uses the same model for both generation and evaluation.

Flow diagram of a three-stage synthetic data pipeline: structured input seeds feed synthetic product and Q&A generation, followed by LLM-based accuracy and completeness evaluation that filters results into a final, license-compliant dataset.
Figure 1. End-to-end synthetic data generation and evaluation workflow

Building a synthetic product Q&A dataset

This section walks you through the steps involved in building a synthetic product Q&A dataset.

Initial setup 

First, install the NVIDIA Data Designer library:

pip install data-designer==0.4.0

Then import the required libraries:

import data_designer.config as dd
from data_designer.interface import DataDesigner

Next, create a model profile and initialize the Data Designer client:

# Enforce distillable text so generated outputs remain license-safe for training
model_provider = dd.ModelProvider(
    name = "deepinfra",
    endpoint = "https://openrouter.ai/api/v1/",
    provider_type = "openai",
    api_key = Open_Router_Api_Key,
    extra_body={
        "provider": {
            "enforce_distillable_text": True,
            # optionally, prefer DeepInfra endpoints
            "only": ["deepinfra"]
        }
    }
)

data_designer_client = DataDesigner(model_providers=[model_provider])

In this step, the NVIDIA Nemotron 3 Nano model is served through OpenRouter and routed to DeepInfra. Distillable enforcement is enabled to ensure all generated data is license-safe for downstream training and distillation.

Next, define generation model configurations and inference parameters:

model_alias="nemotron-3-nano-30b-a3b"

inference_parameters = dd.ChatCompletionInferenceParams(
    temperature=0.5,
    top_p=0.9,
    max_tokens=10000,
    max_parallel_requests=10,  # Number of concurrent workers
    extra_body={
        "reasoning": {"enabled": False}
    },
)

model_configs = [
    dd.ModelConfig(
        alias=model_alias,
        model="nvidia/nemotron-3-nano-30b-a3b",
        provider="deepinfra",
        inference_parameters=inference_parameters
        )
]

This walkthrough uses Nemotron 3 Nano for synthetic data generation. Nemotron 3 Nano is the latest NVIDIA hybrid Mamba MoE reasoning model, optimized for complex data structures and efficient scaling.

The pipeline builds synthetic Q&A knowledge in three layers: input seeds, generation, and evaluation.

Design the target dataset schema

Before writing any pipeline code, it’s important to define what the final dataset should look like. This determines which parts require LLM generation, which require sampling, and how everything fits together.

The goal here is to produce a structured, distillation-ready product Q&A dataset with the following characteristics:

  • Each row represents a single product example
  • Fields include both grounded product attributes and generated natural-language content
  • The dataset supports quality filtering before downstream training or distillation

At a high level, each record contains:

  • Seed attributes (category, price range, naming constraints)
  • Structured product metadata (name, features, description, price)
  • User-facing language (questions and answers)
  • Quality scores (accuracy and completeness)

This schema-first approach ensures the dataset is reproducible, inspectable, and aligned with downstream training requirements.
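To make the target concrete, here is a minimal sketch of what a single populated record might look like. The field names mirror the columns configured later in this walkthrough, and all values are purely illustrative:

# Illustrative shape of one finished record (all values are hypothetical)
example_record = {
    # Seed attributes
    "category": "Electronics",
    "price_tens_of_dollars": 12.3,
    "product_price": 123.0,
    "first_letter": "A",
    "is_hallucination": 0,
    # Structured product metadata
    "product_info": {
        "product_name": "Aurora Mini Projector",
        "key_features": ["Compact design", "Auto-focus lens"],
        "description": "A palm-sized projector with automatic keystone correction.",
        "price_usd": 123.0,
    },
    # User-facing language
    "query": "Does the Aurora Mini Projector support auto-focus?",
    "answer": "Yes, it includes an auto-focus lens with automatic keystone correction.",
    # Quality scores from the LLM judge
    "completeness_result": "Complete",
    "accuracy_result": "Accurate",
}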

Map the dataset schema to generation strategies

With the target dataset schema defined, the next step is to map each column to an appropriate generation strategy. Some fields require controlled randomness, others require structured LLM outputs, and others exist purely to evaluate quality. NVIDIA Data Designer provides a declarative way to express these decisions as code:

config_builder = dd.DataDesignerConfigBuilder(model_configs=model_configs)

Each column in the dataset falls into one of three categories:

  1. Seed and control columns, generated through sampling to ensure diversity
  2. Content columns, generated by LLMs using structured prompts
  3. Evaluation columns, used to score and filter output quality

Add sampler columns to control diversity

These sampled columns define the controllable dimensions of the dataset and ensure coverage across categories, prices, and naming patterns without relying on LLM randomness alone:

import string
from pydantic import BaseModel
from pydantic import Field

# Define product category options
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="category",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(
            values=[
                "Electronics",
                "Clothing",
                "Home Appliances",
                "Groceries",
                "Toiletries",
                "Sports Equipment",
                "Toys",
                "Books",
                "Pet Supplies",
                "Tools & Home Improvement",
                "Beauty",
                "Health & Wellness",
                "Outdoor Gear",
                "Automotive",
                "Jewelry",
                "Watches",
                "Office Supplies",
                "Gifts",
                "Arts & Crafts",
                "Baby & Kids",
                "Music",
                "Video Games",
                "Movies",
                "Software",
                "Tech Devices",
            ]
        ),
    )
)

# Define price range to seed realistic product types
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="price_tens_of_dollars",
        sampler_type=dd.SamplerType.UNIFORM,
        params=dd.UniformSamplerParams(low=1, high=200),
    )
)

# Derive the product price in dollars from the sampled value
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="product_price",
        expr="{{ (price_tens_of_dollars * 10) | round(2) }}",
        dtype="float",
    )
)

# Generate first letter for product name to ensure diversity
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="first_letter",
        sampler_type=dd.SamplerType.CATEGORY,
        params=dd.CategorySamplerParams(values=list(string.ascii_uppercase)),
    )
)

# Determine if this example will include hallucinated content
config_builder.add_column(
    dd.SamplerColumnConfig(
        name="is_hallucination",
        sampler_type=dd.SamplerType.BERNOULLI,
        params=dd.BernoulliSamplerParams(p=0.5),
    )
)          

Add LLM-generated columns

For columns that require natural language or structured semantic content, use LLM-backed generation with an explicit output schema. This ensures consistency across records and makes the dataset suitable for downstream training and evaluation.

When building the dataset, it’s important to recognize that LLM-generated columns don’t exist in isolation: they’re intentionally conditioned on earlier sampler and seed columns, which inject controlled diversity into the generation process.

When prompting the LLM, Jinja templating is used to reference values from other columns in the dataset, such as sampled categories, prices, or naming constraints. These inputs directly shape the LLM’s outputs, allowing diversity to be introduced systematically rather than relying on prompt randomness alone. Nested JSON fields can be accessed using dot notation, enabling structured outputs to flow naturally through the pipeline.

For example, the structured ProductInfo output is conditioned on sampled values like category, product_price, and naming constraints. This ensures that diversity introduced upstream propagates consistently through all LLM-generated fields.
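As a concrete illustration of dot notation, a later column could reference nested fields of the structured product_info output (defined in the next step) directly in its prompt. The column below is a hypothetical example and not part of the tutorial pipeline:

# Hypothetical column showing dot notation into the nested product_info output
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="tagline",
        model_alias=model_alias,
        prompt=(
            "Write a one-sentence marketing tagline for "
            "{{ product_info.product_name }}, which sells for "
            "${{ product_info.price_usd }} in the {{ category }} category."
        ),
    )
)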

# Define product information structure
class ProductInfo(BaseModel):
    product_name: str = Field(
        ..., description="A practical product name for the market."
    )
    key_features: list[str] = Field(
        ..., min_length=1, max_length=3, description="Key product features."
    )
    description: str = Field(
        ...,
        description="A brief, engaging description of what the product does, highlighting a novel but believable feature.",
    )
    price_usd: float = Field(..., description="The stated price in USD.")


# Generate product information
config_builder.add_column(
    dd.LLMStructuredColumnConfig(
        name="product_info",
        model_alias=model_alias,
        prompt=(
            "Generate a practical product description for a product within the {{ category }} "
            "category that costs {{ product_price }}.n"
            "The name of the product MUST start with the letter {{ first_letter }}.n"
        ),
        output_format=ProductInfo,
    )
)

# Generate user questions about the product
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="query",
        model_alias=model_alias,
        prompt=("Ask an issue in regards to the following product:nn {{ product_info }}"),
    )
)


# Generate answers to the questions
config_builder.add_column(
    dd.LLMTextColumnConfig(
        name="answer",
        model_alias=model_alias,
        prompt=(
            "{%- if is_hallucination == 0 -%}n"
            "n"
            "{{ product_info }}n"
            "n"
            "{%- endif -%}n"
            "User Query: {{ query }}n"
            "Directly and succinctly answer the user's query.n"
            "{%- if is_hallucination == 1 -%}n"
            "Make up whatever information it's worthwhile to with a purpose to answer the user's request.n"
            "{%- endif -%}"
        ),
    )
)

Quality assessment with LLM-as-a-judge

LLM-as-a-judge is used to ensure data quality. Clear evaluation rubrics allow generated answers to be scored for completeness and accuracy before downstream use.

# Define evaluation rubrics for answer quality
CompletenessRubric = dd.Rating(
    name="Completeness",
    description="Evaluation of AI assistant's thoroughness in addressing all facets of the user's query.",
    options={
        "Complete": "The response thoroughly covers all key points requested within the query, providing sufficient detail to satisfy the user's information needs.",
        "PartiallyComplete": "The response addresses the core query but omits certain vital details or fails to elaborate on relevant facets that were requested.",
        "Incomplete": "The response significantly lacks mandatory information, missing major components of what was asked and leaving the query largely unanswered.",
    },
)

AccuracyRubric = dd.Rating(
    name="Accuracy",
    description="Evaluation of how factually correct the AI assistant's response is relative to the product information.",
    options={
        "Accurate": "The data provided aligns perfectly with the product specifications without introducing any misleading or incorrect details.",
        "PartiallyAccurate": "While some information is appropriately stated, the response incorporates minor factual errors or potentially misleading statements in regards to the product.",
        "Inaccurate": "The response presents significantly incorrect information in regards to the product, with claims that contradict the actual product details.",
    },
)


# Evaluate answer quality
config_builder.add_column(
    dd.LLMJudgeColumnConfig(
        name="llm_answer_metrics",
        model_alias=model_alias,
        prompt=(
            "n"
            "{{ product_info }}n"
            "n"
            "User Query: {{query }}n"
            "AI Assistant Answer: {{ answer }}n"
            "Judge the AI assistant's response to the user's query in regards to the product described in ."
        ),
        scores=[CompletenessRubric, AccuracyRubric],
    )
)


# Extract metric scores for easier evaluation
config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="completeness_result",
        expr="{{ llm_answer_metrics.Completeness.rating }}",
    )
)

config_builder.add_column(
    dd.ExpressionColumnConfig(
        name="accuracy_result",
        expr="{{ llm_answer_metrics.Accuracy.rating }}",
    )
)

Preview the dataset

To inspect the dataset before scaling, generate a small preview and load the results into a pandas DataFrame:

preview = data_designer_client.preview(config_builder)

# Display one record
preview.display_sample_record()

Table 1 lists example synthetic product Q&A records showing input seed attributes (category, price, hallucination flag), LLM-generated details and Q&A, and LLM-as-a-judge quality scores for accuracy and completeness.

Field name: Value / generated content
Category (seed): Clothing
Start letter (seed): D
Hallucination flag: 1 (creative mode enabled)
Product name: Driftwood Luxe Cashmere Mix Sweater
Product price: $545.57
User query: What makes the Driftwood Luxe Cashmere Mix Sweater uniquely suited to both urban sophistication and outdoor adventures…?
AI answer: The sweater combines ethically sourced cashmere with merino wool and recycled nylon… its water‑repellent finish and articulated seam construction give it the performance needed for hiking and skiing…
Accuracy rating: ⚠️ Partially Accurate
Accuracy reasoning: The answer correctly describes the sweater’s luxury ethos but fabricates material components (merino wool, recycled nylon) and overstates performance claims (hiking, skiing) not present in the provided product info.
Completeness rating: ⚠️ Partially Complete
Completeness reasoning: The response addresses urban sophistication and ethical sourcing but introduces unmentioned materials and omits the specific “hidden interior pockets” mentioned in the product source.
Table 1. Example synthetic product Q&A records

Scale up data generation

Once the schema and quality checks look good, generate a larger dataset by increasing the number of records:

job_results = data_designer_client.create(config_builder, num_records=100)
dataset = job_results.load_dataset()
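
With the full dataset in hand, the judge columns defined earlier can be used to keep only high-quality rows before distillation. The snippet below is a minimal filtering sketch, assuming the loaded dataset is a pandas DataFrame (as implied by the save step that follows) and that the ratings surface as the option names defined in the rubrics:

# Keep only rows the LLM judge rated both accurate and complete
high_quality = dataset[
    (dataset["accuracy_result"] == "Accurate")
    & (dataset["completeness_result"] == "Complete")
]

print(f"Kept {len(high_quality)} of {len(dataset)} records after quality filtering")

Because roughly half of the examples are generated with the hallucination flag enabled, this filter also acts as a quick check that the judge penalizes fabricated details.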

Save the results

Finally, save the generated dataset (a pandas DataFrame) to a CSV file for downstream training, evaluation, or distillation workflows:

from pathlib import Path

Folder_Name = "data-designer-tutorial-output"
File_Name = "dataset_OR.csv"

TUTORIAL_OUTPUT_PATH = Path(Folder_Name)
TUTORIAL_OUTPUT_PATH.mkdir(parents=True, exist_ok=True)

dataset.to_csv(TUTORIAL_OUTPUT_PATH / File_Name, index=False)

Workflow advantages

By combining OpenRouter with NVIDIA open source tooling, developers unlock a faster, safer path to model specialization:

  • Built-in compliance: License-safe synthetic data generation using distillable endpoints
  • High-quality domain data, fast: Rapid creation of structured, domain-specific datasets with NeMo Data Designer, shortening customization cycles for enterprise-ready, task-specific models

This workflow lets you bypass generic LLMs and build specialized models that understand domain rules, interpret high-level goals, and support complex workflows.

Get started with distillation-ready synthetic datasets

This tutorial focused on how to design and generate a distillation-ready synthetic dataset. To get started, and to take the resulting data into the next stages of model training, distillation, and deployment, check out the following resources:

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube. Visit the Nemotron developer page for everything you need to get started with the most open, smartest-per-compute reasoning models available.


