Large language models (LLMs) in quantitative finance are increasingly being used for alpha generation, automated report evaluation, and risk prediction. Yet adoption is constrained by cost, latency, and integration complexity. In financial markets, where alpha signals emerge from rapidly evolving data, the ability to continuously fine-tune, distill, and deploy models from proprietary and real-world sources is crucial.
This example shows how NVIDIA technology enables continuous model fine-tuning and distillation that integrates directly into financial workflows. Researchers can systematically optimize, compress, and deploy high-performing models with direct connectivity to backtesting and strategy evaluation processes.
The AI Model Distillation for Financial Data developer example is designed for quantitative researchers, AI developers, and enterprise data scientists. Through the flywheel, we operate over a financial newsfeed dataset to generate features from unstructured data that can be used for alpha research and risk prediction. The result is a set of smaller, domain-specific, and task-optimized models that maintain high accuracy while reducing computational overhead and deployment costs.
What’s AI Model Distillation for Financial Data?
Model distillation is the process of transferring knowledge from a large, high-performing teacher model to a smaller, more efficient student model. This enables faster inference, lower resource consumption, and deployment in edge or hybrid environments, while maintaining accuracy on domain-specific tasks.
What’s a developer example?
A developer example is a tested, reproducible reference architecture that combines best practices, software tools, and modular deployment patterns to speed up enterprise AI adoption. These end-to-end, customizable examples show how complex workflows such as domain adaptation, model compression, or agent orchestration can be developed and scaled using the NVIDIA AI Enterprise software stack. They bridge the gap between concept and production, pairing reference code with tested architectural guidance.
This developer example provides a practical framework for continuous domain adaptation and model distillation, creating smaller, high-performance models tailored to enterprise financial data. By combining NVIDIA NeMo, NVIDIA Nemotron, NVIDIA NIM, and Dockerized components, you can build a data flywheel for feature engineering, signal evaluation, and retraining. The architecture supports both on-premises and hybrid cloud deployment, ensuring flexibility and compliance with financial data governance standards.


This example distills the capabilities of a 49B or 70B parameter teacher into a smaller, customized student (1B, 3B, or 8B in this example). We demonstrate this through a multi-class classification problem: we use the teacher to generate labels for our dataset and then use the labeled dataset to customize our student models.
The developer example enables teams to:
- Distill large LLMs into efficient, domain-specific versions suited to financial text, reducing latency and inference costs while maintaining accuracy targets.
- Speed up backtesting and strategy evaluation by enabling rapid iteration and evaluation of trading signals, while maintaining model accuracy as market conditions and data sources evolve.
- Ensure scalability and observability by facilitating model evaluation with built-in experiment tracking.
- Deploy distilled models alongside existing NIM microservices into financial AI workflows across on-prem, hybrid cloud, and edge environments.
These capabilities enable the deployment of lightweight, specialized models directly into research pipelines, trading systems, or edge inference environments.
How does it work?
We provide a reusable recipe to experiment with and train these distilled models using the NVIDIA Data Flywheel Blueprint. At the heart of the blueprint is the flywheel orchestrator, a unified control plane that abstracts the complexity of interacting directly with NVIDIA NeMo microservices. Acting as the brain of the flywheel system, the orchestrator API coordinates the data flywheel job by leveraging a set of modular NeMo microservices:
- NVIDIA NeMo Customizer to handle lightweight LoRA-based fine-tuning
- NVIDIA NeMo Evaluator to automate evaluations across runs
- Datastore within NeMo to manage structured datasets and artifacts
- Deployment manager within NeMo to spin up and serve candidate distilled models dynamically for inference
Each microservice is packaged as a Docker container for consistent deployment across different environments. The workflow is orchestrated through Kubernetes integration, which ensures dynamic orchestration of NIM microservices for experimentation and production workloads.
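As a concrete starting point, the snippet below sketches how a flywheel job might be launched against the orchestrator API. The endpoint URL, route, and payload fields are assumptions for illustration; consult the blueprint repository for the exact API schema.

import requests

# Sketch: launch a data flywheel job through the orchestrator API.
# The URL, route, and payload fields are assumptions; check the blueprint's
# API documentation for the exact schema.
ORCHESTRATOR_URL = "http://localhost:8000"  # assumed local orchestrator endpoint

payload = {
    "workload_id": "news_classifier",  # matches the workload_id used at ingestion
    "client_id": "financial-news-v1",  # hypothetical dataset identifier
}

resp = requests.post(f"{ORCHESTRATOR_URL}/api/jobs", json=payload, timeout=30)
resp.raise_for_status()
print("Launched flywheel job:", resp.json().get("id"))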


Prerequisites for NeMo Microservices
To get the developer example up and running, you’ll first need to set up your environment and deploy the required services. Detailed instructions can be found on GitHub.
Once the environment is prepared, you’ll configure your models and workflows using a config.yaml file.
Note: This file loads when the flywheel server starts, and the settings remain static during a flywheel run. To update anything, you must stop the services, modify the YAML, and redeploy.
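As a quick sanity check before redeploying, you can load and inspect the file. This minimal sketch assumes the "nim" layout shown in the config snippets later in this post.

import yaml

# Sketch: inspect config.yaml before restarting the flywheel services.
# Assumes the "nim" layout shown in the config snippets later in this post.
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

for nim in cfg.get("nim", []):
    print(nim["model_name"], "customization enabled:", nim.get("customization_enabled", False))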
Unpacking the workflow
Next, we take a look at the developer example in action, showing each stage of the workflow with selected code snippets and experiment outputs. We show how different model configurations and dataset sizes influence performance, efficiency, and accuracy. By showcasing multiple experiment runs and distilled model comparisons, the walkthrough highlights how the developer example enables teams to iteratively refine models and achieve optimal trade-offs between cost, size, and precision.
Step 1: Dataset labeling
We use a sample dataset consisting of news headlines to demonstrate this workflow. Using the teacher model and a prompt with few-shot examples (supplied with our code), we generate labels for each headline in our dataset. The teacher is tasked with classifying the headlines into one of the thirteen described classes. For sanity checking and evaluating the baseline performance of the LLM, we include its performance against a subset of human-labeled samples from the dataset (~1k examples).
The following are three examples of financial news headlines, with their respective labels assigned by the teacher model:
[
  {
    "Headline": "Ultratech Achieves ISO 9001 and 14001 Certification for Singapore Operations and Recertification for U.S. Facility",
    "Classified Category": "Regulatory"
  },
  {
    "Headline": "Mid-Afternoon Market Update: Dow Up Over 200 Points; Lakeland Industries Shares Spike Higher",
    "Classified Category": "Stock price movement"
  },
  {
    "Headline": "Analyst: Chipotle Is Successful Because It Sticks To What Works (Giant, Tasty Burritos)",
    "Classified Category": "Analyst Rating"
  }
]
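Label generation itself goes through the teacher NIM’s OpenAI-compatible chat endpoint. The following is a minimal sketch of that call; the base URL is an assumed local deployment, and the system prompt stands in for the full few-shot prompt supplied with our code.

from openai import OpenAI

# Sketch: label a single headline with the teacher model through its
# OpenAI-compatible endpoint. The base URL is an assumed deployment, and the
# system prompt stands in for the full few-shot prompt shipped with the code.
client = OpenAI(base_url="http://nim-teacher:8000/v1", api_key="not-used")

SYSTEM_PROMPT = "You are a financial news classifier."

def classify(headline: str) -> str:
    resp = client.chat.completions.create(
        model="meta/llama-3.3-70b-instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Classify this headline: {headline}"},
        ],
        temperature=0.0,  # deterministic labels for distillation
    )
    return resp.choices[0].message.content.strip()

print(classify("Mid-Afternoon Market Update: Dow Up Over 200 Points"))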
We run the following steps using the Data Flywheel Blueprint.
Step 2: Dataset ingestion to flywheel server
Next, we ingest the dataset into an Elasticsearch index. The prompt and teacher model responses follow the OpenAI-compliant format, which the data flywheel server uses to run experiments.
"request": {
"model": "meta/llama-3.3-70b-instruct",
"messages": [
{
"role": "system",
"content": "You are a financial news classifier."
},
{
"role": "user",
"content": "USER PROMPT"
}
]
},
"response": {
"selections": [
{
"message": {
"role": "assistant",
"content": "[[[analyst rating]]]"
}
}
]
},
"workload_id": "news_classifier",
"client_id": "", # dataset identifier within the flywheel server
"timestamp": 1760845128 #timestamp when dataset was last updated
}
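For a rough picture of what ingestion looks like, the sketch below bulk-indexes records of this shape into Elasticsearch. The index name, file name, and connection settings are assumptions, and the blueprint ships its own ingestion utilities.

import json
from elasticsearch import Elasticsearch, helpers

# Sketch: bulk-index OpenAI-format request/response records into Elasticsearch.
# Index name, file name, and connection settings are assumptions; the blueprint
# provides its own ingestion path.
es = Elasticsearch("http://localhost:9200")

def load_records(path: str):
    with open(path) as f:
        for line in f:  # one JSON record per line
            yield {"_index": "flywheel-records", "_source": json.loads(line)}

ok, errors = helpers.bulk(es, load_records("labeled_headlines.jsonl"), raise_on_error=False)
print(f"Indexed {ok} records, {len(errors)} failures")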
Moreover, in this example, we show that the student model can be customized to match the teacher’s performance without requiring the full dataset. We split our dataset into smaller stratified subsets of the original dataset (5k, 10k, and 25k examples). The split sizes and the ratios for sampling from the multiple label classes, some of which occur less often than others, can be specified in the config.yaml file, as shown in our default example:
# Data split config:
# train, val, eval split sizes and ratios
data_split_config:
  eval_size: 100
  val_ratio: 0.1
  min_total_records: 50
  random_seed: 42
  limit: null # null = use all available records (ingress limit increased to 1GB)
  parse_function_arguments: true # parse function arguments to JSON objects for tool calling records
  stratify_enabled: true # Enable stratified splitting to maintain class balance
  min_samples_per_class: 2 # Minimum samples required per class for stratification
  rare_class_threshold: 1 # Group classes with <= this many samples as 'others'
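To make the stratification settings concrete, here is a minimal sketch of a stratified split with rare-class grouping, mirroring stratify_enabled and rare_class_threshold above. The real split logic lives inside the flywheel server; this only illustrates the idea.

from collections import Counter
from sklearn.model_selection import train_test_split

# Sketch: stratified train/validation split with rare-class grouping,
# mirroring stratify_enabled and rare_class_threshold above. Illustrative only;
# the actual implementation lives in the flywheel server.
def stratified_split(records, labels, val_ratio=0.1, rare_class_threshold=1, seed=42):
    counts = Counter(labels)
    # Group classes at or below the threshold into a single 'others' bucket
    # so stratification does not fail on very rare classes.
    strata = [l if counts[l] > rare_class_threshold else "others" for l in labels]
    return train_test_split(records, labels, test_size=val_ratio, stratify=strata, random_state=seed)

# train_records, val_records, train_labels, val_labels = stratified_split(headlines, teacher_labels)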
Next, using the flywheel server, we repeat the following steps to customize and evaluate the models for the different dataset sizes.
Step 3: Fine-tuning jobs
Using NeMo Customizer, supervised fine-tuning jobs are launched with LoRA adapters. Each job distills the knowledge from the dataset into the adapter to create smaller task-specific candidates. The student models for the distillation should be specified in the config.yaml file.
For example, to include the llama-3.2-1b-instruct model as one of the candidate students, we specify its model name and details following the naming conventions in the NeMo Microservices Model Catalog.
nim:
  - model_name: "meta/llama-3.2-1b-instruct"
    model_type: "llm"
    context_length: 8192
    gpus: 1
    pvc_size: 25Gi
    tag: "1.8.3"
    customization_enabled: true
    customizer_configs:
      target: "meta/llama-3.2-1b-instruct@2.0"
      gpus: 1
      max_seq_length: 8192
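Under the hood, each fine-tuning run corresponds to a customization job submitted to NeMo Customizer. The sketch below shows the general shape of such a request; the endpoint path and payload fields are assumptions and should be verified against the NeMo Customizer microservice documentation, since the flywheel orchestrator normally submits these jobs for you.

import requests

# Sketch: submit a LoRA fine-tuning job to NeMo Customizer. The endpoint path
# and payload fields are assumptions of the general job schema; verify them
# against the NeMo Customizer docs. The orchestrator normally does this for you.
CUSTOMIZER_URL = "http://nemo-customizer:8000"  # assumed in-cluster service name

job_spec = {
    "config": "meta/llama-3.2-1b-instruct@2.0",  # the customization target above
    "dataset": {"name": "news-classifier-25k"},  # hypothetical datastore dataset
    "hyperparameters": {
        "training_type": "sft",
        "finetuning_type": "lora",
        "epochs": 2,
        "lora": {"adapter_dim": 16},
    },
}

resp = requests.post(f"{CUSTOMIZER_URL}/v1/customization/jobs", json=job_spec, timeout=30)
resp.raise_for_status()
print("Customization job:", resp.json().get("id"))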
Step 4: Evaluate runs
We then compare the performance of the student models with and without customization. This is done by comparing the F1-score for each candidate student (see the sketch after this list), referred to as:
- base-eval: Zero-shot F1-score baseline of student model before customization
- customized-eval: F1-score evaluation of customized model
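In essence, both numbers reduce to a macro F1-score of the student’s predictions against the teacher’s labels. A minimal sketch, assuming the prediction lists have already been collected from the base and customized students:

from sklearn.metrics import f1_score

# Sketch: compute base-eval and customized-eval as macro F1 against the
# teacher's labels. Assumes prediction lists are already collected.
def relative_f1(teacher_labels, base_preds, customized_preds):
    return {
        "base-eval": f1_score(teacher_labels, base_preds, average="macro"),
        "customized-eval": f1_score(teacher_labels, customized_preds, average="macro"),
    }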
Step 5: Scoring and aggregation
Model outputs are scored using NeMo Evaluator, and results are reported back through the Orchestrator API. We aggregate these results over different students and corresponding dataset sizes.
Step 6: Review and promotion
Developers can programmatically access metrics, download artifacts, launch follow-up experiments, or promote top-performing candidates to production to replace the teacher NIM.
This loop can be scheduled or triggered on demand, creating an automated, scalable system that continuously and progressively surfaces smaller, faster, and more cost-efficient models, while preserving the accuracy of the larger baseline model.
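A hedged sketch of what programmatic review might look like, polling the orchestrator for a job’s results and picking the strongest customized candidate; the endpoint and response fields are assumptions and will differ in your deployment:

import requests

# Sketch: fetch flywheel job results and select the best customized candidate.
# The endpoint path and response fields are assumptions; adapt them to the
# orchestrator API in your deployment.
ORCHESTRATOR_URL = "http://localhost:8000"  # assumed local orchestrator endpoint
job_id = "<job-id>"  # hypothetical ID returned when the job was launched

resp = requests.get(f"{ORCHESTRATOR_URL}/api/jobs/{job_id}", timeout=30)
resp.raise_for_status()
candidates = resp.json().get("nims", [])  # hypothetical per-student results

best = max(
    (c for c in candidates if c.get("customized-eval") is not None),
    key=lambda c: c["customized-eval"],
    default=None,
)
if best:
    print("Promote:", best["model_name"], "customized-eval F1:", best["customized-eval"])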
Results
The reported F1-scores in Table 1 and Figure 3 are evaluated on a held-out test set and are given relative to the F1-score of the large teacher model. In this setup, the teacher model is considered to have a perfect F1-score, against which each distilled student model is compared.
The table clearly shows that larger student models have a greater capacity to learn from the teacher’s supervision and achieve higher scores even with a small number of examples. As the number of training examples increases, the quality of the distilled model improves for each student model size. With enough examples, they converge to similar F1-scores.
These results show the trade-offs and possible gains of using larger student models and more training data during distillation. Practical factors such as data availability, hardware constraints, latency, and throughput at inference time influence the optimal choices for each application within the AI Model Distillation for Financial Data developer example.


| Training Data (examples) | Model Name | F1-Score (relative to teacher) |
| --- | --- | --- |
| 5000 | meta/llama-3.2-1b-instruct | 0.29 |
| 10000 | meta/llama-3.2-1b-instruct | 0.78 |
| 25000 | meta/llama-3.2-1b-instruct | 0.9 |
| 5000 | meta/llama-3.2-3b-instruct | 0.584 |
| 10000 | meta/llama-3.2-3b-instruct | 0.89 |
| 25000 | meta/llama-3.2-3b-instruct | 0.95 |
| 5000 | meta/llama-3.1-8b-instruct | 0.8 |
| 10000 | meta/llama-3.1-8b-instruct | 0.94 |
| 25000 | meta/llama-3.1-8b-instruct | 0.95 |
Table 1. F1-scores of distilled student models, relative to the teacher model
Model distillation in finance enables smaller, faster models to match the performance of complex ones, improving efficiency and explainability without sacrificing accuracy. By transferring knowledge from large teacher models to lightweight students, the AI Model Distillation for Financial Data developer example supports faster decision-making for feature engineering and signal generation, risk management, and surveillance.
Learn more
Model compression continues to advance rapidly, driving new possibilities for deploying LLMs efficiently across industries. Learn more with the following resources:
Get started
Visit build.nvidia.com to deploy the notebook in a GPU-accelerated environment using NVIDIA Brev, or in your own cloud infrastructure using the standard GitHub repository.
