If you are building a RAG (Retrieval-Augmented Generation) system, you've likely hit this wall: everything works… until it doesn't. General-purpose embedding models are trained to understand the web, not your contracts, manufacturing logs, proprietary chemical formulations, or internal taxonomy. They capture broad semantic similarity, but they don't understand the fine-grained distinctions that matter in your domain. Fine-tuning an embedding model can improve the performance of your retrieval pipeline when off-the-shelf models fail to capture domain-specific nuances. Despite how critical embeddings are to RAG performance, the process remains surprisingly fragmented, the skills required are specialized, and the time investment is daunting.
With a single GPU and less than a day of training time, you can transform a general-purpose embedding model into one that truly understands your domain, with no manual labeling required. To help you hit the ground running, we're also releasing a ready-to-use synthetic training dataset generated from NVIDIA's public documentation using this exact pipeline. Using this data and the recipe, we saw over 10% improvement in both Recall@10 and NDCG@10. Atlassian applied this recipe to fine-tune on their Jira dataset, increasing Recall@60 from 0.751 to 0.951, a 26.7% improvement, on a single GPU.
🔗Quick Links to Dataset and Code:
🧑💻Open Source Projects the Recipe Integrates:
- NeMo Data Designer for synthetic data generation
- NeMo Automodel for embedding model training
- BEIR for information retrieval evaluation
- NeMo Export-Deploy for ONNX/TensorRT conversion
- NVIDIA NIM for production inference serving
📋Prerequisites:
- A directory of domain documents (text files – .txt, .md, or similar)
- A valid NVIDIA API key (free at build.nvidia.com)
- NVIDIA Ampere GPU or newer with at least 80GB memory (Compute Capability >= 8.0)
- This tutorial has been tested on 1xA100 (80GB) and 1xH100 (80GB)
By the end of this post, you'll know how to:
📄 Generate training data from domain documents without labeled data
🎯 Use hard negative mining for effective contrastive training
🔗 Improve embedding quality with multi-hop queries
⚙️ Fine-tune a bi-encoder embedding model
📊 Evaluate whether fine-tuning improves retrieval
🚀 Deploy the fine-tuned model in your pipeline
⚙️Setup
In this tutorial, we'll fine-tune the base model Llama-Nemotron-Embed-1B-v2, a 1-billion-parameter embedding model that balances quality and inference cost. To start, follow this setup guide.
📚 Step 1: Generate Training Data from Documents
Fine-tuning an embedding model requires thousands of (query, relevant document) pairs. Most use cases don't have this data available. Creating it manually is expensive, slow, and often biased by the annotator's personal interpretation of what's "relevant."
Instead of labeling data by hand, you can use an LLM (nvidia/nemotron-3-nano-30b-a3b) to read your documents and automatically generate high-quality synthetic query–answer pairs:
nemotron embed sdg -c default corpus_dir=./data/my_domain_docs
How does it work?
Behind the scenes, this runs a four-stage synthetic data generation (SDG) pipeline powered by NeMo Data Designer:
What does the output look like?
Source document chunk:
The thermal design power (TDP) of the H100 GPU is 700W in the SXM form factor. The cooling solution must maintain junction temperature below 83°C under sustained workloads. Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.
Generated QA pairs:
{
"query": "What cooling approach is recommended when deploying more than 4 H100 GPUs per server node?",
"answer": "Liquid cooling is recommended for dense deployments exceeding 4 GPUs per node, as air cooling cannot dissipate sufficient heat in standard 2U chassis configurations.",
"query_type": "contextual",
"reasoning_type": "factual",
"question_complexity": 3,
"segment_ids": [1],
"quality_score": 8.5
}
{
"query": "How does the 700W TDP of the H100 SXM constrain the choice between air and liquid cooling in multi-GPU configurations?",
"answer": "The 700W TDP generates substantial heat that must be dissipated to maintain junction temperatures below 83°C. In dense configurations exceeding 4 GPUs per node, air cooling in standard 2U chassis cannot handle this thermal load, making liquid cooling necessary.",
"query_type": "multi_hop",
"reasoning_type": "causal",
"question_complexity": 4,
"segment_ids": [1, 2],
"hop_count": 2,
"quality_score": 9.0
}
Notice the difference: the first query is a simple factual lookup. The second requires multi-hop, causal reasoning. The pipeline generates both types, with configurable complexity levels (2–5) and hop counts (1–3). Each QA pair then undergoes quality evaluation, receiving sub-scores for relevance, accuracy, context support, and clarity, along with an overall score. Only pairs that meet the threshold are included in training.
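As a minimal illustration, the final quality gate amounts to a threshold filter over the generated pairs. This sketch uses the `quality_score` field shown in the JSON above; the threshold value of 7.0 is an assumption for illustration, not the pipeline's actual default.

```python
# Sketch of the SDG quality gate: keep only QA pairs whose overall
# quality_score clears a threshold (7.0 is an assumed value).
def filter_by_quality(pairs, threshold=7.0):
    return [p for p in pairs if p.get("quality_score", 0.0) >= threshold]

pairs = [
    {"query": "simple factual lookup", "quality_score": 8.5},
    {"query": "ambiguous, poorly grounded question", "quality_score": 4.0},
]
kept = filter_by_quality(pairs)  # only the 8.5-scored pair survives
```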
⛏️ Step 2: Mine Hard Negatives (and Why They Matter)
If you train an embedding model with only positive pairs (query + correct document), it learns to distinguish obviously different documents but fails on the hard cases — passages that look relevant but are not the right answer. In a real retrieval system, these near-misses are precisely the documents that cause bad answers. Hard negative mining finds these confusing passages so the model can learn to tell them apart.
nemotron embed prep -c default
The above command runs three sub-steps automatically:
2a. Train / Validation / Test Split
The generated QA pairs are split into training (80%) and test (20%) sets. The test set is formatted as a BEIR-compatible benchmark for standardized evaluation in Step 5.
2b. Hard Negative Mining
Using the base embedding model, the pipeline:
- Embeds every query and every passage in the corpus.
- Computes similarity between each query and all passages.
- Masks out each query’s labeled positive documents.
- Applies a margin filter: any non-positive document scoring above 95% of the minimum positive score is removed. This exclusion zone guards against false negatives — unlabeled passages that are so close to the positive that they might actually be relevant.
- From the surviving candidates, selects the top-k highest-scoring documents as hard negatives (5 per query by default).
The result: hard negatives are the most similar non-positive passages that still fall safely below the positive-score ceiling. They are passages the current model considers highly relevant but that are not the labeled answer.
Why this works: training on easy negatives (completely unrelated passages) teaches the model nothing new. Training on hard negatives forces it to learn the subtle distinctions that matter in your domain. For example, in a medical corpus, a question about "metformin dosage for Type 2 diabetes" might have hard negatives about "metformin side effects" or "insulin dosage for Type 1 diabetes" — close but critically different. The 95% margin ceiling prevents the miner from picking passages that are too close to the positive, which might actually be correct answers that simply weren't labeled during SDG.
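The mining procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch under the assumptions stated in the steps (cosine similarity, positive scores above zero); the actual pipeline batches the similarity computation on GPU.

```python
import numpy as np

def mine_hard_negatives(query_emb, doc_embs, positive_ids, k=5, margin=0.95):
    """Select the k most similar non-positive passages that fall safely
    below 95% of the minimum positive score (the exclusion zone)."""
    # Cosine similarity between the query and every passage in the corpus.
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    # Margin filter: the ceiling assumes positive similarity scores.
    ceiling = margin * scores[list(positive_ids)].min()
    candidates = [
        (i, s) for i, s in enumerate(scores)
        if i not in positive_ids and s < ceiling
    ]
    candidates.sort(key=lambda t: -t[1])  # hardest (highest-scoring) first
    return [i for i, _ in candidates[:k]]
```

On a toy 2D corpus, a near-duplicate of the positive (similarity ≈ 0.99) falls inside the exclusion zone and is skipped, while the next-most-similar passages become the hard negatives.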
2c. Multi-Hop Unrolling
Multi-hop questions reference multiple positive documents. For example, a question like "How does the thermal management system in Section 3.2 relate to the power constraints described in Section 5.1?" has two positive passages.
Unrolling creates one training example per (query, positive document) pair, so the contrastive loss sees each positive independently. A question with 2 positive documents becomes 2 training examples, each with the same hard negatives but a different positive.
The final output is a training-ready JSON file:
{
"question_id": "q42_0",
"query": "How does the 700W TDP of the H100 SXM constrain cooling decisions in multi-GPU nodes?",
"pos_doc": [{"id": "d_a1b2c3"}],
"neg_doc": [{"id": "d_x7y8z9"}, {"id": "d_m4n5o6"}, {"id": "d_p1q2r3"}, {"id": "d_s4t5u6"}, {"id": "d_v7w8x9"}]
}
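The unrolling step itself is simple to picture. A minimal sketch: the `_i` suffix on `question_id` is an assumption based on the `q42_0` example above, and the field names mirror that JSON.

```python
def unroll(example):
    """One training example per (query, positive) pair; the hard
    negatives are shared across the unrolled copies."""
    return [
        {
            "question_id": f"{example['question_id']}_{i}",
            "query": example["query"],
            "pos_doc": [pos],
            "neg_doc": example["neg_doc"],
        }
        for i, pos in enumerate(example["pos_doc"])
    ]

multi_hop = {
    "question_id": "q42",
    "query": "How does TDP constrain cooling choices?",
    "pos_doc": [{"id": "d_a1"}, {"id": "d_b2"}],
    "neg_doc": [{"id": "d_x7"}, {"id": "d_m4"}],
}
examples = unroll(multi_hop)  # two examples, one per positive
```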
🔍 Step 3: Understand Multi-Hop Questions and Why They Improve Retrieval
Standard embedding fine-tuning generates one query per passage and trains the model to match them. This works for simple factual lookups, but real users ask complex questions that span multiple documents or sections. If the model has only seen single-hop training data, it will struggle to retrieve all the relevant passages for these complex queries.
The SDG pipeline generates questions at 1 to 3 hops by default:
- 1-hop: “What’s the TDP of the H100 SXM?” — answered by a single passage.
- 2-hop: “How does the H100’s TDP relate to cooling requirements in dense deployments?” — requires connecting information from two passages.
- 3-hop: "Given the TDP, cooling constraints, and rack density limits, what is the maximum number of H100 GPUs deployable in a standard data center row?" — synthesizes three passages.
Each hop is tracked with its own context summary and segment IDs, so the training data preserves the full reasoning chain. After unrolling (Step 2c), each (query, relevant passage) pair becomes an independent training signal, teaching the model that all of these passages are relevant to the multi-hop query.
The fine-tuned model learns to retrieve contextually related documents, not just lexically similar ones.
🧠 Step 4: Fine-Tune the Embedding Model
nemotron embed finetune -c default
How contrastive learning works
The training uses a bi-encoder architecture with contrastive loss.
The temperature of 0.02 is deliberately aggressive: it produces a very sharp probability distribution. This works well because the hard negatives from Step 2 are high-quality: they are genuinely confusing passages that the model needs strong gradients to learn to distinguish.
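To make the temperature's effect concrete, here is a minimal NumPy sketch of the per-query contrastive (InfoNCE) loss. The real trainer operates on batches with in-batch negatives, so treat this as illustration only.

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.02):
    """Cross-entropy over [positive, negatives] cosine similarities,
    sharpened by dividing by a low temperature."""
    def unit(v):
        v = np.asarray(v, dtype=float)
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    sims = np.concatenate([
        [unit(query) @ unit(positive)],   # positive at index 0
        unit(negatives) @ unit(query),    # one score per hard negative
    ]) / temperature
    sims -= sims.max()                    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum())
    return -log_probs[0]                  # -log p(positive)

# With the positive clearly on top, a low temperature drives the loss
# toward zero; a soft temperature leaves a sizable residual loss.
sharp = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [0.5, 0.5]], 0.02)
soft = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [0.5, 0.5]], 1.0)
```

The flip side of a sharp distribution is that a near-miss negative produces a very large gradient, which is exactly what high-quality hard negatives need.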
Key hyperparameters
| Parameter | Default | Notes |
|---|---|---|
| Epochs | 3 | For large datasets, you may lower this to 2 or 1 |
| Learning rate | 1e-5 | Tuning: try double and half of the default value |
| Learning rate warmup steps | 5 | Set to 5–10% of total fine-tuning steps for better early-training stability |
| Global batch size | 128 | Auto-scaled down for small datasets |
| Passages per query | 5 | 1 positive + 4 hard negatives |
Auto-scaling for small datasets
If your dataset has fewer than 2,000 training examples, the pipeline automatically:
- Reduces the batch size (to 16–64) so gradients are meaningful.
- Adjusts checkpoint frequency to ensure at least three checkpoints per run.
- Scales validation frequency proportionally.
This means you can start with a small corpus (50–100 documents) for a quick proof-of-concept and scale up later.
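To illustrate the idea (the exact scaling rule is internal to the pipeline; the cutoff and range come from the text above, but the steps-per-epoch target is a made-up number):

```python
def auto_batch_size(n_examples, default=128, cutoff=2000, lo=16, hi=64):
    """Hypothetical auto-scaling rule: below the cutoff, shrink the
    global batch so each epoch still yields a useful number of optimizer
    steps. The ~30-steps-per-epoch target is an assumption."""
    if n_examples >= cutoff:
        return default
    return max(lo, min(hi, n_examples // 30))
```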
📈 Step 5: Measure the Improvement
Did fine-tuning actually help? Let's find out by running a standardized evaluation comparing the base model against the fine-tuned checkpoint on the held-out test set:
nemotron embed eval -c default
The evaluation uses the BEIR framework and computes four standard information retrieval metrics at k = 1, 5, 10, and 100:
- nDCG@k: Ranking quality — are the best documents ranked highest?
- Recall@k: Coverage — what fraction of relevant documents appear in the top k?
- Precision@k: Accuracy — what fraction of the top k results are actually relevant?
- MAP@k: Average precision across all queries.
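The metrics themselves are easy to compute by hand. A minimal sketch with binary relevance (BEIR also supports graded relevance, which this ignores; `relevant` is assumed non-empty):

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant set found in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: log-discounted gain vs. the ideal ordering."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant), k)))
    return dcg / ideal
```

Recall ignores ordering within the top k; nDCG rewards placing relevant documents earlier, which is why both are reported.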
A successful fine-tune typically results in a ~15% improvement in nDCG@10 and Recall@10, achieved in under a day.
Results using the Retrieval-Synthetic-NVDocs dataset:
📊 Comparison (Base → Fine-tuned)
============================================================
NDCG:
NDCG@1: 0.55178 → 0.60796 (+0.05618, +10.2%)
NDCG@5: 0.51894 → 0.57689 (+0.05795, +11.2%)
NDCG@10: 0.55506 → 0.61559 (+0.06053, +10.9%)
NDCG@100: 0.60617 → 0.66567 (+0.05950, +9.8%)
Recall:
Recall@1: 0.28478 → 0.31547 (+0.03069, +10.8%)
Recall@5: 0.54486 → 0.60288 (+0.05802, +10.6%)
Recall@10: 0.62979 → 0.69296 (+0.06317, +10.0%)
Recall@100: 0.81421 → 0.87020 (+0.05599, +6.9%)
What if the numbers don’t improve?
The pipeline makes it easy to iterate:
- Low quality scores in SDG? Check your document quality — clean, well-formatted text produces better synthetic data. Try a larger, more powerful LLM.
- Not enough training data? Add more documents to your corpus and re-run Stage 0.
- Overfitting? Reduce epochs or raise the quality threshold to keep only the best training examples.
- Incorrect learning rate? Try 5e-6 for larger datasets or 2e-5 for very small ones.
🏆 Real-World Results: Atlassian
This recipe has been validated on real enterprise data by Atlassian. They applied this pipeline to fine-tune Llama-Nemotron-Embed-1B-v2 on a public Jira dataset using a single NVIDIA A100 80GB GPU, following the same stages described above.
Recall@60 jumped from 0.751 to 0.951 — a 26.7% gain.
The fine-tuned model retrieves the correct document within the top 60 results for 95.1% of queries, up from 75.1% with the base model. For a retrieval system underpinning Jira search, this directly translates into more relevant results for millions of users. Find more details in their blog post Advancing semantic search for millions of Rovo users.
🚀 Step 6: Export and Deploy
A PyTorch checkpoint is great for evaluation but too slow for production. The final two stages convert the model and serve it behind an API.
Export to ONNX / TensorRT
nemotron embed export -c default
This exports the fine-tuned checkpoint to ONNX (opset 17). Optionally, it compiles a TensorRT engine for maximum inference throughput, with configurable optimization profiles for batch size (1–64) and sequence length (3–256):
# ONNX only (runs anywhere)
nemotron embed export -c default export_to_trt=false
# FP8 quantization for further speedup
nemotron embed export -c default quant_cfg=fp8
Deploy with NVIDIA NIM
The exported model is deployed inside an NVIDIA NIM container — a production-ready inference microservice exposing an OpenAI-compatible /v1/embeddings endpoint:
nemotron embed deploy -c default
Once running, any client can call it:
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": ["What cooling is needed for 8 H100 GPUs in a 2U chassis?"],
       "model": "custom",
       "input_type": "query"}'
Because NIM serves an OpenAI-compatible API, you can drop it into any existing RAG pipeline that uses the embeddings API format — no code changes needed.
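The same call from Python, using only the standard library. This is a sketch: the URL and payload mirror the curl example above, the response parsing assumes the OpenAI embeddings response shape (`data[i].embedding`), and `embed` assumes the NIM container is running locally.

```python
import json
from urllib import request

NIM_URL = "http://localhost:8000/v1/embeddings"  # default local NIM endpoint

def build_payload(texts, input_type="query", model="custom"):
    """OpenAI-compatible embeddings request body (mirrors the curl example)."""
    return {"input": texts, "model": model, "input_type": input_type}

def embed(texts, url=NIM_URL, input_type="query"):
    """POST to the deployed endpoint and return one vector per input text."""
    req = request.Request(
        url,
        data=json.dumps(build_payload(texts, input_type)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```

Note the `input_type` field: query and passage embeddings are requested differently, so use `"query"` at search time and `"passage"` when indexing your corpus.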
Confirm deployment accuracy
The pipeline includes a NIM accuracy verification step that runs the same BEIR evaluation against the deployed endpoint:
nemotron embed eval -c default eval_nim=true eval_base=false
This catches any accuracy loss from the ONNX/TensorRT conversion. Metrics that match within tolerance (0.03 for @1, 0.01 for @5+) are marked with a check; deviations beyond conversion noise are flagged.
Putting It All Together
The complete embedding fine-tuning pipeline can be run in six commands, from raw documents to a deployed model.
# 1. Generate synthetic training data from your documents
nemotron embed sdg -c default corpus_dir=./data/my_docs
# 2. Prepare the training data (split data, mine hard negatives, unroll)
nemotron embed prep -c default
# 3. Fine-tune the embedding model
nemotron embed finetune -c default
# 4. Evaluate the base vs. fine-tuned model
nemotron embed eval -c default
# 5. Export the optimized model
nemotron embed export -c default
# 6. Deploy the model
nemotron embed deploy -c default
Expected time and resources
| Stage | GPU Required? | Estimated Time | Notes |
|---|---|---|---|
| SDG | No (uses API) | ~1 hour | Varies by corpus size and API rate limit |
| Data Prep | Yes (40 GB VRAM) | ~5 min | Hard negative mining on GPU |
| Fine-Tune | Yes (80 GB VRAM) | ~1 hour | Varies by dataset size and epochs |
| Eval | Yes (40 GB VRAM) | ~5 min | |
| Export | Yes (40 GB VRAM) | ~5 min | TensorRT requires NGC container |
| Deploy | Yes (40 GB VRAM) | ~5 min | NIM container startup |
Total: under a day, with most of the time being hands-off training. For a small corpus (~500 documents), the entire pipeline completes in about 2–3 hours.
The pipeline can run end-to-end, but each stage can also be executed independently depending on your starting point. For example, if you have raw documents, you can begin with synthetic data generation (SDG), while datasets that already include hard negatives can skip the earlier steps and go directly to fine-tuning. Since every stage uses standard formats such as JSON, BEIR, and ONNX, it's easy to integrate custom components or reuse intermediate outputs in other workflows. The recipe is also flexible in how it runs, supporting execution on a local machine, inside Docker containers, or on Slurm-based clusters.
Try It Yourself
If you have domain documents and some time on your hands, you can generate your first batch of synthetic training data today! The complete pipeline, from documents to a deployed, domain-adapted embedding model, runs in under a day on a single GPU. You can start with our ready-made nvidia/Retrieval-Synthetic-NVDocs-v1 dataset to try the pipeline right away. Let us know what you build.
Star the repos for Nemotron, NeMo Data Designer, and NeMo Automodel if you find them useful.



