In the heart of every modern electronic device lies a silicon chip, built through a manufacturing process so precise that even a microscopic defect can determine success or failure. As semiconductor devices grow more complex, reliably detecting and classifying defects has become a critical bottleneck.
Historically, chipmakers have relied on convolutional neural networks (CNNs) to automate defect classification (ADC). But as manufacturing scales and diversifies, CNN-based approaches are hitting their limits: they require large labeled datasets and frequent retraining, and still struggle to generalize to new defect types.
In this post, we show how generative AI-powered ADC can overcome these challenges.
The workflows below leverage NVIDIA Metropolis vision language models (VLMs), vision foundation models (VFMs), and the NVIDIA TAO fine-tuning toolkit to modernize defect classification. We outline the constraints of traditional CNN-based systems, detail how VLMs and VFMs address them, and highlight specific approaches and the manufacturing challenges they help solve.
The limits of CNNs in semiconductor defect classification
CNNs have long been the backbone of defect detection in semiconductor fabs, supporting optical and e-beam inspection, lithographic evaluation, and more. They excel at extracting visual features from large datasets, but manufacturers face persistent challenges related to data requirements, semantic understanding, and retraining.
High data requirements
Achieving high accuracy often requires hundreds of labeled images per defect class. Rare or emerging defects frequently lack sufficient examples for effective training.
Limited semantic understanding
While CNNs capture visual features, they cannot interpret context, perform root-cause analysis, or integrate multimodal data. They also struggle to distinguish visually similar yet operationally distinct defect patterns, such as center vs. local defects.
Frequent retraining
Real-world manufacturing is dynamic. Process variations, new tools, and evolving product lines require models to be retrained continuously to recognize new defect types and imaging conditions.
These limitations force fabs to depend on manual inspection, which is expensive, inconsistent, and unable to scale with today’s manufacturing throughput.
Modernizing ADC with VLMs and VFMs
To address these challenges, NVIDIA applies VLMs, VFMs, and self-supervised learning across multiple stages of semiconductor manufacturing. Figure 1 illustrates how these models are deployed across front-end-of-line (FEOL) and back-end packaging processes.
In this post, we demonstrate how VLMs classify wafer map images and how VFMs classify die-level images, including optical, e-beam, and back-end optical microscopy (OM) inspection data. With further training, VLMs also show strong potential for die-level inspection.


Wafer-level intelligence with VLMs
Wafer maps provide a spatial view of defect distributions across an entire wafer. VLMs combine advanced image understanding with natural language reasoning. After fine-tuning, NVIDIA reasoning VLMs, such as Cosmos Reason, can interpret wafer map images to identify macro defects, generate natural language explanations, perform interactive Q&A, and compare test images against “golden” references for preliminary root-cause analysis.


Using this approach offers several benefits:
- Few-shot learning: VLMs can be fine-tuned with only a small number of labeled examples, enabling rapid adaptation to new defect patterns, process changes, or product variations.
- Explainability: As shown in Figure 2, Cosmos Reason produces interpretable results that engineers can interact with using natural language (see the sketch after this list). For instance, asking “What is the primary defect pattern on this wafer map?” might return “Center ring defect detected, likely due to chemical contamination.” This semantic reasoning ability goes beyond CNNs, helping engineers quickly identify potential root causes, speed up corrective actions, and reduce the number of manual reviews.
- Automated data labeling: VLMs can generate high-quality labels for downstream ADC tasks, reducing the time and cost of model development. In practice, this approach can cut model build times by up to 2x compared with manual labeling workflows.
- Time-series and lot-level analysis: VLMs can process both still images and video sequences, enabling them to proactively monitor process anomalies over time and mitigate errors before they lead to critical failures. In one study, VLMs achieved high accuracy across both OK and NG cases, outperforming traditional CNN-based methods.
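To make the interactive Q&A concrete, here is a minimal sketch, assuming the fine-tuned VLM is served behind an OpenAI-compatible chat completions endpoint; the endpoint URL, model name, and image path are placeholders rather than part of any official workflow:
# Minimal sketch: ask a deployed VLM about a wafer map image.
# The endpoint URL and model name below are hypothetical placeholders.
import base64
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # hypothetical local deployment
MODEL = "cosmos-reason-wafer-sft"                        # hypothetical fine-tuned model name

def ask_about_wafer_map(image_path: str, question: str) -> str:
    # Encode the wafer map image as a base64 data URL for the multimodal message
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    payload = {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
        "max_tokens": 256,
    }
    response = requests.post(ENDPOINT, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_about_wafer_map("wafer_map_001.png",
                              "What is the primary defect pattern on this wafer map?"))
The same request pattern extends to comparing a test wafer map against a golden reference by attaching both images to the message.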


Getting started with Cosmos Reason
Here’s a sample workflow to fine-tune Cosmos Reason 1—from data preparation to supervised fine-tuning and evaluation on a prepared dataset of wafer map defects.
- Go to the Cosmos Cookbook Wafer Map Anomaly Classification
- Create a sample training dataset: Download the open WM-811K wafer map dataset produced by MIR Lab, which is available for public use. Generate a sample dataset and corresponding annotations with the scripts provided in the cookbook (a rough illustration follows this list).
- Post-train with supervised fine-tuning (SFT): Follow the installation instructions in the cosmos-reason1 GitHub repository and install the cosmos-rl package to enable fine-tuning with the curated training dataset.
- Deploy
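As a rough illustration of the dataset-creation step, the sketch below renders WM-811K wafer maps to images and writes a simple label file. It assumes the downloaded pickle holds a DataFrame with waferMap arrays and failureType labels; verify the column names against your copy, and use the cookbook scripts for the annotation format actually expected by SFT:
# Illustrative only: convert WM-811K wafer map arrays to images plus labels.
import json
import numpy as np
import pandas as pd
from PIL import Image

df = pd.read_pickle("LSWMD.pkl")  # path to the downloaded WM-811K pickle (assumed file name)
records = []

for idx, row in df.head(100).iterrows():            # small sample for illustration
    wafer = np.asarray(row["waferMap"], dtype=np.uint8)
    # Map {0: background, 1: normal die, 2: defect die} to grayscale so defects stand out
    img = Image.fromarray((wafer * 127).astype(np.uint8), mode="L")
    img_path = f"wafer_{idx}.png"
    img.save(img_path)
    # failureType may be stored as a nested array; flatten or clean as needed
    records.append({"image": img_path, "label": str(row["failureType"])})

with open("annotations.json", "w") as f:
    json.dump(records, f, indent=2)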
Result: Fine-tuning Cosmos Reason on wafer map defect classification data boosts accuracy from zero-shot levels to over 96% on defect classification tasks.
Die-level precision with VFMs and self-supervised learning
The semiconductor industry continues to push the boundaries of physics as device features shrink to microscopic scales. At this level, manufacturing complexity rises dramatically. Even the slightest anomaly (a stray particle, pattern deviation, or material defect) can render a chip unusable, directly affecting yield and profitability. In this high-stakes environment, the biggest bottleneck is the ability to rapidly and accurately detect and classify defects. CNNs have supported this workflow for years, but they struggle to keep pace with the growing complexity and data demands of modern fabs.
A core challenge in training AI models for manufacturing is the dependence on large, meticulously labeled datasets. Dynamic processes, evolving product lines, and the continual emergence of new defect types make it impractical to maintain a perfectly labeled dataset. Compounding the issue, datasets are often highly imbalanced, with normal samples vastly outnumbering defective ones.
Using a leading VFM such as NV-DINOv2 provides benefits, including:
- Self-supervised learning (SSL): NV-DINOv2 is trained on millions of unlabeled images, enabling it to generalize to new defect types and process conditions with minimal retraining when labeled data is scarce.
- Robust feature extraction: The model captures both fine-grained visual details and high-level semantic information, improving classification accuracy across diverse manufacturing scenarios.
- Operational efficiency: By reducing dependence on labeling and frequent retraining, NV-DINOv2 streamlines the deployment and maintenance of defect-inspection systems in fast-moving fab environments.
However, general foundation models like NV-DINOv2 lack the domain-specific detail required for industrial tasks such as e-beam and optical microscopy inspection. To achieve maximum accuracy, the model must be specialized through domain adaptation.
This is a multi-stage workflow:
- General VFM: Begin with the powerful, pre-trained NV-DINOv2 model that has broad visual understanding learned from large, diverse datasets.
- Domain adaptation: Fine-tune the model using a large, unlabeled, domain-specific dataset, such as millions of images from semiconductor fabs, to align it with industrial imaging characteristics.
- Downstream task fine-tuning: Apply a small set of labeled images to fine-tune the model for a specific classification task, a step often called linear probing (a conceptual sketch follows this list).
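Conceptually, linear probing freezes the adapted backbone and trains only a small classification head on its embeddings. The sketch below illustrates the idea in PyTorch using the public DINOv2 ViT-S/14 checkpoint as a stand-in for a domain-adapted NV-DINOv2; the class count, input resolution, and hyperparameters are assumptions based on the PCB example later in this post, and TAO performs this step for you in the workflow described in the next section:
# Illustrative linear probing: frozen backbone, trainable linear head.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen backbone: only the linear head below receives gradient updates
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(384, 6).to(device)  # 384-dim ViT-S embeddings, 6 assumed defect classes

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
train_ds = datasets.ImageFolder("/data/train_images", transform=transform)
loader = DataLoader(train_ds, batch_size=32, shuffle=True)

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(10):
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():
            features = backbone(images)   # class-token embeddings from the frozen backbone
        loss = criterion(head(features), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")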


The effectiveness of this process depends heavily on the size and quality of the unlabeled domain dataset. These datasets can range from fewer than a million images to hundreds of millions, but quantity alone is not enough. A meticulous data-cleaning pipeline is essential to remove redundant, blurry, or irrelevant images before training begins.
This domain-adaptation approach delivers significant performance gains. In one study by a leading semiconductor manufacturer, the NVIDIA TAO Toolkit was used to apply self-supervised learning (SSL) to NV-DINOv2 using unlabeled images collected across multiple layers of the chip-production process. Incorporating SSL consistently improved performance, boosting accuracy by up to 8.9% compared with a model trained without SSL, which led to productivity gains of up to 9.9%.
Getting started with NV-DINOv2 and SSL
The following is an end-to-end workflow to fine-tune NV-DINOv2 using SSL, from data preparation and domain adaptation to downstream task fine-tuning and deployment. In this example, we use the NVIDIA TAO Toolkit to perform SSL on unlabeled PCB images for defect classification.
The NV-DINOv2 workflow follows a progressive, three-phase approach that maximizes the value of large unlabeled datasets while reducing the need for manual annotation to just a few hundred labeled samples.
1. Set up your environment: Download the NVIDIA TAO Toolkit 6.0 container from NVIDIA NGC, which has all dependencies pre-installed:
# Pull the TAO Toolkit 6.0 container from NGC
docker pull nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt
# Run the container with GPU support
docker run --gpus all -it -v /path/to/data:/data \
  nvcr.io/nvidia/tao/tao-toolkit:6.0.0-pyt /bin/bash
2. Prepare your dataset: NV-DINOv2 accepts RGB images in standard formats (JPG, PNG, BMP, TIFF, WebP) stored in a single directory. For SSL domain adaptation, you only need unlabeled images; no annotations are required.
In our PCB inspection example, we used:
- ~400 labeled test samples for evaluation
- ~1 million unlabeled PCB images for domain adaptation
- ~600 labeled training samples for downstream fine-tuning
Organize your data as follows:
/data/
├── unlabeled_images/ # For SSL domain adaptation
├── train_images/ # For downstream fine-tuning
│ ├── OK/
│ ├── missing/
│ ├── shift/
│ ├── upside_down/
│ ├── poor_soldering/
│ └── foreign_object/
└── test_images/ # For evaluation
Data cleaning best practice: Before training, perform a meticulous data cleaning pass to remove redundant, blurry, or irrelevant images. The effectiveness of domain adaptation depends heavily on the quality of your unlabeled dataset; a minimal cleaning sketch follows.
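The sketch below drops exact duplicates via file hashing and flags blurry images with a variance-of-Laplacian check; the blur threshold is an assumption and should be tuned per inspection tool and magnification:
# Minimal cleaning sketch: remove duplicate, unreadable, and blurry images in place.
import hashlib
from pathlib import Path

import cv2

BLUR_THRESHOLD = 100.0   # assumed cutoff; tune for your imaging conditions
seen_hashes = set()

for path in Path("/data/unlabeled_images").glob("*"):
    if not path.is_file():
        continue
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    if digest in seen_hashes:              # exact duplicate of an earlier file
        path.unlink()
        continue
    seen_hashes.add(digest)

    img = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
    if img is None:                        # unreadable or corrupt file
        path.unlink()
        continue
    if cv2.Laplacian(img, cv2.CV_64F).var() < BLUR_THRESHOLD:
        path.unlink()                      # likely too blurry to help domain adaptation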
3. Configure the training specification: Create a YAML specification file that defines your model architecture, dataset paths, and training parameters:
model:
  backbone:
    teacher_type: "vit_l"
    student_type: "vit_l"
    patch_size: 14
    img_size: 518
    drop_path_rate: 0.4
  head:
    num_layers: 3
    hidden_dim: 2048
    bottleneck_dim: 384
dataset:
  train_dataset:
    images_dir: /data/unlabeled_images
  test_dataset:
    images_dir: /data/test_images
  batch_size: 16
  workers: 10
  transform:
    n_global_crops: 2
    global_crops_scale: [0.32, 1.0]
    global_crops_size: 224
    n_local_crops: 8
    local_crops_scale: [0.05, 0.32]
    local_crops_size: 98
train:
  num_gpus: 8
  num_epochs: 100
  checkpoint_interval: 10
  precision: "16-mixed"
  optim:
    optim: "adamw"
    clip_grad_norm: 3.0
4. Run SSL training for domain adaptation: Execute the training using the TAO Launcher to adapt the general NV-DINOv2 model to your domain-specific images:
tao model nvdinov2 train \
  -e /path/to/experiment_spec.yaml \
  results_dir=/output/ssl_training \
  train.num_gpus=8 \
  train.num_epochs=100
5. Perform downstream task fine-tuning: After SSL domain adaptation, fine-tune the model on your specific classification task using a small labeled dataset. This step, often called linear probing, requires only a few hundred labeled samples:
tao model nvdinov2 train \
  -e /path/to/finetune_spec.yaml \
  train.pretrained_model_path=/output/ssl_training/model.pth \
  dataset.train_dataset.images_dir=/data/train_images \
  train.num_epochs=50
6. Run inference: Evaluate your domain-adapted model on test images:
tao model nvdinov2 inference \
  -e /path/to/experiment_spec.yaml \
  inference.checkpoint=/output/ssl_training/model.pth \
  inference.gpu_ids=[0] \
  inference.batch_size=32
7. Export to ONNX for deployment: Export your trained model to ONNX format for production deployment:
tao model nvdinov2 export \
  -e /path/to/experiment_spec.yaml \
  export.checkpoint=/output/ssl_training/model.pth \
  export.onnx_file=/output/nvdinov2_domain_adapted.onnx \
  export.opset_version=12 \
  export.batch_size=-1
The exported ONNX model can be deployed with NVIDIA TensorRT for optimized inference or integrated into an NVIDIA DeepStream pipeline for real-time visual inspection.
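As a quick sanity check before full deployment, the exported model can also be run with ONNX Runtime. The sketch below assumes ImageNet-style preprocessing at 224x224, which must be adjusted to match the preprocessing used during training:
# Minimal sketch: run the exported ONNX model on a single test image.
import numpy as np
import onnxruntime as ort
from PIL import Image

session = ort.InferenceSession(
    "/output/nvdinov2_domain_adapted.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name

# Assumed preprocessing: resize, scale to [0, 1], ImageNet normalization, NCHW layout
img = Image.open("/data/test_images/sample.jpg").convert("RGB").resize((224, 224))
x = np.asarray(img, dtype=np.float32) / 255.0
x = (x - np.array([0.485, 0.456, 0.406])) / np.array([0.229, 0.224, 0.225])
x = x.transpose(2, 0, 1)[np.newaxis].astype(np.float32)

outputs = session.run(None, {input_name: x})
print(outputs[0].shape)   # embedding or class scores, depending on the exported head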
Results: NVIDIA TAO was used to fine-tune NV-DINOv2 with SSL for PCB inspection. Using a dataset of roughly 1 million unlabeled images for industrial domain adaptation and 600 training and 400 test samples for downstream task fine-tuning, defect detection accuracy jumped from 93.84% with the general model to 98.51%. By eliminating the need for labeling and frequent retraining, NV-DINOv2 streamlines the deployment of defect inspection solutions in fast-moving fab environments.
Paving the way to a smart fab
These applications of vision models deliver immediate accuracy gains and lay the foundation for agentic AI systems in the fab. By combining accelerated computing with generative AI, NVIDIA and leading foundries are introducing new ADC workflows that have the potential to redefine yield improvement and process control in advanced manufacturing.
By streamlining defect analysis across the semiconductor production flow, generative AI significantly reduces model deployment time. Its few-shot learning capabilities simplify ongoing model maintenance, improve robustness, and make it easy to fine-tune models for different fab environments.
With fabs generating millions of high-resolution images each day from a wide range of inspection tools, automated ADC systems are expected to further improve classification accuracy, reduce human workload, and elevate overall productivity.
Beyond defect inspection, semiconductor manufacturers are starting to adopt video analytics AI agents built using the NVIDIA Blueprint for Video Search and Summarization (VSS). These agents help monitor plant operations, enhance employee safety, and improve compliance with PPE and safety protocols across manufacturing sites.
Next steps
To learn more, try NV-DINOv2 and state-of-the-art NVIDIA VLMs like Cosmos Reason. For technical questions, please visit the forum.
Watch the SEMICON West keynote from Tim Costa, General Manager of Industrial and Computational Engineering at NVIDIA, and attend sessions at the show, which runs through December 19.
Stay up to date by subscribing to our newsletter and following NVIDIA AI on LinkedIn, Instagram, X, and Facebook. Explore our YouTube channel, and join the NVIDIA Developer vision AI forum.
