nanoVLM is the simplest way to get started with training your own Vision Language Model (VLM) using pure PyTorch. It is a lightweight toolkit that lets you launch a VLM training run on a free-tier Colab notebook.

We were inspired by Andrej Karpathy’s nanoGPT, and provide a similar project for the vision domain.

At its core, nanoVLM is a toolkit that helps you build and train a model that can understand both images and text, and then generate text based on them. The beauty of nanoVLM lies in its simplicity. The entire codebase is intentionally kept minimal and readable, making it perfect for beginners or anyone who wants to peek under the hood of VLMs without getting overwhelmed.

In this blog post, we cover the core ideas behind the project and provide a simple way to interact with the repository. We not only go into the details of the project but also encapsulate all of it so that you can get started quickly.
TL;DR
You can start training a Vision Language Model using our nanoVLM toolkit by following these steps:
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
python train.py
Here’s a Colab notebook that will help you launch a training run with no local setup required!
What’s a Vision Language Model?
As the name suggests, a Vision Language Model (VLM) is a multi-modal model that processes two modalities: vision and text. These models typically take images and/or text as input and generate text as output.

Generating text (output) conditioned on an understanding of images and text (inputs) is a powerful paradigm. It enables a wide range of applications, from image captioning and object detection to answering questions about visual content (as shown in the table below). One thing to note is that nanoVLM focuses only on Visual Question Answering as the training objective.
*(Input image: two cats lying on a bed with remotes near them)*

| Prompt | Output | Task |
|---|---|---|
| Caption the image | Two cats lying down on a bed with remotes near them | Captioning |
| Detect the objects in the image | *(bounding boxes)* | Object Detection |
| Segment the objects in the image | *(segmentation masks)* | Semantic Segmentation |
| How many cats are in the image? | 2 | Visual Question Answering |
If you are interested in learning more about VLMs, we strongly recommend reading our latest blog on the topic: Vision Language Models (Better, Faster, Stronger)
Working with the repository
“Talk is cheap. Show me the code.” – Linus Torvalds
In this section, we’ll guide you through the codebase. It’s helpful to keep a
tab open for reference as you follow along.
Below is the folder structure of our repository. We have removed helper files for brevity.
.
├── data
│ ├── collators.py
│ ├── datasets.py
│ └── processors.py
├── generate.py
├── models
│ ├── config.py
│ ├── language_model.py
│ ├── modality_projector.py
│ ├── utils.py
│ ├── vision_language_model.py
│ └── vision_transformer.py
└── train.py
Architecture
.
├── data
│ └── ...
├── models
│ └── ...
└── train.py
We model nanoVLM after two well-known and widely used architectures. Our vision backbone
(models/vision_transformer.py) is the standard vision transformer, more specifically Google’s
SigLIP vision encoder. Our language
backbone follows the Llama 3 architecture.
The vision and text modalities are aligned using a Modality Projection module. This module takes the
image embeddings produced by the vision backbone as input, and transforms them into embeddings
compatible with the text embeddings from the embedding layer of the language model. These embeddings
are then concatenated and fed into the language decoder. The Modality Projection module consists of a
pixel shuffle operation followed by a linear layer.
Pixel shuffle reduces the number of image tokens, which helps
reduce computational cost and speeds up training, especially for transformer-based language decoders,
which are sensitive to input length. The figure below demonstrates the concept.
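To make this concrete, here is a minimal sketch of a pixel-shuffle-plus-linear projector. The dimensions, attribute names, and shuffle factor below are illustrative and not necessarily identical to nanoVLM’s models/modality_projector.py:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Pixel shuffle (token grouping) followed by a linear projection."""

    def __init__(self, vit_dim: int, lm_dim: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Each output token carries scale**2 neighbouring patch embeddings.
        self.proj = nn.Linear(vit_dim * scale**2, lm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vit_dim), with num_patches = side * side
        b, n, d = x.shape
        side, s = int(n**0.5), self.scale
        # Group each s x s block of neighbouring patches into a single token.
        x = x.view(b, side // s, s, side // s, s, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // s) ** 2, s * s * d)
        # Project the grouped tokens to the language model's embedding size.
        return self.proj(x)

# Example: 196 SigLIP patch embeddings -> 49 tokens at an illustrative LM width.
out = ModalityProjector(vit_dim=768, lm_dim=576)(torch.randn(1, 196, 768))
print(out.shape)  # torch.Size([1, 49, 576])
```

In this sketch, a shuffle factor of 2 turns the 196 patch tokens from a 224×224 image (patch size 16) into 49 tokens before they reach the language decoder.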
All the files are very lightweight and well documented. We highly encourage you to check them out
individually to get a better understanding of the implementation details (models/xxx.py).
While training, we use the following pre-trained backbone weights:

- Vision backbone: google/siglip-base-patch16-224
- Language backbone: HuggingFaceTB/SmolLM2-135M
One could also swap out the backbones with other variants of SigLIP/SigLIP 2 (for the vision backbone) and SmolLM2 (for the language backbone).
Train your own VLM
Now that we are familiar with the architecture, let’s shift gears and discuss how to train your own Vision Language Model using train.py.
.
├── data
│ └── ...
├── models
│ └── ...
└── train.py
You can kick off training with:
python train.py
This script is your one-stop shop for the entire training pipeline, including:
- Dataset loading and preprocessing
- Model initialization
- Optimization and logging
Configuration
Before anything else, the script loads two configuration classes from models/config.py:

- TrainConfig: Configuration parameters useful for training, like learning rates, checkpoint paths, etc.
- VLMConfig: The configuration parameters used to initialize the VLM, like hidden dimensions, number of attention heads, etc.
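As a rough sketch (the field names and default values here are illustrative, not the exact contents of models/config.py), the two classes look something like this:

```python
from dataclasses import dataclass

@dataclass
class VLMConfig:
    # Which backbones to build, and the image size the vision encoder expects
    vit_model_type: str = "google/siglip-base-patch16-224"
    lm_model_type: str = "HuggingFaceTB/SmolLM2-135M"
    lm_tokenizer: str = "HuggingFaceTB/SmolLM2-135M"
    vit_img_size: int = 224

@dataclass
class TrainConfig:
    # Optimization and bookkeeping knobs
    lr_mp: float = 2e-3          # LR for the freshly initialized modality projector
    lr_backbones: float = 1e-4   # LR for the pre-trained backbones
    batch_size: int = 32
    epochs: int = 1
    eval_interval: int = 250     # evaluate (and maybe checkpoint) every N steps
    log_wandb: bool = True
```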
Data Loading
At the heart of the data pipeline is the get_dataloaders function. It:

- Loads datasets via Hugging Face’s load_dataset API.
- Combines and shuffles multiple datasets (if provided).
- Applies a train/val split via indexing.
- Wraps them in custom datasets (VQADataset, MMStarDataset) and collators (VQACollator, MMStarCollator).

A helpful flag here is data_cutoff_idx, useful for debugging on small subsets.
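For intuition, the loading-and-splitting part of that flow looks roughly like the sketch below; the dataset subsets are illustrative, and the real get_dataloaders additionally wraps the splits in the datasets and collators listed above:

```python
from datasets import concatenate_datasets, load_dataset

# Load a few VQA-style subsets (names are illustrative), combine and shuffle them.
subsets = ["ai2d", "vqav2"]
ds = concatenate_datasets(
    [load_dataset("HuggingFaceM4/the_cauldron", s, split="train") for s in subsets]
).shuffle(seed=42)

# Optional cutoff for quick debugging runs, mirroring data_cutoff_idx.
data_cutoff_idx = 1_000
ds = ds.select(range(min(data_cutoff_idx, len(ds))))

# Simple train/val split via indexing.
n_train = int(0.98 * len(ds))
train_ds = ds.select(range(n_train))
val_ds = ds.select(range(n_train, len(ds)))
```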
Model Initialization
The model is built via the VisionLanguageModel class. If you’re resuming from a checkpoint, it’s as simple as:
from models.vision_language_model import VisionLanguageModel
model = VisionLanguageModel.from_pretrained(model_path)
Otherwise, you get a freshly initialized model with optionally preloaded backbones for both vision and language.
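For a fresh start, the idea looks roughly like this; the load_backbone flag name is our assumption for how the optional backbone preloading is exposed:

```python
from models.config import VLMConfig
from models.vision_language_model import VisionLanguageModel

cfg = VLMConfig()
# Build a fresh model; load_backbone=True would pull in the pre-trained
# SigLIP and SmolLM2 weights (flag name assumed for illustration).
model = VisionLanguageModel(cfg, load_backbone=True)
```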
Optimizer Setup: Two LRs
Since the modality projector (MP) is freshly initialized while the backbones are pre-trained, the
optimizer is split into two parameter groups, each with its own learning rate:
- A higher LR for the MP
- A smaller LR for the encoder/decoder stack

This balance ensures the MP learns quickly while preserving knowledge in the vision and language backbones.
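In code, the split looks roughly like the sketch below; the attribute names (model.MP, model.vision_encoder, model.decoder) and the learning-rate values are illustrative:

```python
import torch

# `model` is the VisionLanguageModel built above; two parameter groups, two LRs.
param_groups = [
    # Higher LR: the modality projector is trained from scratch.
    {"params": model.MP.parameters(), "lr": 2e-3},
    # Lower LR: gently fine-tune the pre-trained vision encoder and language decoder.
    {"params": list(model.vision_encoder.parameters()) + list(model.decoder.parameters()),
     "lr": 1e-4},
]
optimizer = torch.optim.AdamW(param_groups)
```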
Training Loop
This part is fairly standard but thoughtfully structured:
- Mixed precision is used with torch.autocast to improve performance.
- A cosine learning rate schedule with linear warmup is implemented via get_lr.
- Token throughput (tokens/sec) is logged per batch for performance monitoring.
Every 250 steps (configurable), the model is evaluated on the validation and MMStar test datasets. If accuracy improves, the model is checkpointed.
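A minimal version of such a schedule, in the spirit of get_lr (the actual implementation in train.py may differ), could look like this:

```python
import math

def get_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    # Linear warmup from 0 to base_lr, then cosine decay back towards 0.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Example: update each parameter group every step, scaled by its own base LR
# (optimizer is the two-group AdamW from the previous sketch).
for group, base_lr in zip(optimizer.param_groups, [2e-3, 1e-4]):
    group["lr"] = get_lr(step=1_000, base_lr=base_lr, warmup_steps=500, total_steps=10_000)
```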
Logging & Monitoring
If log_wandb is enabled, training stats like batch_loss, val_loss, accuracy, and tokens_per_second
are logged to Weights & Biases for real-time tracking.
Runs are auto-named using metadata like sample size, batch size, epoch count, learning rates, and the date,
all handled by the helper get_run_name.
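A bare-bones version of that logging could look like the snippet below; the run name and metric values are made up for illustration:

```python
import wandb

# Metric names match the ones above; the run name mimics get_run_name's style.
run = wandb.init(project="nanoVLM", name="nanoVLM_1.7Msamples_bs32_1ep_lr2e-3_0521")
run.log({"batch_loss": 1.23, "val_loss": 1.10, "accuracy": 0.35, "tokens_per_second": 15000})
run.finish()
```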
Push to Hub
To share the trained model on the Hub for others to find and test, first save it:
model.save_pretrained(save_path)
You can then easily push it using:
model.push_to_hub("hub/id")
Run inference on a pre-trained model
Using nanoVLM as the toolkit, we have trained a model and published it to the Hub.
We used google/siglip-base-patch16-224 and HuggingFaceTB/SmolLM2-135M as backbones. The model was
trained for ~6h on a single H100 GPU on ~1.7M samples of the cauldron.
This model isn’t intended to compete with SoTA models, but rather to demystify the components and the training process of VLMs.
.
├── data
│ └── ...
├── generate.py
├── models
│ └── ...
└── ...
Let’s run inference on the trained model using the generate.py script. You can run the generation script with the following command:
python generate.py
This will use the default arguments and run the query “What is this?” on the image assets/image.png.
You can use this script with your own images and prompts like so:
python generate.py --image path/to/image.png --prompt "Your prompt here"
If you want to visualize the heart of the script, it’s just these lines:
# Load the trained model and switch it to inference mode
model = VisionLanguageModel.from_pretrained(source).to(device)
model.eval()

# Tokenizer for the text prompt and processor for the input image
tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

template = f"Question: {args.prompt} Answer:"
encoded = tokenizer.batch_encode_plus([template], return_tensors="pt")
tokens = encoded["input_ids"].to(device)

img = Image.open(args.image).convert("RGB")
img_t = image_processor(img).unsqueeze(0).to(device)

print("\nInput:\n ", args.prompt, "\n\nOutputs:")
for i in range(args.generations):
    # Generate and decode one completion per iteration
    gen = model.generate(tokens, img_t, max_new_tokens=args.max_new_tokens)
    out = tokenizer.batch_decode(gen, skip_special_tokens=True)[0]
    print(f"  >> Generation {i+1}: {out}")
We create the model and set it to eval mode. We initialize the tokenizer, which tokenizes the text prompt,
and the image processor, which is used to process the images. The next step is to process the inputs
and run model.generate to generate the output text. Finally, we decode the output using batch_decode.
| Image | Prompt | Generation |
|---|---|---|
| *(two cats on a bed)* | What is this? | In the image I can see the pink color bed sheet. I can see two cats lying on the bed sheet. |
| *(person doing yoga)* | What’s the girl doing? | Here in the center she is performing yoga |
If you want to run inference on the trained model in a UI, here is the Hugging Face Space for you to interact with the model.
Conclusion
In this blog post, we walked through what VLMs are, explored the architecture choices that power nanoVLM, and unpacked the training and inference workflows in detail.

By keeping the codebase lightweight and readable, nanoVLM aims to serve as both a learning tool and a foundation you can build upon. Whether you’re trying to understand how multi-modal inputs are aligned, or you want to train a VLM on your own dataset, this repository gives you a head start.

If you try it out, build on top of it, or just have questions, we’d love to hear from you. Happy tinkering!