nanoVLM is the simplest way to get started with training your own Vision Language Model (VLM) using pure PyTorch. It is a lightweight toolkit that lets you launch a VLM training run on a free-tier Colab notebook.

We were inspired by Andrej Karpathy’s nanoGPT, and provide a similar project for the vision domain.

At its core, nanoVLM is a toolkit that helps you build and train a model that can understand both images and text, and then generate text based on them. The beauty of nanoVLM lies in its simplicity. The entire codebase is intentionally kept minimal and readable, making it perfect for beginners or anyone who wants to peek under the hood of VLMs without getting overwhelmed.

In this blog post, we cover the core ideas behind the project and provide a simple way to interact with the repository. We not only go into the details of the project but also encapsulate all of it so that you can get started quickly.
TL;DR
You can start training a Vision Language Model using our nanoVLM toolkit by following these steps:
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
python train.py
Here’s a Colab notebook that will help you launch a training run with no local setup required!
What’s a Vision Language Model?
As the name suggests, a Vision Language Model (VLM) is a multi-modal model that processes two modalities: vision and text. These models typically take images and/or text as input and generate text as output.

Generating text (output) conditioned on an understanding of images and text (inputs) is a powerful paradigm. It enables a wide range of applications, from image captioning and object detection to answering questions about visual content (as shown in the table below). One thing to note is that nanoVLM focuses only on Visual Question Answering as the training objective.
*(Input image: two cats lying on a bed with remotes near them)*

| Prompt | Output | Task |
|---|---|---|
| Caption the image | Two cats lying down on a bed with remotes near them | Captioning |
| Detect the objects in the image | *(bounding boxes)* | Object Detection |
| Segment the objects in the image | *(segmentation masks)* | Semantic Segmentation |
| How many cats are in the image? | 2 | Visual Question Answering |
If you are interested in learning more about VLMs, we strongly recommend reading our latest blog on the topic: Vision Language Models (Better, Faster, Stronger)
Working with the repository
“Talk is cheap. Show me the code.” – Linus Torvalds
In this section, we’ll guide you through the codebase. It’s helpful to keep a
tab open for reference as you follow along.
Below is the folder structure of our repository. We have removed helper files for brevity.
.
├── data
│ ├── collators.py
│ ├── datasets.py
│ └── processors.py
├── generate.py
├── models
│ ├── config.py
│ ├── language_model.py
│ ├── modality_projector.py
│ ├── utils.py
│ ├── vision_language_model.py
│ └── vision_transformer.py
└── train.py
Architecture
.
├── data
│ └── ...
├── models
│ └── ...
└── train.py
We model nanoVLM after two well-known and widely used architectures. Our vision backbone
(models/vision_transformer.py) is the standard vision transformer, more specifically Google’s
SigLIP vision encoder. Our language
backbone follows the Llama 3 architecture.
The vision and text modalities are aligned using a Modality Projection module. This module takes the
image embeddings produced by the vision backbone as input, and transforms them into embeddings
compatible with the text embeddings from the embedding layer of the language model. These embeddings
are then concatenated and fed into the language decoder. The Modality Projection module consists of a
pixel shuffle operation followed by a linear layer.
Pixel shuffle reduces the number of image tokens, which helps
reduce computational cost and speeds up training, especially for transformer-based language decoders,
which are sensitive to input length. The figure below demonstrates the concept.
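To make this concrete, here is a minimal sketch of a pixel-shuffle-plus-linear projector. The dimensions, attribute names, and shuffle factor below are illustrative and not necessarily identical to nanoVLM’s models/modality_projector.py:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Pixel shuffle (token grouping) followed by a linear projection."""

    def __init__(self, vit_dim: int, lm_dim: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # Each output token carries scale**2 neighbouring patch embeddings.
        self.proj = nn.Linear(vit_dim * scale**2, lm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vit_dim), with num_patches = side * side
        b, n, d = x.shape
        side, s = int(n**0.5), self.scale
        # Group each s x s block of neighbouring patches into a single token.
        x = x.view(b, side // s, s, side // s, s, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (side // s) ** 2, s * s * d)
        # Project the grouped tokens to the language model's embedding size.
        return self.proj(x)

# Example: 196 SigLIP patch embeddings -> 49 tokens at an illustrative LM width.
out = ModalityProjector(vit_dim=768, lm_dim=576)(torch.randn(1, 196, 768))
print(out.shape)  # torch.Size([1, 49, 576])
```

In this sketch, a shuffle factor of 2 turns the 196 patch tokens from a 224×224 image (patch size 16) into 49 tokens before they reach the language decoder.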
All the files are very lightweight and well documented. We highly encourage you to check them out
individually to get a better understanding of the implementation details (models/xxx.py).
While training, we use the following pre-trained backbone weights:

- Vision backbone: google/siglip-base-patch16-224
- Language backbone: HuggingFaceTB/SmolLM2-135M
One could also swap out the backbones with other variants of SigLIP/SigLIP 2 (for the vision backbone) and SmolLM2 (for the language backbone).
Train your own VLM
Now that we are familiar with the architecture, let’s shift gears and discuss how to train your own Vision Language Model using train.py.
.
├── data
│ └── ...
├── models
│ └── ...
└── train.py
You can kick off training with:
python train.py
This script is your one-stop shop for the entire training pipeline, including:
- Dataset loading and preprocessing
- Model initialization
- Optimization and logging
Configuration
Before anything else, the script loads two configuration classes from models/config.py:

- TrainConfig: Configuration parameters useful for training, like learning rates, checkpoint paths, etc.
- VLMConfig: The configuration parameters used to initialize the VLM, like hidden dimensions, number of attention heads, etc.
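As a rough sketch (the field names and default values here are illustrative, not the exact contents of models/config.py), the two classes look something like this:

```python
from dataclasses import dataclass

@dataclass
class VLMConfig:
    # Which backbones to build, and the image size the vision encoder expects
    vit_model_type: str = "google/siglip-base-patch16-224"
    lm_model_type: str = "HuggingFaceTB/SmolLM2-135M"
    lm_tokenizer: str = "HuggingFaceTB/SmolLM2-135M"
    vit_img_size: int = 224

@dataclass
class TrainConfig:
    # Optimization and bookkeeping knobs
    lr_mp: float = 2e-3          # LR for the freshly initialized modality projector
    lr_backbones: float = 1e-4   # LR for the pre-trained backbones
    batch_size: int = 32
    epochs: int = 1
    eval_interval: int = 250     # evaluate (and maybe checkpoint) every N steps
    log_wandb: bool = True
```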
Data Loading
At the heart of the data pipeline is the get_dataloaders function. It:

- Loads datasets via Hugging Face’s load_dataset API.
- Combines and shuffles multiple datasets (if provided).
- Applies a train/val split via indexing.
- Wraps them in custom datasets (VQADataset, MMStarDataset) and collators (VQACollator, MMStarCollator).

A helpful flag here is data_cutoff_idx, useful for debugging on small subsets.
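For intuition, the loading-and-splitting part of that flow looks roughly like the sketch below; the dataset subsets are illustrative, and the real get_dataloaders additionally wraps the splits in the datasets and collators listed above:

```python
from datasets import concatenate_datasets, load_dataset

# Load a few VQA-style subsets (names are illustrative), combine and shuffle them.
subsets = ["ai2d", "vqav2"]
ds = concatenate_datasets(
    [load_dataset("HuggingFaceM4/the_cauldron", s, split="train") for s in subsets]
).shuffle(seed=42)

# Optional cutoff for quick debugging runs, mirroring data_cutoff_idx.
data_cutoff_idx = 1_000
ds = ds.select(range(min(data_cutoff_idx, len(ds))))

# Simple train/val split via indexing.
n_train = int(0.98 * len(ds))
train_ds = ds.select(range(n_train))
val_ds = ds.select(range(n_train, len(ds)))
```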
Model Initialization
The model is built via the VisionLanguageModel class. If you’re resuming from a checkpoint, it’s as simple as:
from models.vision_language_model import VisionLanguageModel
model = VisionLanguageModel.from_pretrained(model_path)
Otherwise, you get a freshly initialized model with optionally preloaded backbones for both vision and language.
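For a fresh start, the idea looks roughly like this; the load_backbone flag name is our assumption for how the optional backbone preloading is exposed:

```python
from models.config import VLMConfig
from models.vision_language_model import VisionLanguageModel

cfg = VLMConfig()
# Build a fresh model; load_backbone=True would pull in the pre-trained
# SigLIP and SmolLM2 weights (flag name assumed for illustration).
model = VisionLanguageModel(cfg, load_backbone=True)
```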
Optimizer Setup: Two LRs
Since the modality projector (MP) is freshly initialized while the backbones are pre-trained, the
optimizer is split into two parameter groups, each with its own learning rate:
- A higher LR for the MP
- A smaller LR for the encoder/decoder stack

This balance ensures the MP learns quickly while preserving knowledge in the vision and language backbones.
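In code, the split looks roughly like the sketch below; the attribute names (model.MP, model.vision_encoder, model.decoder) and the learning-rate values are illustrative:

```python
import torch

# `model` is the VisionLanguageModel built above; two parameter groups, two LRs.
param_groups = [
    # Higher LR: the modality projector is trained from scratch.
    {"params": model.MP.parameters(), "lr": 2e-3},
    # Lower LR: gently fine-tune the pre-trained vision encoder and language decoder.
    {"params": list(model.vision_encoder.parameters()) + list(model.decoder.parameters()),
     "lr": 1e-4},
]
optimizer = torch.optim.AdamW(param_groups)
```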
Training Loop
This part is fairly standard but thoughtfully structured:
- Mixed precision is used with torch.autocast to improve performance.
- A cosine learning rate schedule with linear warmup is implemented via get_lr.
- Token throughput (tokens/sec) is logged per batch for performance monitoring.
Every 250 steps (configurable), the model is evaluated on the validation and MMStar test datasets. If accuracy improves, the model is checkpointed.
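A minimal version of such a schedule, in the spirit of get_lr (the actual implementation in train.py may differ), could look like this:

```python
import math

def get_lr(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    # Linear warmup from 0 to base_lr, then cosine decay back towards 0.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Example: update each parameter group every step, scaled by its own base LR
# (optimizer is the two-group AdamW from the previous sketch).
for group, base_lr in zip(optimizer.param_groups, [2e-3, 1e-4]):
    group["lr"] = get_lr(step=1_000, base_lr=base_lr, warmup_steps=500, total_steps=10_000)
```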
Logging & Monitoring
If log_wandb is enabled, training stats like batch_loss, val_loss, accuracy, and tokens_per_second
are logged to Weights & Biases for real-time tracking.
Runs are auto-named using metadata like sample size, batch size, epoch count, learning rates, and the date,
all handled by the helper get_run_name.
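A bare-bones version of that logging could look like the snippet below; the run name and metric values are made up for illustration:

```python
import wandb

# Metric names match the ones above; the run name mimics get_run_name's style.
run = wandb.init(project="nanoVLM", name="nanoVLM_1.7Msamples_bs32_1ep_lr2e-3_0521")
run.log({"batch_loss": 1.23, "val_loss": 1.10, "accuracy": 0.35, "tokens_per_second": 15000})
run.finish()
```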
Push to Hub
To share the trained model on the Hub for others to find and test, first save it:
model.save_pretrained(save_path)
You can then easily push it using:
model.push_to_hub("hub/id")
Run inference on a pre-trained model
Using nanoVLM as the toolkit, we have trained a model and published it to the Hub.
We used google/siglip-base-patch16-224 and HuggingFaceTB/SmolLM2-135M as backbones. The model was
trained for ~6h on a single H100 GPU on ~1.7M samples of the cauldron.
This model isn’t intended to compete with SoTA models, but rather to demystify the components and the training process of VLMs.
.
├── data
│ └── ...
├── generate.py
├── models
│ └── ...
└── ...
Let’s run inference on the trained model using the generate.py script. You can run the generation script with the following command:
python generate.py
This will use the default arguments and run the query “What is this?” on the image assets/image.png.
You can use this script with your own images and prompts like so:
python generate.py --image path/to/image.png --prompt "Your prompt here"
If you want to visualize the heart of the script, it’s just these lines:
# Load the trained model and switch it to inference mode
model = VisionLanguageModel.from_pretrained(source).to(device)
model.eval()

# Tokenizer for the text prompt and processor for the input image
tokenizer = get_tokenizer(model.cfg.lm_tokenizer)
image_processor = get_image_processor(model.cfg.vit_img_size)

template = f"Question: {args.prompt} Answer:"
encoded = tokenizer.batch_encode_plus([template], return_tensors="pt")
tokens = encoded["input_ids"].to(device)

img = Image.open(args.image).convert("RGB")
img_t = image_processor(img).unsqueeze(0).to(device)

print("\nInput:\n ", args.prompt, "\n\nOutputs:")
for i in range(args.generations):
    # Generate and decode one completion per iteration
    gen = model.generate(tokens, img_t, max_new_tokens=args.max_new_tokens)
    out = tokenizer.batch_decode(gen, skip_special_tokens=True)[0]
    print(f"  >> Generation {i+1}: {out}")
We create the model and set it to eval mode. We initialize the tokenizer, which tokenizes the text prompt,
and the image processor, which is used to process the images. The next step is to process the inputs
and run model.generate to generate the output text. Finally, we decode the output using batch_decode.
| Image | Prompt | Generation |
|---|---|---|
| *(two cats on a bed)* | What is this? | In the image I can see the pink color bed sheet. I can see two cats lying on the bed sheet. |
| *(person doing yoga)* | What’s the girl doing? | Here in the center she is performing yoga |
If you want to run inference on the trained model in a UI, here is the Hugging Face Space for you to interact with the model.
Conclusion
In this blog post, we walked through what VLMs are, explored the architecture choices that power nanoVLM, and unpacked the training and inference workflows in detail.

By keeping the codebase lightweight and readable, nanoVLM aims to serve as both a learning tool and a foundation you can build upon. Whether you’re trying to understand how multi-modal inputs are aligned, or you want to train a VLM on your own dataset, this repository gives you a head start.

If you try it out, build on top of it, or just have questions, we’d love to hear from you. Happy tinkering!