How Vision Language Models Are Trained from “Scratch”

I recently set out to remodel a small text-only language model and give it the ability of vision. This article summarizes my learnings and takes a deeper look at the network architectures behind modern Vision Language Models.

Wait, are you actually going to be “training from scratch”?

Yes… I mean, no.

Research labs in 2026 don’t train multimodal models from “scratch” anymore. It is just too expensive to teach a model vision and (textual) language at the same time! It requires more data, compute, time, and money, and it often yields poorer results.

Instead, labs take an existing pretrained text-only language model and finetune it to gain “vision capabilities”. In theory (and practice), this is far more compute-efficient.

Let’s speak about Vision Language Models!

The standard architecture

Even though it’s less data-intensive, finetuning a text-only LM to start seeing images certainly opens its own can of worms:

  1. How do we embed the image, i.e. convert it into numerical representations that a neural network can understand?
  2. How do we tune the image embeddings to be compatible with text?
  3. How do we adjust the weights of the text model so that it retains its previous world knowledge, but can also generate text from image embeddings?
The image embeddings coming out of the ViT and Q-Former pass through an MLP layer, followed by a series of decoder layers and trainable LoRA adapters.

These modules are:

  1. The Image Backbone: A model that converts raw images into embeddings.
  2. The Adapter Layer: A model that converts the image embeddings into “text-compatible” embeddings. This is the main difficult part – which architectures to use, which loss functions, etc.
  3. The Language Layer: The language model we will train to take the adapted embeddings as input and generate text from them.

Let’s discuss them one by one.

1. The Image Backbone

The goal of the image backbone is simple:

Input: A raw 2D pixel map/image.

Output: A sequence of vector embeddings representing the image.

Normally, we just use an off-the-shelf image model that has been pretrained on a massive corpus of images, generally with self-supervised tasks.

You could use a Convolutional Neural Network (like a ResNet) as an image backbone, but modern state-of-the-art VLMs have almost entirely shifted to ViTs because they scale better with data and are more flexible for multimodal fusion.

Vision Transformers take an image as input, extract patches out of it, and then pass them through bidirectional self-attention layers. This contextualizes each patch embedding with the others to form a contextual sequence of image embeddings.

In most VLM research, there is a clear trend toward keeping backbones static (frozen) to save costs. Also, vision-language training generally needs paired image-text datasets. Since these datasets are almost always much smaller than the ViT’s pretraining dataset, finetuning the backbone often results in overfitting and degraded performance.

By keeping these weights frozen, we are essentially transferring the ownership of vision-language learning to the later parts of the network (i.e. the adapter layer and the text backbone).

In my experiments, I used the ViT-Base model. It takes the input image, splits it into 16×16 patches, and applies self-attention on them to generate an embedding sequence of 197 vectors. Each vector is 768 dimensions long (the embedding size of the ViT).
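The shape arithmetic above is worth seeing once explicitly. A quick sketch (assuming the standard 224×224 input resolution of ViT-Base) of where the 197 and 768 come from:

```python
# How a 224x224 image becomes a sequence of 197 embeddings of size 768
# for ViT-Base with 16x16 patches (input resolution is an assumption).
image_size = 224
patch_size = 16
hidden_dim = 768  # ViT-Base embedding size

patches_per_side = image_size // patch_size   # 14 patches along each axis
num_patches = patches_per_side ** 2           # 14 * 14 = 196 patch tokens
seq_len = num_patches + 1                     # +1 for the [CLS] token -> 197

print(seq_len, hidden_dim)  # 197 768
```

So the "197" is not magic: it is 196 patch tokens plus one [CLS] token.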

2. The Adapter Layer

This is where we will spend the majority of our time. We have already converted images into embeddings, but those embeddings are completely text-unaware.

Vision Transformers are pretrained purely on image pixels, not on captions or any local textual features. The role of the adapter is to ground the pixel-based image embeddings into a (often shorter) sequence of text-grounded image embeddings.

There are many ways to do this, like using CLIP models, but we are going to look at one of the more popular approaches: the Query Former, or Q-Former.

Q-Former

Alright, so what is a Q-Former? The Q-Former, or Query Former, was introduced in the BLIP-2 paper.

From the BLIP-2 Paper

How do I train a Q-Former?

Standard Q-Formers can be trained using any multimodal image-text pair dataset. For instance, you can use the Conceptual Captions dataset, an enormous corpus of images and their corresponding captions. In my project, I took just 50,000 pairs to train the Q-Former.

You can train a Q-Former from scratch, but the BLIP-2 suggestion is to start from a pretrained BERT model. So that’s what we will do.

At a high level, here is our basic game plan:

  • Train a multimodal joint embedding space: a space where text and images “know” each other.
  • Basically, we will input pairs of images and captions, and embed each of them in the same joint space.
  • Images and incompatible captions will be mapped to separate places in this new embedding space, while compatible image-caption pairs will be mapped close to each other.
Layer 1: Self-Attention
Layer 2: Self-Attention & Cross-Attention with the ViT Features

Setting up cross-attention layers

There’s a problem: BERT models are purely text models. They don’t know what an image is.

So, our objective is to first introduce new cross-attention layers to marry the vision embeddings coming out of the ViT with the text embeddings from BERT. Let’s break down step by step how we convert BERT into a Q-Former:

  • Sample an image and text pair from the dataset.
  • Pass the image through the frozen ViT model to convert the image into image embeddings, shaped [197, 768].
  • Initialize “learnable query embeddings”. These are (say 32) vector embeddings that we will use to convert the image embedding sequence into a text-grounded token embedding sequence. Notice that 32 is much smaller than the original ViT embedding sequence length (197).
  • We input the text caption embeddings and the query embeddings into the first BERT layer. The layer applies self-attention on these inputs.

    For now, let’s assume that the query tokens only attend among themselves and the text tokens among themselves (i.e. the query tokens and the text tokens don’t attend to each other).

  • In the 2nd layer of BERT, something INTERESTING happens. The two sets of embeddings go through another self-attention layer like before. But this time, we also use a cross-attention layer to contextualize the query embeddings with the ViT image embeddings we calculated earlier.

    Of course, normal BERT doesn’t have any cross-attention layers, so we introduce these multi-headed cross-attention layers ourselves.

  • Just like this, we alternate between pure self-attention layers (where queries and text independently self-attend among themselves) and cross-attention layers (where the query embeddings attend to the frozen ViT embeddings).
  • In the final layer, we choose a joint embedding training loss, like ITC (Image-Text Contrastive loss), ITM (Image-Text Matching loss), or ITG (Image-Text Generation loss). More on this later.
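The alternating self-/cross-attention scheme described in the steps above can be sketched in PyTorch. This is a minimal toy version, not the actual BLIP-2 implementation: the layer count, class name, and the "cross-attention on every second layer" placement are assumptions for illustration.

```python
# Toy Q-Former sketch: 32 learnable queries self-attend, and on every
# second layer they also cross-attend to the frozen ViT features.
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_layers=4, heads=8):
        super().__init__()
        # learnable query embeddings, optimized jointly with the attention weights
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_layers))
        # cross-attention modules exist only on odd-indexed layers
        self.cross_attn = nn.ModuleDict({
            str(i): nn.MultiheadAttention(dim, heads, batch_first=True)
            for i in range(num_layers) if i % 2 == 1})

    def forward(self, vit_feats):  # vit_feats: [B, 197, 768], kept frozen
        q = self.queries.expand(vit_feats.size(0), -1, -1)
        for i, sa in enumerate(self.self_attn):
            q = q + sa(q, q, q)[0]  # queries attend among themselves
            if str(i) in self.cross_attn:
                # queries attend to the ViT image embeddings
                q = q + self.cross_attn[str(i)](q, vit_feats, vit_feats)[0]
        return q  # [B, 32, 768]: a compact image token sequence

out = TinyQFormer()(torch.randn(2, 197, 768))
print(out.shape)  # torch.Size([2, 32, 768])
```

The key point the sketch shows: whatever the input sequence length (197), the output is always the 32 query tokens, contextualized by the image.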

What does cross-attention do?

It contextualizes the image content with the query embeddings. You can imagine each query is trying to match a particular embedding pattern against the 197 ViT embeddings.

For instance, if a query has a high match with a single image vector, it will capture that feature very prominently. If the query matches a mix of vectors, its output will be an average of those embeddings, and so on.

Remember we trained 32 of these query embeddings, so we are allowing the Q-Former to learn multiple different co-activations within the image embeddings. Due to the nature of the training, these co-activations are encouraged to maximize alignment between image and text.

As we train the Q-Former, both the initial query embeddings and the cross-attention weights will be optimized so we can extract relevant features from the ViT image tokens.

The Q-Former query embeddings are not trying to capture every detail of those 197 embeddings; instead, they learn how to mix them into a compact 32-token sequence.
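The "match and mix" intuition above is just scaled dot-product attention. Here is a tiny NumPy sketch of what a single query does to the 197 ViT vectors (purely illustrative; the variable names and random data are assumptions):

```python
# One query scores all 197 ViT vectors and outputs their softmax-weighted average.
import numpy as np

rng = np.random.default_rng(0)
vit = rng.normal(size=(197, 768))     # stand-in for frozen ViT embeddings
query = rng.normal(size=(768,))       # stand-in for one learned query vector

scores = vit @ query / np.sqrt(768)   # scaled dot-product scores, shape [197]
weights = np.exp(scores - scores.max())
weights /= weights.sum()              # softmax over the 197 patches
pooled = weights @ vit                # the query's output: a mixture of patches

print(pooled.shape)  # (768,)
```

A sharply peaked softmax picks out one patch prominently; a flat one averages many patches, exactly the two regimes described above.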

Note that after the Q-Former is trained, we won’t actually use the text part of the Q-Former for anything. We will simply pass the query embeddings through the Q-Former, alternately running self-attention and cross-attention on them alone.

Loss functions for training Q-Formers

How the Q-Former model is trained is actually closely related to how we attend between query and text tokens within the layers.

  1. Image-Text Contrastive Loss (ITC, our setup)
    For this task, we use a unimodal self-attention mask. Query tokens attend among each other, and text tokens among each other.

    The loss function can be any standard CLIP-like contrastive loss. We align the image and text in the same embedding space. Basically, we take the output of the queries and the output of the text encoder and compute their similarity. If an image and a caption belong together, we want their vectors to be as close as possible.

    This forces the queries to extract a “global” visual representation that matches the general theme of the text without actually looking at the words yet.

  2. Image-Text Matching Loss (ITM)
    This uses a bi-directional self-attention mask. Here, every query token is allowed to see every text token, and every text token can see every query!

    For the loss function, we use a binary classification task where the model has to predict: “Are this image and this text a match, yes or no?”, trained with binary cross-entropy loss.

    Because the modalities are fully mixed, the model can do fine-grained comparisons. The queries can look at specific objects in the image (via the cross-attention) and confirm whether they correspond to specific words in the text. This is far more detailed than the contrastive loss and ensures the 32 tokens capture localized details.

  3. Image-Text Generation Loss (ITG)
    Finally, we have the generative task. For this, we use a multimodal causal mask. The queries can still see each other, but the text tokens are now treated like a sequence. Each text token can see all 32 query tokens, which act as a visual prefix, but only the text tokens that came before it.

    For the loss function, we just train the model to predict the next token in the caption. By forcing the model to literally “write” the description based on the queries, we ensure that those 32 tokens contain every bit of visual information necessary for a language model to understand the scene.

For my project, I just used the simplest: ITC. For a small dataset like the one I was using, this was the easiest method! BLIP-2 recommends using a mix of all of these training objectives. The GitHub repo shared at the end of the article provides the recipe to use any of the above attention schemes.
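As a sketch, a CLIP-style ITC objective over a batch of embeddings might look like the following. Note the simplifying assumption: one pooled vector per image, whereas BLIP-2 actually takes the maximum similarity over the 32 query outputs.

```python
# In-batch symmetric contrastive loss between image and text embeddings.
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: [B, D], one pooled vector per sample
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # [B, B] similarity matrix
    targets = torch.arange(logits.size(0))           # matching pairs on the diagonal
    # symmetric cross-entropy: image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = itc_loss(torch.randn(8, 256), torch.randn(8, 256))
```

Each image's only "positive" is its own caption; every other caption in the batch acts as a negative, which is what pulls matching pairs together and pushes mismatches apart.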

A trained Q-Former model learns to match similar text and image pairs

In the next section, we will do the final step: training the VLM!

3. The Language Layer

Now comes the final step. We will use the ViT and the Q-Former to turn a language model into a vision-language model. I picked one of the smallest instruction-tuned language models: SmolLM2-135M. Thankfully, this part is not as complicated as the Q-Former training.

We have the image embeddings (coming from the ViT and the Q-Former), and we have the text tokens (coming from the SmolLM tokenizer). Let’s see some details.

From the BLIP-2 Paper
  • We sample an image and a caption from our dataset.
  • We randomly pick from a list of simple system prompts, similar to: “.”

    We also pick the user query from a list of prompts, for example: ““

    We tokenize the output captions sampled from the dataset as well.

    These 3 things form the text tokens. We tokenize them all using the SmolLM2 tokenizer, but we are not going to insert them into the LLM just yet; we must process the image first.

  • We pass the image through the frozen ViT, then through the Q-Former (again, note that the text captions are not passed into the Q-Former; only the image pipeline is executed).
  • We introduce a small MLP layer that converts the Q-Former output into new embeddings of the same shape as the LLM’s expected embedding size. As we train, this MLP layer learns to map the Q-Former embeddings into the LLM embedding space.
  • Now we have the text token sequence and the new image embeddings (ViT -> Q-Former -> MLP). We pass the text tokens through the LLM’s native embedding layer and sandwich the text and image embeddings in the following sequence:

    …and forward pass it through the rest of the LLM.

  • Why that specific sequence? Since autoregressive LLMs use causal masking, we will essentially be training the model to generate the output (caption) sequence given the entire prefix (system prompt, user prompt, and the image embeddings).
  • We add LoRA adapters (Low-Rank Adaptation matrices) instead of finetuning the entire LLM. By wrapping our LLM with LoRA, we freeze the original millions of parameters and only train tiny, low-rank matrices injected into the attention layers. This makes the model trainable on consumer hardware while keeping all that pre-existing intelligence intact.
  • And that’s it! We pass these stitched embeddings and labels into the LLM. The model attends to the text instruction and the visual tokens concurrently, and thanks to LoRA, it learns how to update its internal wiring to understand this new visual language. Only the Q-Former layers, the MLP layer, and the LoRA adapters are trained. Everything else is kept frozen.
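Putting the steps above together, the trainable "stitching" can be sketched like this. The module name and MLP design are assumptions, and the LLM hidden size of 576 is assumed to match SmolLM2-135M:

```python
# Sketch of the trainable glue: project Q-Former outputs into the LLM's
# embedding space and sandwich them between prompt and caption embeddings.
import torch
import torch.nn as nn

class VLMStitcher(nn.Module):
    def __init__(self, qformer_dim=768, llm_dim=576):
        super().__init__()
        # trainable projection from Q-Former space into the LLM embedding space
        self.mlp = nn.Sequential(
            nn.Linear(qformer_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim))

    def forward(self, query_out, prompt_emb, caption_emb):
        # query_out: [B, 32, 768] from the frozen ViT -> Q-Former pipeline
        img_tokens = self.mlp(query_out)  # [B, 32, llm_dim]
        # prefix order: prompt tokens, then image tokens, then caption tokens,
        # so causal masking trains the model to produce the caption last
        return torch.cat([prompt_emb, img_tokens, caption_emb], dim=1)

seq = VLMStitcher()(torch.randn(1, 32, 768),
                    torch.randn(1, 10, 576),   # e.g. 10 prompt token embeddings
                    torch.randn(1, 20, 576))   # e.g. 20 caption token embeddings
print(seq.shape)  # torch.Size([1, 62, 576])
```

The concatenated sequence is what gets forward-passed through the LoRA-wrapped LLM; only the MLP, Q-Former, and LoRA weights receive gradients.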
Some results! You can find more results in the YouTube video mentioned at the end of the article!

After training this for just a few hours on a small subset of data, the trained VLM can now see images and generate text about them. Machine learning is so beautiful when it works.

In summary

You can find the full GitHub repository here:

https://github.com/avbiswas/vlm

And watch the YouTube video here:

Let’s summarize all the modules in a Vision Language pipeline:

  • A vision backbone (like the ViT) that takes an image as input and converts it into embeddings
  • An adapter layer (like the Q-Former) that grounds the image with text
  • An LLM that we train to consolidate the text and image embeddings and learn the language of vision

My Patreon:
https://www.patreon.com/NeuralBreakdownwithAVB

My YouTube channel:
https://www.youtube.com/@avb_fj

Follow me on Twitter:
https://x.com/neural_avb

I’m building Paper Breakdown, a place to study research papers:
https://paperbreakdown.com

Read my articles:
https://towardsdatascience.com/writer/neural-avb/
