PaliGemma – Google’s Cutting-Edge Open Vision Language Model




Updated on 23-05-2024: We have since introduced a number of changes to the transformers PaliGemma implementation around fine-tuning, which you can find in this notebook.

PaliGemma is a new family of vision language models from Google. PaliGemma takes an image and text as input and outputs text.

The team at Google has released three types of models: the pretrained (pt) models, the mix models, and the fine-tuned (ft) models, each with different resolutions and available in multiple precisions for convenience.

All models are released in the Hugging Face Hub model repositories with their model cards and licenses, and have transformers integration.



What’s PaliGemma?

PaliGemma (GitHub) is a family of vision-language models with an architecture consisting of SigLIP-So400m as the image encoder and Gemma-2B as the text decoder. SigLIP is a state-of-the-art model that can understand both images and text. Like CLIP, it consists of an image encoder and a text encoder trained jointly. Similarly to PaLI-3, the combined PaliGemma model is pre-trained on image-text data and can then easily be fine-tuned on downstream tasks, such as captioning or referring segmentation. Gemma is a decoder-only model for text generation. Combining the image encoder of SigLIP with Gemma using a linear adapter makes PaliGemma a powerful vision language model.

Architecture

The PaliGemma release comes with three types of models:

  • PT checkpoints: Pretrained models that can be fine-tuned to downstream tasks.
  • Mix checkpoints: PT models fine-tuned to a mixture of tasks. They are suitable for general-purpose inference with free-text prompts, and can be used for research purposes only.
  • FT checkpoints: A set of fine-tuned models, each one specialized on a different academic benchmark. They are available in various resolutions and are intended for research purposes only.

The models are available in three different resolutions (224x224, 448x448, 896x896) and three different precisions (bfloat16, float16, and float32). Each repository contains the checkpoints for a given resolution and task, with three revisions for each of the available precisions. The main branch of each repository contains float32 checkpoints, while the bfloat16 and float16 revisions contain the corresponding precisions. There are separate repositories for models compatible with 🤗 transformers, and with the original JAX implementation.
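For instance, to skip the float32 weights in the main branch and download a half-precision revision directly, something like the following should work; the repository id is just one example (the mix checkpoint used later in this post), and the revision names follow the convention described above.

import torch
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-mix-224",  # main branch stores float32 weights
    revision="bfloat16",            # precision-specific revision described above
    torch_dtype=torch.bfloat16,
)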

As explained in detail further down, the high-resolution models require a lot more memory to run, because the input sequences are much longer. They may help with fine-grained tasks such as OCR, but the quality increase is small for most tasks. The 224 versions are perfectly fine for most purposes.

You can find all the models and Spaces in this collection.



Model Capabilities

PaliGemma is a single-turn vision language model not meant for conversational use, and it works best when fine-tuned to a specific use case.

You can configure which task the model will solve by conditioning it with task prefixes, such as “detect” or “segment”. The pretrained models were trained in this fashion to imbue them with a rich set of capabilities (question answering, captioning, segmentation, etc.). However, they are not designed to be used directly, but to be transferred (by fine-tuning) to specific tasks using a similar prompt structure. For interactive testing, you can use the “mix” family of models, which have been fine-tuned on a mixture of tasks.

The examples below use the mix checkpoints to demonstrate some of the capabilities.



Image Captioning

PaliGemma can caption images when prompted to. You can try various captioning prompts with the mix checkpoints to see how they respond.

Captioning



Visual Question Answering

PaliGemma can answer questions about an image; simply pass your question together with the image to do so.

VQA



Detection

PaliGemma can detect entities in an image using the detect [entity] prompt. It will output the location of the bounding box coordinates in the form of special <loc[value]> tokens, where value is a number that represents a normalized coordinate. Each detection is represented by four location coordinates in the order y_min, x_min, y_max, x_max, followed by the label that was detected in that box. To convert values to coordinates, you first need to divide the numbers by 1024, then multiply y by the image height and x by its width. This will give you the coordinates of the bounding boxes, relative to the original image size.

Detection
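As a rough illustration of the post-processing described above, here is a minimal sketch (not the reference implementation). It assumes detections are emitted as four location tokens spelled like "<loc0123>" followed by a label; the helper name is ours.

import re

def decode_detections(generated_text: str, image_width: int, image_height: int):
    # Four <locXXXX> tokens followed by the detected label
    pattern = r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^<;]+)"
    detections = []
    for y_min, x_min, y_max, x_max, label in re.findall(pattern, generated_text):
        # Divide by 1024 to normalize, then scale to the original image size
        box = (
            int(y_min) / 1024 * image_height,
            int(x_min) / 1024 * image_width,
            int(y_max) / 1024 * image_height,
            int(x_max) / 1024 * image_width,
        )
        detections.append((label.strip(), box))
    return detections

# Example: decode_detections("<loc0256><loc0128><loc0768><loc0896> bee", 640, 480)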



Referring Expression Segmentation

PaliGemma mix checkpoints can also segment entities in an image when given the segment [entity] prompt. This is called referring expression segmentation, because we refer to the entities of interest using natural language descriptions. The output is a sequence of location and segmentation tokens. The location tokens represent a bounding box as described above. The segmentation tokens can be further processed to generate segmentation masks.

Segmentation
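As a loose sketch, and assuming the segmentation codes are emitted as "<segXXX>" tokens alongside the location tokens, the raw output can be split like this. Actually turning the codes into a mask requires the mask decoder from the big_vision reference implementation, which is not reproduced here; the function name is ours.

import re

def split_segmentation_output(generated_text: str):
    loc_codes = [int(v) for v in re.findall(r"<loc(\d{4})>", generated_text)]
    seg_codes = [int(v) for v in re.findall(r"<seg(\d{3})>", generated_text)]
    return loc_codes, seg_codes  # bounding box codes, mask codebook indices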



Document Understanding

PaliGemma mix checkpoints have great document understanding and reasoning capabilities.

ocrqa



Mix Benchmarks

Below you can find the scores for the mix checkpoints.

Model      MMVP Accuracy    POPE Accuracy (random/popular/adversarial)
mix-224    46.00            88.00 / 86.63 / 85.67
mix-448    45.33            89.37 / 88.40 / 87.47



Fine-tuned Checkpoints

In addition to the pretrained and mix models, Google has released models already transferred to various tasks. They correspond to academic benchmarks that can be used by the research community to compare how they perform. Below, you can find a selected few. These models also come in different resolutions. You can check the model card of any model for all metrics.



Demo

As part of this release, we have a demo that wraps the reference implementation in the big_vision repository and provides an easy way to play around with the mix models.

We also have a version of the demo compatible with Transformers, to show how to use the PaliGemma transformers API.



How to Run Inference

To obtain access to the PaliGemma models, you need to accept the Gemma license terms and conditions. If you already have access to other Gemma models in Hugging Face, you’re good to go. Otherwise, please visit any of the PaliGemma models and accept the license if you agree with it. Once you have access, you need to authenticate either through notebook_login or huggingface-cli login. After logging in, you’ll be good to go!
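For reference, authenticating from a notebook looks like this (the CLI alternative is huggingface-cli login in a terminal):

from huggingface_hub import notebook_login

# Opens a widget where you can paste a Hugging Face access token
notebook_login()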

You can also try inference in this notebook right away.



Using Transformers

You can use the PaliGemmaForConditionalGeneration class to run inference with any of the released models. Simply preprocess the prompt and the image with the built-in processor, and then pass the preprocessed inputs for generation.

import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

prompt = "What is on the flower?"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg?download=true"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
inputs = processor(prompt, raw_image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)

# Strip the prompt from the decoded output so only the answer is printed
print(processor.decode(output[0], skip_special_tokens=True)[len(prompt):])

You can also load the model in 4-bit as follows.

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"":0}
)

In addition to 4-bit (or 8-bit) loading, the transformers integration allows you to leverage other tools in the Hugging Face ecosystem.
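For example, 8-bit loading follows the same pattern as the 4-bit snippet above; a minimal sketch, assuming bitsandbytes is installed:

from transformers import BitsAndBytesConfig, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
)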



Detailed Inference Process

If you want to write your own pre-processing or training code, or would like to understand in more detail how PaliGemma works, these are the steps that the input image and text go through.

The input text is tokenized normally. A <bos> token is added at the beginning, and an additional newline token (\n) is appended. This newline token is an essential part of the input prompt the model was trained with, so adding it explicitly ensures it is always there. The tokenized text is also prefixed with a fixed number of <image> tokens. How many? It depends on the input image resolution and the patch size used by the SigLIP model. PaliGemma models are pre-trained on one of three square sizes (224×224, 448×448, or 896×896), and always use a patch size of 14. Therefore, the number of <image> tokens to prepend is 256 for the 224 models (224/14 * 224/14), 1024 for the 448 models, and 4096 for the 896 models.
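As a quick sanity check, those token counts follow directly from the resolution and patch size:

# Number of image tokens prepended for each pretrained resolution (patch size 14)
patch_size = 14
for resolution in (224, 448, 896):
    num_image_tokens = (resolution // patch_size) ** 2
    print(resolution, num_image_tokens)  # 224 -> 256, 448 -> 1024, 896 -> 4096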

Note that larger images result in much longer input sequences, and therefore require a lot more memory to go through the language portion of the model. Keep this in mind when considering what model to use. For finer-grained tasks, such as OCR, larger images may help achieve better results, but the incremental quality is small for the vast majority of tasks. Do test on your tasks before deciding to move to a larger resolution!

This entire “prompt” goes through the text embeddings layer of the language model and generates token embeddings with 2048 dimensions per token.

In parallel with this, the input image is resized, using bicubic resampling, to the required input size (224×224 for the smallest-resolution models). Then it goes through the SigLIP Image Encoder to generate image embeddings with 1152 dimensions per patch. This is where the linear projector comes into play: the image embeddings are projected to obtain representations with 2048 dimensions per patch, the same as the ones obtained from the text tokens. The final image embeddings are then merged with the text embeddings, and this is the final input that is used for autoregressive text generation. Generation works normally in autoregressive mode. It uses full block attention for the complete input (image + <bos> + prompt + \n), and a causal attention mask for the generated text.
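To make the shapes concrete, here is a conceptual sketch of the merge step (not the actual transformers internals; the tensors are random placeholders):

import torch
import torch.nn as nn

num_patches = (224 // 14) ** 2       # 256 image tokens for the 224 model
vision_dim, text_dim = 1152, 2048    # SigLIP patch dim -> Gemma embedding dim

image_embeds = torch.randn(1, num_patches, vision_dim)  # from the SigLIP encoder
text_embeds = torch.randn(1, 12, text_dim)              # <bos> + prompt + "\n" embeddings

projector = nn.Linear(vision_dim, text_dim)             # the linear adapter
projected_image = projector(image_embeds)               # (1, 256, 2048)

# Image embeddings come first, followed by the text embeddings
merged = torch.cat([projected_image, text_embeds], dim=1)  # (1, 268, 2048)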

All of these details are taken care of automatically by the processor and model classes, so inference can be performed using the familiar high-level transformers API shown in the previous examples.



Fine-tuning



Using big_vision

PaliGemma was trained in the big_vision codebase. The same codebase was already used to develop models like BiT, the original ViT, LiT, CapPa, SigLIP, and many more.

The project config folder configs/proj/paligemma/ contains a README.md. The pretrained model can be transferred by running config files in the transfers/ subfolder, and all our transfer results were obtained by running the configs provided therein. If you want to transfer your own model, fork the example config transfers/forkme.py and follow the instructions in the comments to adapt it to your use case.

There is also a Colab notebook, finetune_paligemma.ipynb, that runs a simplified fine-tuning on a free T4 GPU runtime. To fit in the limited host and GPU memory, the code in the Colab only updates the weights in the attention layers (170M params) and uses SGD (instead of Adam).



Using transformers

Fine-tuning PaliGemma is very easy, thanks to transformers. One can also do QLoRA or LoRA fine-tuning. In this example, we will briefly fine-tune the decoder, and then show how to switch to QLoRA fine-tuning.
We will install the latest version of the transformers library.

pip install transformers

Just like in the inference section, we will authenticate to access the model using notebook_login().

from huggingface_hub import notebook_login
notebook_login()

For this example, we will use the VQAv2 dataset and fine-tune the model to answer questions about images. Let’s load the dataset. We will only use the columns question, multiple_choice_answer and image, so let’s remove the rest of the columns. We will also split the dataset.

from datasets import load_dataset 
ds = load_dataset('HuggingFaceM4/VQAv2', split="train") 
cols_remove = ["question_type", "answers", "answer_type", "image_id", "question_id"] 
ds = ds.remove_columns(cols_remove)
ds = ds.train_test_split(test_size=0.1)
train_ds = ds["train"]
val_ds = ds["test"]

We will now load the processor, which contains the image processing and tokenization parts, and preprocess our dataset.

from transformers import PaliGemmaProcessor 
model_id = "google/paligemma-3b-pt-224"
processor = PaliGemmaProcessor.from_pretrained(model_id)

We will create a prompt template to condition PaliGemma to answer visual questions. Since the tokenizer pads the inputs, we need to set the pads in our labels to something other than the pad token in the tokenizer, as well as the image token.

import torch
device = "cuda"

# Token id of the special <image> token, so it can be excluded from the labels
image_token = processor.tokenizer.convert_tokens_to_ids("<image>")
def collate_fn(examples):
  texts = ["answer " + example["question"] for example in examples]
  labels = [example["multiple_choice_answer"] for example in examples]
  images = [example["image"].convert("RGB") for example in examples]
  tokens = processor(text=texts, images=images, suffix=labels,
                     return_tensors="pt", padding="longest")

  tokens = tokens.to(torch.bfloat16).to(device)
  return tokens

You can either load the model directly, or load it in 4-bit for QLoRA. Below you can see how to load the model directly. We will load the model, freeze the image encoder and the projector, and only fine-tune the decoder. If your images are within a particular domain that might not be well represented in the dataset the model was pre-trained on, you may want to skip freezing the image encoder.

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16).to(device)

for param in model.vision_tower.parameters():
    param.requires_grad = False

# Freeze the multimodal projector as well, as described above
for param in model.multi_modal_projector.parameters():
    param.requires_grad = False

If you want to load the model in 4-bit for QLoRA, you can apply the following changes instead.

from transformers import BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
)

lora_config = LoraConfig(
    r=8, 
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0})
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

We will now initialize the TrainingArguments. If you are doing QLoRA fine-tuning, set the optimizer to paged_adamw_8bit instead.

from transformers import TrainingArguments
args = TrainingArguments(
            output_dir="paligemma_vqav2",  # example output directory / Hub repo name
            num_train_epochs=2,
            remove_unused_columns=False,
            per_device_train_batch_size=16,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            learning_rate=2e-5,
            weight_decay=1e-6,
            adam_beta2=0.999,
            logging_steps=100,
            optim="adamw_hf",
            save_strategy="steps",
            save_steps=1000,
            push_to_hub=True,
            save_total_limit=1,
            bf16=True,
            report_to=["tensorboard"],
            dataloader_pin_memory=False
        )

Initialize the Trainer, pass in the datasets, the data collating function and the training arguments, and call train() to start training.

from transformers import Trainer

trainer = Trainer(
        model=model,
        train_dataset=train_ds,
        eval_dataset=val_ds,
        data_collator=collate_fn,
        args=args
        )
trainer.train()



Additional Resources

We would like to thank Omar Sanseviero, Lucas Beyer, Xiaohua Zhai and Matthias Minderer for their thorough reviews of this blog post. We would also like to thank Peter Robicheaux for their help with the fine-tuning changes in transformers.


