Welcome PaliGemma 2 – New vision language models by Google




We're excited to welcome Google's all-new vision language models, PaliGemma 2, a new iteration of PaliGemma. Like its predecessor, PaliGemma 2 uses the same powerful SigLIP model for vision, but it upgrades to the latest Gemma 2 for the text decoder part.

PaliGemma 2 comes with new pre-trained (pt) models in sizes of 3B, 10B, and 28B parameters. All of them support various input resolutions: 224x224, 448x448, and 896x896. These combinations provide a lot of flexibility for different use cases, so practitioners can choose the balance they need in the quality / efficiency space. In contrast, the previous PaliGemma was only available in the 3B variant.

The pre-trained models have been designed for easy fine-tuning to downstream tasks. The first PaliGemma was widely adopted by the community for multiple purposes. With the increased flexibility from the additional variants, combined with better pre-training quality, we can't wait to see what the community can do this time.

For instance, Google is also releasing some variants fine-tuned on the DOCCI dataset, demonstrating versatile and robust captioning capabilities that are long, nuanced and detailed. The fine-tuned DOCCI models are available for the 3B and 10B variants, and support an input resolution of 448x448.

This release includes all the open model repositories, transformers integration, fine-tuning scripts, and a demo of a model we fine-tuned ourselves for visual question answering on the VQAv2 dataset.



Table of Contents



Introducing PaliGemma 2

PaliGemma 2 is a new iteration of the PaliGemma vision language model released by Google in May. PaliGemma 2 connects the powerful SigLIP image encoder with the Gemma 2 language model.

PaliGemma 2 architecture

The new models are based on the Gemma 2 2B, 9B, and 27B language models, resulting in the corresponding 3B, 10B, and 28B PaliGemma 2 variants, whose names take into account the additional parameters of the (compact) image encoder. As mentioned above, they support three different resolutions, providing great flexibility for fine-tuning to downstream tasks.
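
If you want to check which backbones sit behind a given checkpoint, you can inspect its configuration with transformers. The snippet below is a minimal sketch: the checkpoint name assumes the pre-trained 3B model at 224x224 resolution, and the comments describe the expected output rather than output reproduced here.

from transformers import AutoConfig

# Inspect the vision and text components of a PaliGemma 2 checkpoint
# (checkpoint name assumed; adjust to the variant you are interested in).
config = AutoConfig.from_pretrained("google/paligemma2-3b-pt-224")
print(config.vision_config.model_type)  # expected: the SigLIP vision encoder
print(config.text_config.model_type)    # expected: the Gemma 2 decoder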

PaliGemma 2 is distributed under the Gemma license, which allows for redistribution, commercial use, fine-tuning and creation of model derivatives.

This release comes with the following checkpoints in bfloat16 precision:

  • 9 pre-trained models: 3B, 10B, and 28B with resolutions of 224x224, 448x448, and 896x896.

  • 2 models fine-tuned on DOCCI: Two models fine-tuned on the DOCCI dataset (image-text caption pairs), available for the 3B and 10B PaliGemma 2 variants with an input resolution of 448x448 (see the naming sketch after this list).
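
As a convenience, the sketch below enumerates the Hub repository names these checkpoints should correspond to, assuming the naming follows the pattern of the checkpoint used later in this post ("google/paligemma2-10b-ft-docci-448"); double-check against the model collection on the Hub.

# Enumerate the assumed repository ids for the released checkpoints.
pt_checkpoints = [
    f"google/paligemma2-{size}-pt-{res}"
    for size in ("3b", "10b", "28b")
    for res in (224, 448, 896)
]
docci_checkpoints = [f"google/paligemma2-{size}-ft-docci-448" for size in ("3b", "10b")]

for repo_id in pt_checkpoints + docci_checkpoints:
    print(repo_id)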



Model Capabilities

As seen with the previous PaliGemma release, the pre-trained (pt) models work great for further fine-tuning on downstream tasks.

The pt models are pre-trained on the following data mixture. The diversity of the pre-training datasets allows fine-tuning on downstream tasks in similar domains to be carried out with comparatively few examples.

  • WebLI: A web-scale multilingual image-text dataset built from the public web. A wide range of WebLI splits is used to acquire versatile model capabilities, such as visual semantic understanding, object localization, visually-situated text understanding, and multilinguality.

  • CC3M-35L: Curated English image-alt_text pairs from webpages (Sharma et al., 2018). To label this dataset, the authors used the Google Cloud Translation API to translate it into 34 additional languages.

  • Visual Question Generation with Question Answering Validation (VQ2A): An improved dataset for question answering. The dataset is translated into the same 34 additional languages, using the Google Cloud Translation API.

  • OpenImages: Detection and object-aware questions and answers (Piergiovanni et al. 2022) generated by handcrafted rules on the OpenImages dataset.

  • WIT: Images and texts collected from Wikipedia (Srinivasan et al., 2021).

The PaliGemma 2 team internally fine-tuned the pt models on a wide variety of visual-language understanding tasks, and they provide benchmarks of these fine-tuned models in the model card and the technical report.

PaliGemma 2, fine-tuned on the DOCCI dataset, can accomplish a wide variety of captioning tasks, including text rendering, capturing spatial relations, and including world knowledge in captions.

You can find below the performance of the DOCCI fine-tuned PaliGemma 2 checkpoints, compared with other models (taken from Table 6 in the technical report).

Model           #par    #char   #sent   NES↓
MiniGPT-4       7B      484     5.6     52.3
mPLUG-Owl2      8B      459     4.4     48.4
InstructBLIP    7B      510     4.0     42.6
LLaVA-1.5       7B      395     4.2     40.6
VILA            7B      871     8.6     28.6
PaliGemma       3B      535     8.9     34.3
PaLI-5B         5B      1065    11.3    32.9
PaliGemma 2     3B      529     7.7     28.4
PaliGemma 2     10B     521     7.5     20.3
  • #par: Number of parameters of the model.
  • #char: Average number of characters in the generated caption.
  • #sent: Average number of sentences.
  • NES: Non-entailment sentences (lower is better), a measure of factual inaccuracies.

Below you can find some model outputs for the DOCCI checkpoint that showcase the versatility of the model.



Demo

For demonstration purposes, we in the Hugging Face team fine-tuned PaliGemma 2 3B with 448x448 input resolution on a small portion of the VQAv2 dataset. We used LoRA fine-tuning and PEFT, as explained later in the fine-tuning section. The demo below showcases the result. Feel free to check the code in the Space to see how it works, or clone it to adapt it to your own fine-tunes.
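
If you train your own adapter along these lines, here is a minimal sketch of how such a LoRA fine-tune could be loaded for inference with PEFT. The base checkpoint and the adapter repository id below are placeholders for illustration; swap in the ones you actually use.

import torch
from peft import PeftModel
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

base_id = "google/paligemma2-3b-pt-448"                # assumed base checkpoint
adapter_id = "your-username/paligemma2-3b-vqav2-lora"  # hypothetical adapter repo

processor = AutoProcessor.from_pretrained(base_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
# Attach the LoRA adapter weights on top of the frozen base model.
model = PeftModel.from_pretrained(model, adapter_id)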



How to Use with Transformers

You can run inference on the PaliGemma 2 models with 🤗 transformers, using the PaliGemmaForConditionalGeneration and AutoProcessor APIs. Please make sure you install transformers version 4.47 or later:

pip install --upgrade transformers

After that, you can run inference as follows. As usual, please make sure to follow the prompt format that was used to train the model for the task you're using:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests

model_id = "google/paligemma2-10b-ft-docci-448"
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
model = model.to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# The DOCCI checkpoints use the "caption en" task prompt for detailed English captions
prompt = "caption en"
image_file = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
raw_image = Image.open(requests.get(image_file, stream=True).raw).convert("RGB")

inputs = processor(prompt, raw_image, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=200)

# Strip the prompt tokens and decode only the generated caption
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))

You can also use the transformers bitsandbytes integration to load the models with quantization. The following example uses 4-bit nf4:

import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map={"":0}
)

We quickly tested performance degradation in the presence of quantization by evaluating a 3B fine-tuned checkpoint on the textvqa dataset, using 224x224 input images. These are the results we got on the 5,000 entries of the validation set:

  • bfloat16, no quantization: 60.04% accuracy.
  • 8-bit: 59.78%.
  • 4-bit, using the configuration from the snippet above: 58.72%.

These are very encouraging figures! Of course, quantization is most interesting for the larger checkpoints; we recommend you always measure results on the domains and tasks you'll be using.
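
For reference, the sketch below shows roughly how such an accuracy number could be computed. It reuses the model and processor objects from the snippets above purely to illustrate the evaluation loop; the dataset id, the "answer en" prompt prefix, the column names, and the simplified VQA-style scoring are all assumptions rather than the exact setup behind the figures reported here.

from datasets import load_dataset

# Assumed dataset id and split with "image", "question" and "answers" columns.
ds = load_dataset("lmms-lab/textvqa", split="validation")

def vqa_score(prediction, answers):
    # Simplified VQA-style accuracy: full credit if at least 3 annotators agree.
    matches = sum(prediction.strip().lower() == a.strip().lower() for a in answers)
    return min(matches / 3.0, 1.0)

total = 0.0
for example in ds:
    prompt = "answer en " + example["question"]  # assumed VQA prompt prefix
    inputs = processor(prompt, example["image"].convert("RGB"), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=20)
    prediction = processor.decode(
        output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
    )
    total += vqa_score(prediction, example["answers"])

print(f"accuracy: {100 * total / len(ds):.2f}%")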



Fine-tuning

If you have previously fine-tuned PaliGemma, the API to fine-tune PaliGemma 2 is the same; you can use your code out of the box. We provide a fine-tuning script and a notebook for you to fine-tune the model, freeze parts of the model, or apply memory-efficient fine-tuning techniques like LoRA or QLoRA.

We have LoRA-fine-tuned a PaliGemma 2 model on half of the VQAv2 validation split for demonstration purposes. This took half an hour on 3 A100s with 80GB of VRAM. The model can be found here, and this is a Gradio demo that showcases it.
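
As a starting point, here is a minimal LoRA fine-tuning sketch using peft and the Trainer API. The base checkpoint, dataset id, column names, target modules, and hyperparameters are illustrative assumptions rather than the exact recipe behind the demo model; the official fine-tuning script and notebook remain the reference.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoProcessor,
    PaliGemmaForConditionalGeneration,
    Trainer,
    TrainingArguments,
)

model_id = "google/paligemma2-3b-pt-448"  # assumed base checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Wrap the attention projections with LoRA adapters; only these are trained.
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Assumed VQAv2-style dataset with "image", "question" and "multiple_choice_answer" columns.
train_ds = load_dataset("HuggingFaceM4/VQAv2", split="validation[:50%]")

def collate_fn(examples):
    # The question goes into the prompt; the short answer becomes the label suffix.
    texts = ["answer en " + ex["question"] for ex in examples]
    labels = [ex["multiple_choice_answer"] for ex in examples]
    images = [ex["image"].convert("RGB") for ex in examples]
    batch = processor(
        text=texts, images=images, suffix=labels,
        return_tensors="pt", padding="longest",
    )
    return batch.to(torch.bfloat16).to(model.device)

args = TrainingArguments(
    output_dir="paligemma2-vqav2-lora",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    learning_rate=2e-5,
    bf16=True,
    remove_unused_columns=False,
    report_to="none",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, data_collator=collate_fn)
trainer.train()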



Conclusion

The new PaliGemma 2 release is even more exciting than the previous one, with various sizes fitting everyone's needs and stronger pre-trained models. We're looking forward to seeing what the community will build!

We thank the Google team for releasing this amazing, and open, model family. Big thanks to Pablo Montalvo for integrating the model into transformers, and to Lysandre, Raushan, Arthur, Yieh-Dar and the rest of the team for reviewing, testing, and merging in no time.



Resources


