PaliGemma 2 Mix – New Instruction Vision Language Models by Google



Last December, Google released PaliGemma 2: a brand new family of pre-trained (pt) PaliGemma vision language models (VLMs) based on SigLIP and Gemma 2. The models are available in three different sizes (3B, 10B, 28B) and three different resolutions (224×224, 448×448, 896×896).

Today, Google is releasing PaliGemma 2 mix: models fine-tuned on a mix of vision language tasks, including OCR, long and short captioning, and more.

PaliGemma 2 pretrained (pt) variants are great vision language models to transfer to a given task at hand. All pt checkpoints are meant to be fine-tuned on a downstream task and were released for that purpose.

PaliGemma 2 Architecture

The mix models give a quick idea of the performance one would get when fine-tuning the pre-trained checkpoints on a downstream task. The main purpose of the PaliGemma model family is to provide pretrained models that transfer better to a downstream task, rather than a versatile chat model. Mix models give a good signal of how pt models perform when fine-tuned on a mixture of academic datasets.

You can read more about PaliGemma 2 in this blog post.

You can find all the mix models and the demo in this collection.






PaliGemma 2 Mix Models

PaliGemma 2 mix models can accomplish a wide variety of tasks. We can categorize them according to their subtasks as follows:

  • General vision-language tasks: visual question answering, referring to images
  • Document understanding: visual question answering on infographics, charts, and diagram understanding
  • Text recognition in images: text detection, captioning images that contain text, visual question answering on images with text
  • Localization-related tasks: object detection, image segmentation

Note that this list of subtasks is non-exhaustive; you can find the complete list of tasks in the PaliGemma 2 paper.

When prompting PaliGemma 2 mix models, we can use open-ended prompts. In the previous iteration of PaliGemma pretrained models, we needed to add a task prefix to the prompt depending on the task we wanted to accomplish in a given language. This still works, but open-ended prompts yield better performance. Prompts with a task prefix look like the following:

  • “caption {lang}”: Nice, COCO-like short captions
  • “describe {lang}”: Longer, more descriptive captions
  • “ocr”: Optical character recognition
  • “answer {lang} {question}”: Question answering about the image contents
  • “question {lang} {answer}”: Question generation for a given answer

The only two tasks that work exclusively with task prefixes are object detection and image segmentation. The prompts look like the following; a sketch for parsing the detection output follows the list.

  • “detect {object description}”: Locate the listed objects in an image and return their bounding boxes
  • “segment {object description}; {object description}”: Locate the area occupied by an object in an image to create an image segmentation for that object
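
Below is a minimal sketch (not from the original post) of how the raw “detect” output can be parsed. PaliGemma encodes each bounding box as four <locXXXX> tokens with values in 0–1023, normalized over the image, in the order y_min, x_min, y_max, x_max, followed by the object label; multiple detections are separated by semicolons. The helper name parse_detection is an assumption for illustration.

import re

def parse_detection(output: str, image_width: int, image_height: int):
    # Each detection is four <locXXXX> tokens (y_min, x_min, y_max, x_max,
    # on a 1024-bin grid) followed by the object label.
    pattern = r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
    boxes = []
    for y1, x1, y2, x2, label in re.findall(pattern, output):
        boxes.append({
            "label": label.strip(),
            # Rescale from the 1024-bin grid to pixel coordinates.
            "box": (
                int(x1) / 1024 * image_width,
                int(y1) / 1024 * image_height,
                int(x2) / 1024 * image_width,
                int(y2) / 1024 * image_height,
            ),
        })
    return boxes

# Example: a single detection for the prompt "detect bird on a stick".
print(parse_detection("<loc0256><loc0128><loc0768><loc0896> bird on a stick", 448, 448))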

If you want to get started right away, feel free to jump to this section of the blog, or try the demo.



Comparing PaliGemma 2 Mix Variants

In this section, we will review the aforementioned capabilities, see how PaliGemma 2 mix performs on them, and compare different variants with different sizes and resolutions on a few of the tasks. Here, we test the models on a few in-the-wild examples.



General Vision-Language Tasks

Input Prompt: How many candies are there?
3B/448 Response: 5 ✅
10B/448 Response: 5 ✅

Input Prompt: Describe the image in great detail
3B/448 Response: A group of people walk down a street, their backs to the camera. The street is lined with a tall green leafy tree, a palm tree, and a white fence along the sidewalk. A large pink building with a large round roof and a large pink building with a dome roof are visible in the distance. A person wearing a black shirt and a gray bag is walking in front of the white motorcycle parked near the fence. A person wearing a red shirt and a black backpack is standing next to the food cart on the sidewalk. The food cart is parked on the side of the street. A person wearing a blue shirt has a white bag on his back and is walking on the sidewalk. The street light is black. The sky is blue. ✅
10B/448 Response: A group of people walk down a street, their backs to the camera. A tall building with a sharp roof and a large pink and white building dominate the skyline. A woman in a black dress and a person in black pants walk side by side, while a person in a red shirt and a person in a blue shirt walk behind them. A motorcycle is parked next to the fence, and a scooter is parked on the sidewalk. The person is walking, and the motorcycle is being driven by a person in a white shirt. The person is wearing a red shirt and black pants, and the person in the blue shirt is riding a scooter. The sign on the pole is visible, and the person is wearing a backpack. ✅



Document Understanding

Input Prompt: For resolution-sensitive tasks, which variant is best?
3B/448 Response: 448px but resized to 224px first
10B/448 Response: 448px ✅

Input Prompt: What is the targeted emission rate for France for 2023?
3B/448 Response: 20 ✅
10B/448 Response: 20 ✅



Localization Tasks

We have evaluated PaliGemma 2 mix variants on their localization-related capabilities. Given a prompt like “detect {object description}; {another object description}” with different objects of interest, PaliGemma can detect each of the listed objects. The object description is not limited to short classes like “bird”; it can also be a phrase like “bird on a stick”.

Below, you can find detection and segmentation outputs of different variants at a fixed resolution of 448×448. We zoom in on the object of interest for visualization purposes.

Segmentation with PaliGemma 2 mix

Detection with PaliGemma 2 mix



Text Recognition in Images

Input Prompt: When is this ticket dated and how much did it cost?
3B/448 Response: 26-05-2023 21:52 17.00 ✅
10B/448 Response: 26-05-2023 17.00 ✅

Input Prompt: Read text
3B/448 Response: FRIDAY, DEC 20th\nNEW OFFICE PARTY\nCOCKTAIL MENU -\nOFFICE MARTINI\nvodka fraise des bois – jus de framboise – liqueur de fleur de sureau – fleur wild strawberry vodka – raspberry puree – elderflower liquor – flower\nDIFFUSERS SUNRISE\ntequila – mandarine impériale – jus d’orange sanguine – cointreau – cherry bitter tequila – tangerine liquor – blood orange juice – cointreau – cherry bitter\ngin infused à la mangue rôtie – citronnelle, kiwi vert & jaune – citron – poivre blanc roasted mango infused gin – lemongrass – green & yellow kiwi, lemon – white pepper\nTRANSFORMERS TWIST\npâte crème de cerise – caramel jamplémousse – bananas\nPERUVIAN PEFT\npêches – cherry liquor – grapefruit cordial – pineapple ✅
10B/448 Response: FRIDAY, DEC 20th NEW OFFICE PARTY COCKTAIL MENU – OFFICE MARTINI vodka fraise des bois – jus de framboise – liqueur de fleur de bureau – fleur wild strawberry vodka – raspberry puree – elderflower liqueur – flower DIFFUSERS SUN-HISE tequila – mandarine impériale – jus d’orange sanguine – cointreau – cherry bitter tequila – tangerine liquor – blood orange juice – cointreau – cherry bitter TRANSFORMERS TWIST gin infused à la mangue rôtie – citron vert & jaune – citron – poivre blanc roasted mango infused gin – lemongrass – green & yellow kiwi lemon – white pepper PERUVIAN PEFT piéce – eau de cèdre – eau de pamplemousse – ananas piece – cherry liquor – grapefruit vodka – pineapple ✅



Inference and Fine-tuning using Transformers

You can use PaliGemma 2 mix models with transformers as follows.

from transformers import (
    PaliGemmaProcessor,
    PaliGemmaForConditionalGeneration,
)
from transformers.image_utils import load_image
import torch

model_id = "google/paligemma2-10b-mix-224"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = load_image(url)


model = PaliGemmaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto").eval()
processor = PaliGemmaProcessor.from_pretrained(model_id)


# Task-prefix prompt; mix models also accept open-ended prompts.
prompt = "describe en"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]


with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    # Strip the prompt tokens and decode only the newly generated text.
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
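
Since mix models also accept task prefixes, the same pipeline can run localization. The snippet below is a hypothetical follow-up (not from the original post) that reuses the model, processor, and image from above with a “detect” prompt; the raw output contains <locXXXX> tokens that can be parsed with a helper like the parse_detection sketch shown earlier.

prompt = "detect car"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=50, do_sample=False)
    # The decoded string encodes bounding boxes as <locXXXX> tokens.
    print(processor.decode(generation[0][input_len:], skip_special_tokens=True))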

We have an in-depth tutorial on fine-tuning PaliGemma 2. The same notebook can be used to fine-tune the mix checkpoints as well.
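
For a rough idea of what a single training step looks like, here is a minimal sketch, assuming one (image, prompt, target) example; the notebook above covers the full recipe. The target caption here is an assumption for illustration. The processor's suffix argument supplies the target text, from which labels are built with the prompt tokens masked out.

from torch.optim import AdamW

model.train()
optimizer = AdamW(model.parameters(), lr=2e-5)

# "suffix" is the target text; the loss is computed only on its tokens.
inputs = processor(
    text="caption en",
    images=image,
    suffix="A car parked on the street.",
    return_tensors="pt",
).to(torch.bfloat16).to(model.device)

loss = model(**inputs).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()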



Demo

We are releasing a demo for the 10B model at 448×448 resolution. You can play with it below, or head to the app at this link.



Read More

Read and learn more about PaliGemma models below.



Acknowledgments

We would like to thank Sayak Paul and Vaibhav Srivastav for their review of this blog post. We thank the Google team for releasing this amazing, and open, model family.

Big thanks to Pablo Montalvo for integrating the model into transformers, and to Lysandre, Raushan, Arthur, Yih-Dar and the rest of the team for reviewing, testing, and merging it in no time.


