New ViT and ALIGN Models From Kakao Brain




Kakao Brain and Hugging Face are excited to release COYO, a new open-source image-text dataset of 700 million pairs, together with two new visual language models trained on it, ViT and ALIGN. This is the first time ever that the ALIGN model has been made public for free, open-source use, and the first release of ViT and ALIGN models that come with their training dataset.

Kakao Brain’s ViT and ALIGN models follow the same architecture and hyperparameters as provided in the original respective Google models, but are trained on the open-source COYO dataset. Google’s ViT and ALIGN models, while trained on huge datasets (ViT trained on 300 million images and ALIGN trained on 1.8 billion image-text pairs, respectively), cannot be replicated because the datasets are not public. This contribution is particularly valuable to researchers who want to reproduce visual language modeling with access to the data as well. More detailed information on the Kakao ViT and ALIGN models can be found here.

This blog will introduce the new COYO dataset, Kakao Brain’s ViT and ALIGN models, and how to use them! Here are the main takeaways:

  • First open-source ALIGN model ever!
  • First open ViT and ALIGN models that have been trained on the open-source COYO dataset
  • Kakao Brain’s ViT and ALIGN models perform on par with the Google versions
  • ViT and ALIGN demos are available on HF! You can play with the ViT and ALIGN demos online with image samples of your own choice!



Performance Comparison

Kakao Brain’s released ViT and ALIGN models perform on par with, and in some cases better than, what Google has reported for its implementations. Kakao Brain’s ALIGN-B7-Base model, while trained on far fewer pairs (700 million vs. 1.8 billion), performs on par with Google’s ALIGN-B7-Base on the image KNN classification task and better on the MS-COCO image-to-text and text-to-image retrieval tasks. Kakao Brain’s ViT-L/16 performs similarly to Google’s ViT-L/16 when evaluated on ImageNet and ImageNet-ReaL at model resolutions 384 and 512. This means the community can use Kakao Brain’s ViT and ALIGN models to replicate Google’s ViT and ALIGN releases, especially when users require access to the training data. We’re excited to see open-source and transparent releases of these models that perform on par with the state of the art!


ViT and ALIGN performance



COYO DATASET


COYO samples

What’s special about these model releases is that the models are trained on the free and accessible COYO dataset. COYO is an image-text dataset of 700 million pairs, similar to Google’s ALIGN 1.8B image-text dataset, which is a collection of “noisy” alt-text and image pairs from webpages, but open-source. COYO-700M and ALIGN 1.8B are “noisy” because minimal filtering was applied. COYO is similar to the other open-source image-text dataset, LAION, but with the following differences. While LAION 2B is a much larger dataset of 2 billion English pairs, compared to COYO’s 700 million pairs, COYO pairs come with more metadata that give users more flexibility and finer-grained control over usage. The following table shows the differences: COYO comes equipped with aesthetic scores for all pairs, more robust watermark scores, and face count data.

| COYO | LAION 2B | ALIGN 1.8B |
|---|---|---|
| Image-text similarity score calculated with CLIP ViT-B/32 and ViT-L/14 models; provided as metadata, but nothing is filtered out so as to avoid possible elimination bias | Image-text similarity score provided with CLIP (ViT-B/32) – only examples above threshold 0.28 | Minimal, frequency-based filtering |
| NSFW filtering on images and text | NSFW filtering on images | Google Cloud API |
| Face recognition (face count) data provided as metadata | No face recognition data | NA |
| 700 million pairs, all English | 2 billion English pairs | 1.8 billion pairs |
| From CC 2020 Oct – 2021 Aug | From CC 2014–2020 | NA |
| Aesthetic score | Aesthetic score (partial) | NA |
| More robust watermark score | Watermark score | NA |
| Hugging Face Hub | Hugging Face Hub | Not made public |
| English | English | English? |



How ViT and ALIGN work

So what do these models do? Let’s briefly discuss how the ViT and ALIGN models work.

ViT — Vision Transformer — is a vision model proposed by Google in 2020 that resembles the text Transformer architecture.
It is a new approach to vision, distinct from the convolutional neural networks (CNNs) that have dominated vision tasks since AlexNet in 2012. It is up to four times more computationally efficient than similarly performing CNNs and is domain agnostic. ViT takes as input an image that is broken up into a sequence of image patches – just as the text Transformer takes as input a sequence of text – and each patch is given a position embedding to learn the image structure. ViT performance is notable in particular for its excellent performance-compute trade-off. While some of Google’s ViT models are open-source, the JFT-300 million image-label pair dataset they were trained on has not been released publicly. Kakao Brain’s ViT was trained on COYO-Labeled-300M, which has been released publicly, and the released model performs similarly on various tasks; its code, model, and training data (COYO-Labeled-300M) are made entirely public for reproducibility and open science.
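To make the patch step concrete, here is a minimal, illustrative sketch (not the actual ViT implementation) of cutting a 384x384 image into 16x16 patches and flattening them into the token sequence the Transformer encoder consumes. In the real model, each flattened patch is then linearly projected and summed with a learned position embedding.

import torch

# A dummy batch with one RGB image at the 384x384 resolution used by ViT-L/16.
image = torch.randn(1, 3, 384, 384)
patch_size = 16

# Extract non-overlapping 16x16 patches along the height and width dimensions.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Rearrange into a sequence of flattened patches: (batch, num_patches, patch_dim).
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 576, 768]): 576 patch "tokens" of dimension 768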


ViT architecture


A Visualization of How ViT Works from Google Blog

Google then introduced ALIGN — a Large-scale Image and Noisy Text Embedding model — in 2021, a visual-language model trained on “noisy” text-image data for various vision and cross-modal tasks such as text-image retrieval. ALIGN has a simple dual-encoder architecture trained on image and text pairs via a contrastive loss function. ALIGN’s “noisy” training corpus is notable for balancing scale and robustness. Previously, visual language representation learning had relied on large-scale datasets with manual labels, which require extensive preprocessing. ALIGN’s corpus uses the image alt-text data, the text that appears when an image fails to load, as the caption to the image — resulting in an inevitably noisy, but much larger (1.8 billion pair) dataset that allows ALIGN to perform at SoTA levels on various tasks. Kakao Brain’s ALIGN is the first open-source version of this model, trained on the COYO dataset, and it performs better than Google’s reported results.
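To give an intuition for the contrastive objective behind this dual-encoder setup, here is a minimal, simplified sketch (not Kakao Brain’s training code; the temperature value is arbitrary): image and text embeddings from matching pairs in a batch are pulled together, while mismatched pairs are pushed apart.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so that the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_embeds @ text_embeds.t() / temperature

    # The matching pair for each image/text sits on the diagonal.
    targets = torch.arange(logits.size(0))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 4 random image/text embedding pairs.
print(contrastive_loss(torch.randn(4, 640), torch.randn(4, 640)))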




ALIGN Model from Google Blog



How to use the COYO dataset

We can conveniently download the COYO dataset with a single line of code using the 🤗 Datasets library. To preview the COYO dataset and learn more about the data curation process and the meta attributes included, head over to the dataset page on the hub or the original Git repository. To get started, let's install the 🤗 Datasets library: pip install datasets, and download it.

>>> from datasets import load_dataset

>>> dataset = load_dataset('kakaobrain/coyo-700m')
>>> dataset

While it is significantly smaller than the LAION dataset, the COYO dataset is still massive with 747M image-text pairs, and it might be infeasible to download the whole dataset to your local machine. In order to download only a subset of the dataset, we can simply pass the streaming=True argument to the load_dataset() method to create an iterable dataset and download data instances as we go.

>>> from datasets import load_dataset

>>> dataset = load_dataset('kakaobrain/coyo-700m', streaming=True)
>>> print(next(iter(dataset['train'])))
{'id': 2680060225205, 'url': 'https://cdn.shopify.com/s/files/1/0286/3900/2698/products/TVN_Huile-olive-infuse-et-s-227x300_e9a90ffd-b6d2-4118-95a1-29a5c7a05a49_800x.jpg?v=1616684087', 'text': 'Olive oil infused with Tuscany herbs', 'width': 227, 'height': 300, 'image_phash': '9f91e133b1924e4e', 'text_length': 36, 'word_count': 6, 'num_tokens_bert': 6, 'num_tokens_gpt': 9, 'num_faces': 0, 'clip_similarity_vitb32': 0.19921875, 'clip_similarity_vitl14': 0.147216796875, 'nsfw_score_opennsfw2': 0.0058441162109375, 'nsfw_score_gantman': 0.018961310386657715, 'watermark_score': 0.11015450954437256, 'aesthetic_score_laion_v2': 4.871710777282715}
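Since each sample exposes this metadata, we can also filter the stream on the fly, for example keeping only pairs with a higher CLIP similarity and no detected faces. Below is a minimal sketch; the threshold values are arbitrary and chosen purely for illustration.

>>> from datasets import load_dataset

>>> dataset = load_dataset('kakaobrain/coyo-700m', streaming=True)
>>> filtered = dataset['train'].filter(
...     lambda example: example['clip_similarity_vitb32'] > 0.3 and example['num_faces'] == 0
... )
>>> for example in filtered.take(2):
...     print(example['text'])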



How to use ViT and ALIGN from the Hub

Let’s go ahead and experiment with the new ViT and ALIGN models. As ALIGN is newly added to 🤗 Transformers, we will install the latest version of the library: pip install -q git+https://github.com/huggingface/transformers.git and get started with ViT for image classification by importing the modules and libraries we will use. Note that the newly added ALIGN model will be a part of the PyPI package in the next release of the library.

import requests
from PIL import Image
import torch
from transformers import ViTImageProcessor, ViTForImageClassification

Next, we will download a random image of two cats and remote controls on a couch from the COCO dataset and preprocess the image to transform it to the input format expected by the model. To do this, we can conveniently use the corresponding preprocessor class (ViTImageProcessor). To initialize the model and the preprocessor, we will use one of the Kakao Brain ViT repos on the hub. Note that initializing the preprocessor from a repository ensures that the preprocessed image is in the expected format required by that specific pretrained model.

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = ViTImageProcessor.from_pretrained('kakaobrain/vit-large-patch16-384')
model = ViTForImageClassification.from_pretrained('kakaobrain/vit-large-patch16-384')

The rest is simple: we preprocess the image and forward it to the model to retrieve the class logits. The Kakao Brain ViT image classification models are trained on ImageNet labels and output logits of shape (batch_size, 1000).


# preprocess image or list of images
inputs = processor(images=image, return_tensors="pt")

# inference
with torch.no_grad():
    outputs = model(**inputs)

# apply SoftMax to the logits to compute the probability of each class
preds = torch.nn.functional.softmax(outputs.logits, dim=-1)

# print the top 5 class predictions and their probabilities
top_class_preds = torch.argsort(preds, descending=True)[0, :5]
for c in top_class_preds:
    print(f"{model.config.id2label[c.item()]} with probability {round(preds[0, c.item()].item(), 4)}")

And we’re done! To make things even easier and shorter, we can also use the convenient image classification pipeline and pass the Kakao Brain ViT repo name as our target model to initialize the pipeline. We can then pass in a URL or a local path to an image, or a Pillow image, and optionally use the top_k argument to return the top k predictions. Let’s go ahead and get the top 5 predictions for our image of cats and remotes.

>>> from transformers import pipeline

>>> classifier = pipeline(task='image-classification', model='kakaobrain/vit-large-patch16-384')
>>> classifier('http://images.cocodataset.org/val2017/000000039769.jpg', top_k=5)
[{'score': 0.8223727941513062, 'label': 'remote control, remote'}, {'score': 0.06580372154712677, 'label': 'tabby, tabby cat'}, {'score': 0.0655883178114891, 'label': 'tiger cat'}, {'score': 0.0388941615819931, 'label': 'Egyptian cat'}, {'score': 0.0011215205304324627, 'label': 'lynx, catamount'}]

If you would like to experiment more with the Kakao Brain ViT model, head over to its Space on the 🤗 Hub.


vit performance

Let’s move on to experimenting with ALIGN, which can be used to retrieve multi-modal embeddings of texts or images or to perform zero-shot image classification. ALIGN’s transformers implementation and usage is similar to CLIP. To get started, we will first download the pretrained model and its processor, which can preprocess both the images and the texts so that they are in the expected format to be fed into the vision and text encoders of ALIGN. Once again, let’s import the modules we will use and initialize the preprocessor and the model.

import requests
from PIL import Image
import torch
from transformers import AlignProcessor, AlignModel


url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AlignProcessor.from_pretrained('kakaobrain/align-base')
model = AlignModel.from_pretrained('kakaobrain/align-base')

We will start with zero-shot image classification first. To do this, we will supply candidate labels (free-form text) and use AlignModel to find out which description better describes the image. We will first preprocess both the image and text inputs and feed the preprocessed input to the AlignModel.

candidate_labels = ['an image of a cat', 'an image of a dog']

inputs = processor(images=image, text=candidate_labels, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# this is the image-text similarity score
logits_per_image = outputs.logits_per_image

# we can take the softmax to get the label probabilities
probs = logits_per_image.softmax(dim=1)
print(probs)

Done, easy as that. To experiment more with the Kakao Brain ALIGN model for zero-shot image classification, simply head over to its demo on the 🤗 Hub. Note that the output of AlignModel includes text_embeds and image_embeds (see the documentation of ALIGN). If we don’t need to compute the per-image and per-text logits for zero-shot classification, we can retrieve the vision and text embeddings using the convenient get_image_features() and get_text_features() methods of the AlignModel class.

text_embeds = model.get_text_features(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    token_type_ids=inputs['token_type_ids'],
)
image_embeds = model.get_image_features(
    pixel_values=inputs['pixel_values'],
)
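As a quick usage sketch (assuming the inputs from the zero-shot example above), we can normalize these embeddings and compute their cosine similarity directly, which, up to ALIGN’s learned temperature scaling, recovers the image-text scores computed earlier.

# normalize so that the dot product becomes a cosine similarity
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)

# similarity of the image to each candidate label
similarity = image_embeds @ text_embeds.t()
print(similarity)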

Alternatively, we can use the stand-alone vision and text encoders of ALIGN to retrieve multi-modal embeddings. These embeddings can then be used to train models for various downstream tasks such as object detection, image segmentation and image captioning. Let’s see how we can retrieve these embeddings using AlignTextModel and AlignVisionModel. Note that we can use the convenient AlignProcessor class to preprocess texts and images separately.

from transformers import AlignTextModel

# initialize the processor and the text encoder
processor = AlignProcessor.from_pretrained('kakaobrain/align-base')
model = AlignTextModel.from_pretrained('kakaobrain/align-base')

# get embeddings of two text queries
inputs = processor(['an image of a cat', 'an image of a dog'], return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# get the last hidden state and the final pooled output
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output

We can also opt to return all hidden states and attention values by setting the output_hidden_states and output_attentions arguments to True during inference.

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, output_attentions=True)

# print what is returned
for key, value in outputs.items():
    print(key)

Let’s do the same with AlignVisionModel and retrieve the multi-modal embedding of an image.

from transformers import AlignVisionModel

# initialize the processor and the vision encoder
processor = AlignProcessor.from_pretrained('kakaobrain/align-base')
model = AlignVisionModel.from_pretrained('kakaobrain/align-base')

# download and preprocess the image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# get the last hidden state and the final pooled output
last_hidden_state = outputs.last_hidden_state
pooled_output = outputs.pooler_output

Similar to ViT, we can use the zero-shot image classification pipeline to make our work even easier. Let’s see how we can use this pipeline to perform image classification in the wild using free-form text candidate labels.

>>> from transformers import pipeline

>>> classifier = pipeline(task='zero-shot-image-classification', model='kakaobrain/align-base')
>>> classifier(
...     'https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png',
...     candidate_labels=['animals', 'humans', 'landscape'],
... )
[{'score': 0.9263709783554077, 'label': 'animals'}, {'score': 0.07163811475038528, 'label': 'humans'}, {'score': 0.0019908479880541563, 'label': 'landscape'}]

>>> classifier(
...    'https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png',
...    candidate_labels=['black and white', 'photorealist', 'painting'],
... )
[{'score': 0.9735308885574341, 'label': 'black and white'}, {'score': 0.025493400171399117, 'label': 'photorealist'}, {'score': 0.0009757201769389212, 'label': 'painting'}]



Conclusion

There have been incredible advances in multi-modal models in recent years, with models such as CLIP and ALIGN unlocking various downstream tasks such as image captioning, zero-shot image classification, and open vocabulary object detection. In this blog, we talked about the latest open-source ViT and ALIGN models contributed to the Hub by Kakao Brain, as well as the new COYO text-image dataset. We also showed how you can use these models to perform various tasks with a few lines of code, either on their own or as a part of 🤗 Transformers pipelines.

That was it! We are continuing to integrate the most impactful computer vision and multi-modal models and would love to hear back from you. To stay up to date with the latest news in computer vision and multi-modal research, you can follow us on Twitter: @adirik, @a_e_roberts, @NielsRogge, @RisingSayak, and @huggingface.




