Perceiver IO: a scalable, fully-attentional model that works on any modality

Niels Rogge



TLDR

We have added Perceiver IO to Transformers, the first Transformer-based neural network that works on all kinds of modalities (text, images, audio, video, point clouds, …) and combinations thereof. Take a look at the following Spaces to view some examples:

We also provide several notebooks.

Below, you can find a technical explanation of the model.



Introduction

The Transformer, originally introduced by Vaswani et al. in 2017, caused a revolution in the AI community, initially improving state-of-the-art (SOTA) results in machine translation. In 2018, BERT was released, a Transformer encoder-only model that crushed the benchmarks of natural language processing (NLP), most famously the GLUE benchmark.

Not long after that, AI researchers began to apply the idea of BERT to other domains, such as speech (Wav2Vec2), vision (ViT), and video (ViViT), to name just a few examples.

In all of these domains, state-of-the-art results were improved dramatically, thanks to the combination of this powerful architecture with large-scale pre-training.

However, there is an important limitation to the architecture of the Transformer: due to its self-attention mechanism, it scales very poorly in both compute and memory. In every layer, all inputs are used to produce queries and keys, for which a pairwise dot product is computed. Hence, it is not possible to apply self-attention on high-dimensional data without some form of preprocessing. Wav2Vec2, for instance, solves this by employing a feature encoder to turn a raw waveform into a sequence of time-based features. The Vision Transformer (ViT) divides an image into a sequence of non-overlapping patches, which serve as "tokens". The Video Vision Transformer (ViViT) extracts non-overlapping, spatio-temporal "tubes" from a video, which serve as "tokens". In short, to make the Transformer work on a particular modality, one typically discretizes it into a sequence of tokens.
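
To get a feel for the numbers, here is a back-of-the-envelope calculation with a hypothetical input size (not tied to any particular model): treating a 224x224 image as one input position per pixel, a single self-attention layer already needs to materialize a score matrix of roughly 10 GB.

# hypothetical example: a 224x224 image, one input position per pixel
seq_len = 224 * 224          # 50,176 input positions
bytes_per_float = 4          # float32

# self-attention computes pairwise scores between all positions
score_matrix_bytes = seq_len ** 2 * bytes_per_float
print(f"{score_matrix_bytes / 1e9:.1f} GB")   # ~10.1 GB, per attention head, per layer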



The Perceiver

The Perceiver aims to solve this limitation by employing the self-attention mechanism on a set of latent variables, rather than on the inputs. The inputs (which could be text, image, audio, video) are only used for doing cross-attention with the latents. This has the advantage that the bulk of compute happens in a latent space, where compute is cheap (one typically uses 256 or 512 latents). The resulting architecture has no quadratic dependence on the input size: the Transformer encoder only depends linearly on the input size, while latent attention is independent of it. In a follow-up paper, called Perceiver IO, the authors extend this idea to let the Perceiver also handle arbitrary outputs. The idea is similar: one only uses the outputs for doing cross-attention with the latents. Note that I'll use the terms "Perceiver" and "Perceiver IO" interchangeably to refer to the Perceiver IO model throughout this blog post.
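
As a rough sketch of this idea (with made-up dimensions and deliberately simplified projections, not the actual implementation), the attention score matrices the Perceiver computes look as follows:

import torch
from torch import nn

batch_size, seq_len, d_model = 1, 50176, 768    # e.g. a flattened, embedded input
num_latents, d_latents = 256, 1280              # the latent array

inputs = torch.randn(batch_size, seq_len, d_model)          # preprocessed inputs
latents = torch.randn(batch_size, num_latents, d_latents)   # latent variables

# project both to a shared attention dimension (a simplification of the real model)
to_q = nn.Linear(d_latents, d_latents)
to_k = nn.Linear(d_model, d_latents)

# cross-attention scores: latents attend to the inputs -> grows only linearly with seq_len
cross_scores = to_q(latents) @ to_k(inputs).transpose(1, 2)   # shape (1, 256, 50176)

# latent self-attention scores: independent of seq_len (same projection reused for brevity)
self_scores = to_q(latents) @ to_q(latents).transpose(1, 2)   # shape (1, 256, 256)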

In the following section, we look in a bit more detail at how Perceiver IO actually works by going over its implementation in HuggingFace Transformers, a popular library that initially implemented Transformer-based models for NLP, but is now starting to implement them for other domains as well. In the sections below, we explain in detail, in terms of shapes of tensors, how the Perceiver actually pre- and post-processes modalities of any kind.

All Perceiver variants in HuggingFace Transformers are based on the PerceiverModel class. To initialize a PerceiverModel, one can provide 3 additional instances to the model:

  • a preprocessor
  • a decoder
  • a postprocessor.

Note that each of these is optional. A preprocessor is only required in case one hasn't already embedded the inputs (such as text, image, audio, video) themselves. A decoder is only required in case one wants to decode the output of the Perceiver encoder (i.e. the last hidden states of the latents) into something more useful, such as classification logits or optical flow. A postprocessor is only required in case one wants to turn the output of the decoder into a specific feature (this is only required when doing auto-encoding, as we will see further on). An overview of the architecture is depicted below.

The Perceiver architecture.

In other words, the inputs (which could be any modality, or a combination thereof) are first optionally preprocessed using a preprocessor. Next, the preprocessed inputs perform a cross-attention operation with the latent variables of the Perceiver encoder. In this operation, the latent variables produce queries (Q), while the preprocessed inputs produce keys and values (KV). After this operation, the Perceiver encoder employs a (repeatable) block of self-attention layers to update the embeddings of the latents. The encoder will finally produce a tensor of shape (batch_size, num_latents, d_latents), containing the last hidden states of the latents. Next, there is an optional decoder, which can be used to decode the final hidden states of the latents into something more useful, such as classification logits. This is done by performing a cross-attention operation, in which trainable embeddings are used to produce queries (Q), while the latents are used to produce keys and values (KV). Finally, there is an optional postprocessor, which can be used to postprocess the decoder outputs into specific features.
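
To make this data flow concrete, here is a self-contained toy version built from plain PyTorch modules. It only sketches the shapes involved, with illustrative sizes; it is not the real PerceiverModel:

import torch
from torch import nn

batch_size, seq_len, d_model = 2, 2048, 768       # "preprocessed" inputs, e.g. embedded bytes
num_latents, d_latents, num_labels = 256, 1280, 10

inputs = torch.randn(batch_size, seq_len, d_model)
latents = nn.Parameter(torch.randn(num_latents, d_latents))

# encoder cross-attention: latents produce queries, inputs produce keys and values
encoder_cross_attn = nn.MultiheadAttention(
    d_latents, num_heads=8, kdim=d_model, vdim=d_model, batch_first=True
)
hidden, _ = encoder_cross_attn(latents.expand(batch_size, -1, -1), inputs, inputs)  # (2, 256, 1280)

# (repeatable) block of latent self-attention layers (just one layer here)
latent_self_attn = nn.MultiheadAttention(d_latents, num_heads=8, batch_first=True)
hidden, _ = latent_self_attn(hidden, hidden, hidden)                                # (2, 256, 1280)

# decoder cross-attention: trainable decoder queries attend to the latents
decoder_queries = nn.Parameter(torch.randn(1, num_labels))
decoder_cross_attn = nn.MultiheadAttention(
    num_labels, num_heads=1, kdim=d_latents, vdim=d_latents, batch_first=True
)
logits, _ = decoder_cross_attn(decoder_queries.expand(batch_size, 1, -1), hidden, hidden)  # (2, 1, 10)
logits = logits.squeeze(1)                                                                 # (2, 10)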

Let’s start off by showing how the Perceiver is implemented to work on text.



Perceiver for text

Suppose that one wants to apply the Perceiver to perform text classification. As the memory and time requirements of the Perceiver's self-attention mechanism don't depend on the size of the inputs, one can directly provide raw UTF-8 bytes to the model. This is beneficial, as familiar Transformer-based models (like BERT and RoBERTa) all employ some form of explicit tokenization, such as WordPiece, BPE or SentencePiece, which might be harmful. For a fair comparison to BERT (which uses a sequence length of 512 subword tokens), the authors used input sequences of 2048 bytes. Let us say one also adds a batch dimension; then the inputs to the model are of shape (batch_size, 2048). The inputs contain the byte IDs (similar to the input_ids of BERT) for a single piece of text. One can use PerceiverTokenizer to turn a text into a sequence of byte IDs, padded up to a length of 2048:

from transformers import PerceiverTokenizer

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")

text = "hello world"

inputs = tokenizer(text, padding="max_length", return_tensors="pt").input_ids
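
As a quick sanity check, the resulting tensor should contain a single example of 2048 byte IDs (the padded length is determined by the tokenizer's model_max_length, which should be 2048 for this checkpoint):

print(inputs.shape)   # torch.Size([1, 2048])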

In this case, one provides PerceiverTextPreprocessor as preprocessor to the model, which will take care of embedding the inputs (i.e. turning each byte ID into a corresponding vector), as well as adding absolute position embeddings. As decoder, one provides PerceiverClassificationDecoder to the model (which will turn the last hidden states of the latents into classification logits). No postprocessor is required. In other words, a Perceiver model for text classification (which is called PerceiverForSequenceClassification in HuggingFace Transformers) is implemented as follows:

from torch import nn
from transformers import PerceiverModel
from transformers.models.perceiver.modeling_perceiver import PerceiverTextPreprocessor, PerceiverClassificationDecoder

class PerceiverForSequenceClassification(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.perceiver = PerceiverModel(
            config,
            input_preprocessor=PerceiverTextPreprocessor(config),
            decoder=PerceiverClassificationDecoder(
                config,
                num_channels=config.d_latents,
                trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
                use_query_residual=True,
            ),
        )
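
To try this class out, one can load the pre-trained language checkpoint into it. This is a sketch, assuming the deepmind/language-perceiver checkpoint; the classification decoder is task-specific, so its weights are newly initialized and the logits are only meaningful after fine-tuning:

from transformers import PerceiverTokenizer, PerceiverForSequenceClassification

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForSequenceClassification.from_pretrained("deepmind/language-perceiver")

encoding = tokenizer("hello world", padding="max_length", return_tensors="pt")
outputs = model(inputs=encoding.input_ids, attention_mask=encoding.attention_mask)
print(outputs.logits.shape)   # (1, 2), i.e. (batch_size, num_labels) with the default config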

One can already see here that the decoder is initialized with trainable position encoding arguments. Why is that? Well, let's take a detailed look at how Perceiver IO works. At initialization, PerceiverModel internally defines a set of latent variables, as follows:

import torch
from torch import nn

self.latents = nn.Parameter(torch.randn(config.num_latents, config.d_latents))

In the Perceiver IO paper, one uses 256 latents, and sets the dimensionality of the latents to 1280. If one also adds a batch dimension, the Perceiver has latents of shape (batch_size, 256, 1280). First, the preprocessor (which one provides at initialization) will take care of embedding the UTF-8 byte IDs into embedding vectors. Hence, PerceiverTextPreprocessor will turn the inputs of shape (batch_size, 2048) into a tensor of shape (batch_size, 2048, 768), assuming that each byte ID is turned into a vector of size 768 (this is determined by the d_model attribute of PerceiverConfig).

After this, Perceiver IO applies cross-attention between the latents (which produce queries) of shape (batch_size, 256, 1280) and the preprocessed inputs (which produce keys and values) of shape (batch_size, 2048, 768). The output of this initial cross-attention operation is a tensor that has the same shape as the queries (which are the latents, in this case). In other words, the output of the cross-attention operation is of shape (batch_size, 256, 1280).

Next, a (repeatable) block of self-attention layers is applied to update the representations of the latents. Note that these don't depend on the length of the inputs (i.e. the bytes) one provided, as those were only used during the cross-attention operation. In the Perceiver IO paper, a single block of 26 self-attention layers (each of which has 8 attention heads) was used to update the representations of the latents of the text model. Note that the output after these 26 self-attention layers still has the same shape as what one initially provided as input to the encoder: (batch_size, 256, 1280). These are also called the "last hidden states" of the latents. This is very similar to the "last hidden states" of the tokens one provides to BERT.

Okay, so now one has final hidden states of shape (batch_size, 256, 1280). Great, but one actually wants to turn these into classification logits of shape (batch_size, num_labels). How can we make the Perceiver output these?

This is handled by PerceiverClassificationDecoder. The idea is very similar to what was done when mapping the inputs to the latent space: one uses cross-attention. But now, the latent variables will produce keys and values, and one provides a tensor of whatever shape we'd like; in this case, we'll provide a tensor of shape (batch_size, 1, num_labels) which will act as queries (the authors refer to these as "decoder queries", because they are used in the decoder). This tensor will be randomly initialized at the beginning of training, and trained end-to-end. As one can see, one just provides a dummy sequence length dimension of 1. Note that the output of a QKV attention layer always has the same shape as the shape of the queries; hence the decoder will output a tensor of shape (batch_size, 1, num_labels). The decoder then simply squeezes this tensor to have shape (batch_size, num_labels) and boom, one has classification logits¹.

Great, isn't it? The Perceiver authors also show that it is straightforward to pre-train the Perceiver for masked language modeling, similar to BERT. This model is also available in HuggingFace Transformers, and called PerceiverForMaskedLM. The only difference with PerceiverForSequenceClassification is that it doesn't use PerceiverClassificationDecoder as decoder, but rather PerceiverBasicDecoder, to decode the latents into a tensor of shape (batch_size, 2048, 1280). After this, a language modeling head is added, which turns it into a tensor of shape (batch_size, 2048, vocab_size). The vocabulary size of the Perceiver is only 262, namely the 256 UTF-8 byte IDs, as well as 6 special tokens. By pre-training the Perceiver on English Wikipedia and C4, the authors show that it is possible to achieve an overall score of 81.8 on GLUE after fine-tuning.
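
For example, mask filling with the pre-trained checkpoint can be done as follows. This is a sketch: the byte indices of the masked span depend on the text and on the special tokens the tokenizer prepends, so treat the slice below as illustrative.

from transformers import PerceiverTokenizer, PerceiverForMaskedLM

tokenizer = PerceiverTokenizer.from_pretrained("deepmind/language-perceiver")
model = PerceiverForMaskedLM.from_pretrained("deepmind/language-perceiver")

text = "This is an incomplete sentence where some words are missing."
encoding = tokenizer(text, padding="max_length", return_tensors="pt")

# mask the bytes corresponding to " missing." (illustrative byte positions)
encoding.input_ids[0, 52:61] = tokenizer.mask_token_id

outputs = model(inputs=encoding.input_ids, attention_mask=encoding.attention_mask)
logits = outputs.logits                  # shape (1, 2048, 262)

# greedily decode the masked span
predicted_ids = logits[0, 52:61].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))   # ideally " missing."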



Perceiver for images

Now that we've seen how to apply the Perceiver to perform text classification, it is straightforward to apply the Perceiver to image classification. The only difference is that we'll provide a different preprocessor to the model, which will embed the image inputs. The Perceiver authors actually tried out 3 different ways of preprocessing:

  • flattening the pixel values, applying a convolutional layer with kernel size 1 and adding learned absolute 1D position embeddings.
  • flattening the pixel values and adding fixed 2D Fourier position embeddings.
  • applying a 2D convolutional + maxpool layer and adding fixed 2D Fourier position embeddings.

Each of these is implemented in the Transformers library, and called PerceiverForImageClassificationLearned, PerceiverForImageClassificationFourier and PerceiverForImageClassificationConvProcessing respectively. They only differ in their configuration of PerceiverImagePreprocessor. Let's take a closer look at PerceiverForImageClassificationLearned. It initializes a PerceiverModel as follows:

from torch import nn
from transformers import PerceiverModel
from transformers.models.perceiver.modeling_perceiver import PerceiverImagePreprocessor, PerceiverClassificationDecoder

class PerceiverForImageClassificationLearned(nn.Module):
    def __init__(self, config):
        super().__init__()

        self.perceiver = PerceiverModel(
            config,
            input_preprocessor=PerceiverImagePreprocessor(
                config,
                prep_type="conv1x1",
                spatial_downsample=1,
                out_channels=256,
                position_encoding_type="trainable",
                concat_or_add_pos="concat",
                project_pos_dim=256,
                trainable_position_encoding_kwargs=dict(num_channels=256, index_dims=config.image_size ** 2),
            ),
            decoder=PerceiverClassificationDecoder(
                config,
                num_channels=config.d_latents,
                trainable_position_encoding_kwargs=dict(num_channels=config.d_latents, index_dims=1),
                use_query_residual=True,
            ),
        )

One can see that PerceiverImagePreprocessor is initialized with prep_type = "conv1x1" and that one adds arguments for the trainable position encodings. So how does this preprocessor work in detail? Suppose that one provides a batch of images to the model. Let us say one first applies center cropping to a resolution of 224 and normalization of the color channels, such that the inputs are of shape (batch_size, num_channels, height, width) = (batch_size, 3, 224, 224). One can use PerceiverImageProcessor for this, as follows:

from transformers import PerceiverImageProcessor
import requests
from PIL import Image

processor = PerceiverImageProcessor.from_pretrained("deepmind/vision-perceiver")

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt").pixel_values
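
The preprocessed pixel values can then be fed to the model to obtain logits. This is a sketch, assuming the deepmind/vision-perceiver-learned checkpoint (fine-tuned on ImageNet) is used:

from transformers import PerceiverForImageClassificationLearned

model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")

outputs = model(inputs=inputs)   # `inputs` are the pixel values prepared above
logits = outputs.logits          # shape (1, 1000)
print(model.config.id2label[logits.argmax(-1).item()])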

PerceiverImagePreprocessor (with the settings defined above) will first apply a convolutional layer with kernel size (1, 1) to turn the inputs into a tensor of shape (batch_size, 256, 224, 224), hence increasing the channel dimension. It will then place the channel dimension last, so one now has a tensor of shape (batch_size, 224, 224, 256). Next, it flattens the spatial (height + width) dimensions such that one has a tensor of shape (batch_size, 50176, 256). Next, it concatenates it with trainable 1D position embeddings. As the dimensionality of the position embeddings is defined to be 256 (see the num_channels argument above), one is left with a tensor of shape (batch_size, 50176, 512). This tensor will be used for the cross-attention operation with the latents.

The authors use 512 latents for all image models, and set the dimensionality of the latents to 1024. Hence, the latents are a tensor of shape (batch_size, 512, 1024), assuming we add a batch dimension. The cross-attention layer takes the queries of shape (batch_size, 512, 1024) and keys + values of shape (batch_size, 50176, 512) as input, and produces a tensor that has the same shape as the queries, so it outputs a new tensor of shape (batch_size, 512, 1024). Next, a block of 6 self-attention layers is applied repeatedly (8 times), to produce final hidden states of the latents of shape (batch_size, 512, 1024). To turn these into classification logits, PerceiverClassificationDecoder is used, which works similarly to the one for text classification: it uses the latents as keys + values, and uses trainable position embeddings of shape (batch_size, 1, num_labels) as queries. The output of the cross-attention operation is a tensor of shape (batch_size, 1, num_labels), which is squeezed to obtain classification logits of shape (batch_size, num_labels).

The Perceiver authors show that the model is capable of achieving strong results compared to models designed primarily for image classification (such as ResNet or ViT). After large-scale pre-training on JFT, the model that uses conv+maxpool preprocessing (PerceiverForImageClassificationConvProcessing) achieves 84.5 top-1 accuracy on ImageNet. Remarkably, PerceiverForImageClassificationLearned, the model that only employs a 1D fully learned position encoding, achieves a top-1 accuracy of 72.7 despite having no privileged information about the 2D structure of images.



Perceiver for optical flow

The authors show that it's straightforward to make the Perceiver also work on optical flow, which is a decades-old problem in computer vision, with many broader applications. For an introduction to optical flow, I refer to this blog post. Given two images of the same scene (e.g. two consecutive frames of a video), the task is to estimate the 2D displacement for each pixel in the first image. Existing algorithms are quite hand-engineered and complex; however, with the Perceiver, this becomes relatively simple. The model is implemented in the Transformers library, and available as PerceiverForOpticalFlow. It is implemented as follows:

from torch import nn
from transformers import PerceiverModel
from transformers.models.perceiver.modeling_perceiver import PerceiverImagePreprocessor, PerceiverOpticalFlowDecoder

class PerceiverForOpticalFlow(nn.Module):
    def __init__(self, config):
        super().__init__()

        fourier_position_encoding_kwargs_preprocessor = dict(
            num_bands=64,
            max_resolution=config.train_size,
            sine_only=False,
            concat_pos=True,
        )
        fourier_position_encoding_kwargs_decoder = dict(
            concat_pos=True, max_resolution=config.train_size, num_bands=64, sine_only=False
        )
        
        image_preprocessor = PerceiverImagePreprocessor(
            config,
            prep_type="patches",
            spatial_downsample=1,
            conv_after_patching=True,
            conv_after_patching_in_channels=54,
            temporal_downsample=2,
            position_encoding_type="fourier",
            
            fourier_position_encoding_kwargs=fourier_position_encoding_kwargs_preprocessor,
        )
        
        self.perceiver = PerceiverModel(
            config,
            input_preprocessor=image_preprocessor,
            decoder=PerceiverOpticalFlowDecoder(
                config,
                num_channels=image_preprocessor.num_channels,
                output_image_shape=config.train_size,
                rescale_factor=100.0,
                use_query_residual=False,
                output_num_channels=2,
                position_encoding_type="fourier",
                fourier_position_encoding_kwargs=fourier_position_encoding_kwargs_decoder,
            ),
        )

As one can see, PerceiverImagePreprocessor is used as preprocessor (i.e. to prepare the 2 images for the cross-attention operation with the latents) and PerceiverOpticalFlowDecoder is used as decoder (i.e. to decode the final hidden states of the latents into an actual predicted flow). For each of the 2 frames, the authors extract a 3 x 3 patch around each pixel, leading to 3 x 3 x 3 = 27 values for each pixel (as each pixel also has 3 color channels). The authors use a training resolution of (368, 496). If one stacks the 2 frames of size (368, 496) of each training example on top of each other, the inputs to the model are of shape (batch_size, 2, 27, 368, 496).
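
Such inputs can be prepared with a small helper like the one below. This is a sketch using random frames for illustration; it mimics the patch extraction described above (padding the borders so every pixel gets a full 3 x 3 neighbourhood):

import torch

def extract_patches(frames, patch_size=3):
    # frames: (batch_size, num_frames, num_channels, height, width)
    pad = patch_size // 2
    # pad height and width so every pixel has a full 3x3 neighbourhood
    frames = torch.nn.functional.pad(frames, (pad, pad, pad, pad))
    # unfold the height and width dimensions into sliding 3x3 windows
    patches = frames.unfold(3, patch_size, 1).unfold(4, patch_size, 1)
    b, t, c, h, w, _, _ = patches.shape
    # move the 3x3 window next to the channel dimension and flatten: 3 * 3 * 3 = 27 values per pixel
    return patches.permute(0, 1, 2, 5, 6, 3, 4).reshape(b, t, c * patch_size ** 2, h, w)

frames = torch.randn(1, 2, 3, 368, 496)   # two consecutive frames of one training example
patches = extract_patches(frames)
print(patches.shape)                      # torch.Size([1, 2, 27, 368, 496])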

The preprocessor (with the settings defined above) will first concatenate the frames along the channel dimension, leading to a tensor of shape (batch_size, 368, 496, 54), assuming one also moves the channel dimension to be last. The authors explain in their paper (page 8) why concatenation along the channel dimension makes sense. Next, the spatial dimensions are flattened, leading to a tensor of shape (batch_size, 368*496, 54) = (batch_size, 182528, 54). The conv_after_patching layer then projects the 54 channels to 64, after which fixed Fourier position embeddings (each of which has dimensionality 258) are concatenated, leading to a final preprocessed input of shape (batch_size, 182528, 322). These will be used to perform cross-attention with the latents.

The authors use 2048 latents for the optical flow model (yes, 2048!), with a dimensionality of 512 for each latent. Hence, the latents have shape (batch_size, 2048, 512). After the cross-attention, one again has a tensor of the same shape (as the latents act as queries). Next, a single block of 24 self-attention layers (each of which has 16 attention heads) is applied to update the embeddings of the latents.

To decode the final hidden states of the latents into an actual predicted flow, PerceiverOpticalFlowDecoder simply uses the preprocessed inputs of shape (batch_size, 182528, 322) as queries for the cross-attention operation. Next, these are projected to a tensor of shape (batch_size, 182528, 2). Finally, one rescales and reshapes this back to the original image size to get a predicted flow of shape (batch_size, 368, 496, 2). The authors claim state-of-the-art results on important benchmarks including Sintel and KITTI when training on AutoFlow, a large synthetic dataset of 400,000 annotated image pairs.
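
Putting things together, a forward pass then looks roughly like this (a sketch, assuming the deepmind/optical-flow-perceiver checkpoint and the random patches prepared above):

from transformers import PerceiverForOpticalFlow

model = PerceiverForOpticalFlow.from_pretrained("deepmind/optical-flow-perceiver")

outputs = model(inputs=patches)   # `patches` of shape (1, 2, 27, 368, 496), as prepared above
flow = outputs.logits             # predicted flow of shape (1, 368, 496, 2)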

The video below shows the predicted flow on 2 examples.



Optical flow estimation by Perceiver IO. The color of each pixel shows the direction and speed of motion estimated by the model, as indicated by the legend on the right.



Perceiver for multimodal autoencoding

The authors also use the Perceiver for multimodal autoencoding. The goal of multimodal autoencoding is to learn a model that can accurately reconstruct multimodal inputs in the presence of a bottleneck induced by an architecture. The authors train the model on the Kinetics-700 dataset, in which each example consists of a sequence of images (i.e. frames), audio and a class label (one of 700 possible labels). This model is also implemented in HuggingFace Transformers, and available as PerceiverForMultimodalAutoencoding. For brevity, I will omit the code for defining this model, but important to note is that it uses PerceiverMultimodalPreprocessor to prepare the inputs for the model. This preprocessor will first use the respective preprocessor for each modality (image, audio, label) individually. Suppose one has a video of 16 frames at resolution 224x224 and 30,720 audio samples; then the modalities are preprocessed as follows (the resulting shapes are verified in a short arithmetic sketch after this list):

  • The images (actually a sequence of frames) of shape (batch_size, 16, 3, 224, 224) are turned into a tensor of shape (batch_size, 50176, 243) using PerceiverImagePreprocessor. This is a "space to depth" transformation, after which fixed Fourier position embeddings are concatenated.
  • The audio has shape (batch_size, 30720, 1) and is turned into a tensor of shape (batch_size, 1920, 401) using PerceiverAudioPreprocessor (which concatenates fixed Fourier position embeddings to the raw audio).
  • The class label of shape (batch_size, 700) is turned into a tensor of shape (batch_size, 1, 700) using PerceiverOneHotPreprocessor. In other words, this preprocessor just adds a dummy time (index) dimension. Note that one initializes the class label with a tensor of zeros during evaluation, so as to let the model act as a video classifier.
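
The position counts and channel dimensions above can be checked with a bit of arithmetic. This is a sketch: the space-to-depth block size, the number of audio samples per patch and the Fourier band counts are assumptions based on the released configuration, not values stated above.

# images: space-to-depth with spatial block 4 (and temporal block 1)
num_image_positions = 16 * (224 // 4) * (224 // 4)        # 50176
image_channels = 3 * 4 * 4                                # 48 pixel values per position
image_fourier = 2 * 3 * 32 + 3                            # 195 Fourier features (3 index dims, 32 bands)
print(num_image_positions, image_channels + image_fourier)   # 50176 243

# audio: 16 raw samples per patch, 192 Fourier bands over 1 index dim
num_audio_positions = 30720 // 16                         # 1920
audio_fourier = 2 * 1 * 192 + 1                           # 385 Fourier features
print(num_audio_positions, 16 + audio_fourier)            # 1920 401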

Next, PerceiverMultimodalPreprocessor will pad the preprocessed modalities with modality-specific trainable embeddings to make concatenation along the time dimension possible. In this case, the modality with the highest channel dimension is the class label (it has 700 channels). The authors enforce a minimum padding size of 4, hence each modality will be padded to have 704 channels. They can then be concatenated, hence the final preprocessed input is a tensor of shape (batch_size, 50176 + 1920 + 1, 704) = (batch_size, 52097, 704).

The authors use 784 latents, with a dimensionality of 512 for each latent. Hence, the latents have shape (batch_size, 784, 512). After the cross-attention, one again has a tensor of the same shape (as the latents act as queries). Next, a single block of 8 self-attention layers (each of which has 8 attention heads) is applied to update the embeddings of the latents.

Next, there is PerceiverMultimodalDecoder, which will first create output queries for each modality individually. However, as it is not possible to decode an entire video in a single forward pass, the authors instead auto-encode in chunks. Each chunk will subsample certain index dimensions for every modality. Let us say we process the video in 128 chunks; then the decoder queries will be produced as follows:

  • For the image modality, the total size of the decoder query is 16 x 224 x 224 = 802,816. However, when auto-encoding the first chunk, one subsamples only the first 802,816/128 = 6272 values. The shape of the image output query is (batch_size, 6272, 195); the 195 comes from the fact that fixed Fourier position embeddings are used.
  • For the audio modality, the total input has 30,720 values. However, one only subsamples the first 30720/128/16 = 15 values. Hence, the shape of the audio query is (batch_size, 15, 385). Here, the 385 comes from the fact that fixed Fourier position embeddings are used.
  • For the class label modality, there is no need to subsample. Hence, the subsampled index is set to 1. The shape of the label output query is (batch_size, 1, 1024). One uses trainable position embeddings (of size 1024) for the queries.

Similarly to the preprocessor, PerceiverMultimodalDecoder pads the different modalities to the same number of channels, to make concatenation of the modality-specific queries possible along the time dimension. Here, the class label again has the highest number of channels (1024), and the authors enforce a minimum padding size of 2, hence every modality will be padded to have 1026 channels. After concatenation, the final decoder query has shape (batch_size, 6272 + 15 + 1, 1026) = (batch_size, 6288, 1026). This tensor produces queries in the cross-attention operation, while the latents act as keys and values. Hence, the output of the cross-attention operation is a tensor of shape (batch_size, 6288, 1026). Next, PerceiverMultimodalDecoder employs a linear layer to reduce the output channels, resulting in a tensor of shape (batch_size, 6288, 512).
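
In code, auto-encoding a single chunk looks roughly as follows. This is a sketch with random inputs; the deepmind/multimodal-perceiver checkpoint, the samples_per_patch config attribute and the subsampled_output_points argument are assumptions based on the released model, and the latter selects which decoder query indices are decoded in this chunk:

import torch
from transformers import PerceiverForMultimodalAutoencoding

model = PerceiverForMultimodalAutoencoding.from_pretrained("deepmind/multimodal-perceiver")

# random inputs for illustration: 16 frames of 224x224 video, 30,720 audio samples, zeroed label
inputs = {
    "image": torch.randn(1, 16, 3, 224, 224),
    "audio": torch.randn(1, 30720, 1),
    "label": torch.zeros(1, 700),
}

nchunks = 128
image_chunk_size = 16 * 224 * 224 // nchunks                            # 6272
audio_chunk_size = 30720 // model.config.samples_per_patch // nchunks   # 15

# decode only the first chunk
subsampling = {
    "image": torch.arange(0, image_chunk_size),
    "audio": torch.arange(0, audio_chunk_size),
    "label": None,
}

outputs = model(inputs=inputs, subsampled_output_points=subsampling)
# outputs.logits should be a dict with the reconstructed chunk per modality (after postprocessing, see below)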

Finally, there is PerceiverMultimodalPostprocessor. This class postprocesses the output of the decoder to produce an actual reconstruction of each modality. It first splits up the time dimension of the decoder output according to the different modalities: (batch_size, 6272, 512) for image, (batch_size, 15, 512) for audio and (batch_size, 1, 512) for the class label. Next, the respective postprocessors for each modality are applied:

  • The image postprocessor (which is called PerceiverProjectionPostprocessor in Transformers) simply turns the (batch_size, 6272, 512) tensor into a tensor of shape (batch_size, 6272, 3), i.e. it projects the final dimension onto RGB values.
  • PerceiverAudioPostprocessor turns the (batch_size, 15, 512) tensor into a tensor of shape (batch_size, 240).
  • PerceiverClassificationPostprocessor simply takes the first (and only) index, to get a tensor of shape (batch_size, 700).

So now one ends up with tensors containing the reconstruction of the image, audio and class label modalities respectively. As one auto-encodes an entire video in chunks, one needs to concatenate the reconstructions of all chunks to obtain the final reconstruction of the entire video. The figure below shows an example:



Above: original video (left), reconstruction of the first 16 frames (right). Video taken from the UCF101 dataset. Below: reconstructed audio (taken from the paper).

Top 5 predicted labels for the video above. By masking the class label, the Perceiver becomes a video classifier.

With this approach, the model learns a joint distribution across 3 modalities. The authors do note that because the latent variables are shared across modalities and not explicitly allocated between them, the quality of the reconstructions for each modality is sensitive to the weight of its loss term and other training hyperparameters. By putting stronger emphasis on classification accuracy, they are able to reach 45% top-1 accuracy while maintaining 20.7 PSNR (peak signal-to-noise ratio) for video.



Other applications of the Perceiver

Note that there are no limits on the applications of the Perceiver! In the original Perceiver paper, the authors showed that the architecture can be used to process 3D point clouds, a common concern for self-driving cars equipped with Lidar sensors. They trained the model on ModelNet40, a dataset of point clouds derived from 3D triangular meshes spanning 40 object categories. The model was shown to achieve a top-1 accuracy of 85.7% on the test set, competing with PointNet++, a highly specialized model that uses extra geometric features and performs more advanced augmentations.

The authors also used the Perceiver to replace the original Transformer in AlphaStar, the state-of-the-art reinforcement learning system for the complex game of StarCraft II. Without tuning any additional parameters, the authors observed that the resulting agent reached the same level of performance as the original AlphaStar agent, reaching an 87% win-rate versus the Elite bot after behavioral cloning on human data.

It's important to note that the models currently implemented (such as PerceiverForImageClassificationLearned and PerceiverForOpticalFlow) are just examples of what you can do with the Perceiver. Each of these is a different instance of PerceiverModel, just with a different preprocessor and/or decoder (and optionally, a postprocessor, as is the case for multimodal autoencoding). People can come up with new preprocessors, decoders and postprocessors to make the model solve different problems. For instance, one could extend the Perceiver to perform named-entity recognition (NER) or question answering similar to BERT, audio classification similar to Wav2Vec2, or object detection similar to DETR.



Conclusion

In this blog post, we went over the architecture of Perceiver IO, an extension of the Perceiver by Google DeepMind, and showed its generality in handling all kinds of modalities. The big advantage of the Perceiver is that the compute and memory requirements of the self-attention mechanism don't depend on the size of the inputs and outputs, as the bulk of compute happens in a latent space (a not-too-large set of vectors). Despite its task-agnostic architecture, the model is capable of achieving great results on modalities such as language, vision, multimodal data, and point clouds. In the future, it might be interesting to train a single (shared) Perceiver encoder on several modalities at the same time, and use modality-specific preprocessors and postprocessors. As Karpathy puts it, it may well be that this architecture can unify all modalities into a shared space, with a library of encoders/decoders.

Speaking of a library, the model is available in HuggingFace Transformers as of today. It will be exciting to see what people build with it, as its applications seem endless!



Appendix

The implementation in HuggingFace Transformers is based on the original JAX/Haiku implementation, which can be found here.

The documentation of the Perceiver IO model in HuggingFace Transformers is available here.

Tutorial notebooks regarding the Perceiver on several modalities can be found here.


Footnotes

¹ Note that in the official paper, the authors used a two-layer MLP to generate the output logits, which was omitted here for brevity.




