This is part 4 of my multi-part series 🐍 Towards Mamba State Space Models for Images, Videos and Time Series.
The field of computer vision has seen incredible advances in recent years. One of the key enablers of this development has undoubtedly been the introduction of the Transformer. While the Transformer revolutionized natural language processing, it took some years to transfer its capabilities to the vision domain. Probably the most prominent paper was the Vision Transformer (ViT), a model that is still used as the backbone in many modern architectures.
It is again the Transformer’s O(L²) complexity that limits its application as the image resolution grows. Equipped with the Mamba selective state space model, we are now able to let history repeat itself and transfer the success of SSMs from sequence data to non-sequence data: images.
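To get a feel for how quickly this quadratic cost explodes, here is a minimal sketch (my own illustration, not code from any of the papers discussed) that counts ViT-style patch tokens at different resolutions, assuming a patch size of 16:

```python
# Illustration only: how self-attention's O(L^2) cost grows with image resolution.
# Assumes a ViT-style patch embedding with patch size 16 (an assumption for this sketch).

patch_size = 16

for resolution in (224, 512, 1248):
    num_patches = (resolution // patch_size) ** 2  # sequence length L
    attention_cost = num_patches ** 2              # pairwise token interactions, O(L^2)
    print(f"{resolution}x{resolution}: L = {num_patches:5d}, L^2 = {attention_cost:,}")
```

Going from 224×224 to 1248×1248 increases the sequence length from 196 to 6,084 tokens, which multiplies the number of pairwise interactions by roughly 960x. That is exactly the high-resolution regime where the quadratic cost hurts most.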
❗ Spoiler Alert: Vision Mamba is 2.8x faster than DeiT and saves 86.8% GPU memory on high-resolution images (1248×1248), and in this article, you’ll see how…