With the release of the Aya Vision family, our new 8B and 32B parameter vision-language models (VLMs), we're addressing one of the biggest challenges in AI: bringing multilingual performance to multimodal models.
Aya Vision is Cohere For AI's latest open-weight multilingual and multimodal model family, designed to be a strong foundation for language and vision understanding across 23 languages. It builds on the success of Aya Expanse, our state-of-the-art multilingual language models, and extends it using a mix of advanced techniques. These include synthetic annotations, scaling up multilingual data through translation and rephrasing, and multimodal model merging – key methods that improve both language and vision understanding in a multilingual setting.
As a result, our models perform well across a variety of tasks, including image captioning, visual question answering, text generation, and translating both text and images into clear, natural-language text. We evaluated Aya Vision models on a set of datasets, including our new open-ended vision-language benchmark AyaVisionBench and a multilingual version of Wild Vision Bench (mWildVision) translated into 23 languages, both of which we release for research.
In pairwise comparison, Aya Vision 32B outperforms models more than twice its size, such as Llama-3.2 90B Vision, Molmo 72B, and Qwen2.5-VL 72B, with win rates ranging from 50% to 64% on AyaVisionBench and 52% to 72% on mWildVision, averaged across 23 languages.
Our compact and more efficient model, Aya Vision 8B, achieves the best multilingual multimodal performance in its parameter class, outperforming leading models such as Qwen2.5-VL 7B, Pixtral 12B, Gemini Flash 1.5 8B, Llama-3.2 11B Vision, Molmo-D 7B, and Pangea 7B with win rates of up to 79% on AyaVisionBench and 81% on mWildVision.
We release both the 8B and 32B models as open weights for the research community to further accelerate multilingual multimodal progress. In this blog post, we share the key technical details behind the Aya Vision models.
Aya Vision Architecture and Training
For a high-performance vision-language model, it is important to process images at arbitrary resolutions, especially high-resolution images. To enable this capability in Aya Vision, we dynamically resize and split any higher-resolution image into multiple tiles to generate rich image features from the image encoder. In Aya Vision models, we use the recently released SigLIP2-patch14-384 model as the initialization for the vision encoder.
While dynamic resizing enables processing high-resolution images, it also results in a larger number of image tokens passing through the vision-language connector and LLM decoder. To improve latency and throughput, we use a downsampling method called Pixel Shuffle to compress the number of image tokens by 4x. After downsampling, image tokens are aligned to the language model input embeddings through a vision-language connector and passed to the LLM decoder.
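To make the 4x compression concrete, here is a minimal sketch of pixel-shuffle-style token downsampling: each 2x2 block of neighbouring image tokens is folded into a single token with 4x the channels. The tensor shapes, the grid size, and the factor r=2 are illustrative assumptions, not the exact Aya Vision implementation.

```python
import torch

def pixel_shuffle_downsample(image_tokens: torch.Tensor, r: int = 2) -> torch.Tensor:
    """Fold each r x r neighbourhood of patch tokens into the channel dimension,
    reducing the token count by r**2 (4x for r=2)."""
    b, n, c = image_tokens.shape            # (batch, num_tokens, hidden)
    h = w = int(n ** 0.5)                   # assume a square grid of patch tokens
    x = image_tokens.view(b, h, w, c)       # restore the 2D patch grid
    x = x.view(b, h // r, r, w // r, r, c)  # split the grid into r x r blocks
    x = x.permute(0, 1, 3, 2, 4, 5)         # bring each block's tokens together
    return x.reshape(b, (h // r) * (w // r), r * r * c)  # 4x fewer, 4x wider tokens

# Hypothetical example: a 28x28 patch grid with hidden size 1152
tokens = torch.randn(1, 28 * 28, 1152)
print(pixel_shuffle_downsample(tokens).shape)  # torch.Size([1, 196, 4608])
```

The connector then projects these wider, fewer tokens into the language model's embedding space, which is where the latency and throughput savings come from.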
For the text decoder, we use our multilingual language models. For Aya Vision 8B, we use an LLM initialized from Cohere Command R7B for improved instruction following and world knowledge, further post-trained using the Aya Expanse recipe, which consists of diverse multilingual data, model merging, and preference training. For Aya Vision 32B, we initialize the language model from Aya Expanse 32B based on its state-of-the-art multilingual performance.
Training process
We trained Aya Vision models in two stages – vision-language alignment and supervised fine-tuning (SFT). In the vision-language alignment stage, only the vision-language connector is trained, while the vision encoder and the language model weights are kept frozen. This enables rudimentary vision-language understanding by mapping the image encoder features to the language model embedding space. In the SFT stage, we train both the connector and the language model on a diverse set of multimodal tasks in 23 languages.
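The sketch below illustrates how the two stages differ in which parameters receive gradients, assuming a toy model with placeholder submodules; the attribute names are hypothetical stand-ins for the actual SigLIP2 encoder, connector, and Aya/Command language model.

```python
import torch.nn as nn

class ToyVLM(nn.Module):
    """Stand-in for a VLM with the three components described above."""
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)   # placeholder for SigLIP2
        self.connector = nn.Linear(8, 8)        # placeholder vision-language connector
        self.language_model = nn.Linear(8, 8)   # placeholder LLM decoder

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: ToyVLM, stage: str) -> None:
    if stage == "alignment":
        # Stage 1: only the vision-language connector is trained.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.language_model, False)
        set_trainable(model.connector, True)
    elif stage == "sft":
        # Stage 2: connector and language model are trained on multimodal SFT data;
        # the vision encoder stays frozen.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.connector, True)
        set_trainable(model.language_model, True)
    else:
        raise ValueError(f"unknown stage: {stage}")

model = ToyVLM()
configure_stage(model, "alignment")
```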
Multimodal Data Enhancement and Expanding Language Coverage
One of the biggest challenges in developing a multilingual vision-language model is ensuring strong performance across underrepresented languages. To address this, we first gather synthetic annotations using a diverse pool of high-quality English datasets, which lay the basis for our multilingual multimodal annotation. Following the synthetic annotation of the English datasets, we translated a large volume of the data into 23 languages. To avoid translation artefacts and maintain fluent, precise answers, we then rephrased the translated prompt/generation pairs by matching them against the original high-quality synthetic samples, expanding language coverage where real-world datasets are scarce. This improves both linguistic fluency and alignment between vision and text, allowing Aya Vision to exhibit superior image understanding in multiple languages.
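A hedged sketch of this translate-then-rephrase pipeline is below. The `translate` and `rephrase_llm` callables stand in for whatever translation system and rephrasing LLM were actually used; both they and the prompt wording are hypothetical, not the Aya Vision data pipeline itself.

```python
from typing import Callable, Dict, List

def expand_to_language(
    english_pairs: List[Dict[str, str]],      # synthetic {"prompt", "response"} pairs
    target_lang: str,
    translate: Callable[[str, str], str],     # (text, target language) -> translation
    rephrase_llm: Callable[[str], str],       # instruction -> fluent rewrite
) -> List[Dict[str, str]]:
    expanded = []
    for pair in english_pairs:
        prompt_t = translate(pair["prompt"], target_lang)
        response_t = translate(pair["response"], target_lang)
        # Rephrase the raw translation against the original high-quality sample
        # to remove translation artefacts and restore natural fluency.
        instruction = (
            f"Rewrite the following {target_lang} answer so it reads fluently and "
            f"stays faithful to the original English answer.\n\n"
            f"Original (English): {pair['response']}\n"
            f"Translated ({target_lang}): {response_t}"
        )
        expanded.append({"prompt": prompt_t, "response": rephrase_llm(instruction)})
    return expanded
```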
Our 8B model, when supervised fine-tuned only on the original academic datasets, reaches a 40.9% win rate across 23 languages on AyaVisionBench against Pangea 7B, a multilingual VLM, whereas synthetic annotations and scaling up the multilingual data lead to a 58.1% win rate, a gain of 17.2%. This significant improvement showcases the impact of substantial investment in multilingual data coverage.
Multimodal Model Merging
A state-of-the-art vision-language model should excel not only at image understanding but also in conversational context, where the model is expected to generate high-quality responses to both image and text inputs. To address this, inspired by our previous research on model merging, a technique that combines multiple trained models, we merge the base language model with the fine-tuned vision-language model.
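One common way to realize such a merge is linear interpolation of the language-model weights of the two checkpoints. The sketch below shows that idea only; the interpolation weight and state-dict layout are assumptions, and the blog does not publish the exact merging recipe used for Aya Vision.

```python
import torch

@torch.no_grad()
def merge_language_weights(base_lm_state: dict, vlm_lm_state: dict, alpha: float = 0.5) -> dict:
    """Return a merged state dict: alpha * fine-tuned VLM backbone + (1 - alpha) * base LM.

    Assumes both state dicts come from architecturally identical language models."""
    merged = {}
    for name, base_param in base_lm_state.items():
        vlm_param = vlm_lm_state[name]
        merged[name] = alpha * vlm_param + (1.0 - alpha) * base_param
    return merged
```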
Model merging enhances the generative capabilities of our final model, leading to a 70% win rate across 23 languages on AyaVisionBench against Pangea 7B and improving the multimodal win rate by 11.9% compared with the model before merging.
Multimodal model merging also enables our Aya Vision models to excel on text-only tasks, as measured on the mArenaHard dataset, compared with other leading vision-language models.
Scaling up to 32B
Finally, we scale our recipe from 8B to 32B, resulting in the state-of-the-art open-weight multilingual vision-language model, Aya Vision 32B. It shows significant improvements in win rates due to the stronger initialization of the text backbone, and outperforms models more than twice its size, such as Llama-3.2 90B Vision, Molmo 72B, and Qwen2.5-VL 72B, with win rates ranging from 49% to 63% on AyaVisionBench and 52% to 72% on mWildVision, averaged across 23 languages.
Aya Vision Benchmark – a multilingual evaluation dataset
Alongside the Aya Vision models, we also release a high-quality multilingual vision-language benchmark called AyaVisionBench, built around real-world applications and covering 23 languages and 9 distinct task categories, with 135 image-question pairs per language.
We make this evaluation set available to the research community to push forward multilingual multimodal evaluations. The dataset is designed to assess a model's ability to perform a diverse range of vision-language tasks, including captioning, chart and figure understanding, identifying differences between two images, general visual question answering, OCR, document understanding, text transcription, reasoning involving logic and math, and converting screenshots to code. By incorporating multiple languages and task types, the dataset provides a broad and challenging evaluation framework for assessing cross-lingual and multimodal understanding.
To create this dataset, we first selected images from the Cauldron held-out test set, a large collection derived from 50 high-quality datasets, ensuring they had not been seen during training. For each image, we then generated a corresponding question that explicitly required visual context to answer. These questions were synthetically generated and subsequently refined through a two-stage verification process. First, human annotators reviewed and validated each question to ensure it was clear, relevant, and genuinely dependent on the image. This rigorous selection and validation process ensures that the dataset serves as a robust benchmark for evaluating vision-language models in multilingual and real-world settings.
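For researchers who want to run their own evaluations, loading the benchmark should look roughly like the sketch below. The repository id, split name, and column names are assumptions; check the Aya Vision collection on Hugging Face for the exact identifiers and per-language configurations.

```python
from datasets import load_dataset

# Hypothetical repo id and split; the dataset may instead expose one config per language.
bench = load_dataset("CohereForAI/AyaVisionBench", split="test")

for example in bench.select(range(3)):
    # Each example is expected to pair an image with a language-specific question.
    print(example.keys())
```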
Designed for real-world applications
Communication happens in many forms and in many languages. With our leading research and development, we've released a model that facilitates connection, whether textual or visual, in 23 different languages today.
Aya Vision has a wide range of practical applications; one notable example is its availability on WhatsApp, one of the most broadly used communication platforms in the world. This enables a vast audience of global citizens who speak a multitude of languages to use the capabilities of Aya Vision on a platform they communicate on every day.
Getting Started with Aya
To get started:
Download weights and datasets from the Aya Vision collection on Hugging Face.
Try Aya Vision using our Hugging Face Space or text it on WhatsApp.
Build on Aya using our Colab example, or start from the minimal loading sketch after this list.
Learn more about our ongoing multilingual efforts.
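As a quick-start reference, here is a minimal sketch of loading Aya Vision 8B with the standard transformers image-text-to-text chat-template flow. The repository id "CohereForAI/aya-vision-8b" and the example image URL are assumptions; consult the model card on Hugging Face for the exact usage.

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "CohereForAI/aya-vision-8b"  # assumed repo id; check the Aya Vision collection
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

messages = [
    {"role": "user",
     "content": [
         {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
         {"type": "text", "text": "Describe this image in French."},
     ]},
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(processor.tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
))
```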
Acknowledgments
This work wouldn’t have been possible without the core Aya Vision technical team:
Saurabh Dash, Oliver Nan, John Dang, Arash Ahmadian Dehkordi, Shivalika Singh, Alejandro Salamanca, Bharat Venkitesh, Vlad Shmyhlo, Walter Beller-Morales, Jeremy Pekmez, Jason Ozuzu, Madeline Smith, Marzieh Fadaee, Manoj Govindassamy, Sudip Roy, Matthias Gallé, Beyza Ermis, Ahmet Üstün, Sara Hooker.
It also wouldn’t have been possible without the broader Cohere For AI and Cohere team who supported in many alternative ways. Special because of Sungjin Hong, Michael Kozakov, Pierre Richemond, Brittawnya Prince, Jim Payne, Kyle Lastovica, Jeff Colen, Jenna Cook, Viraat Aryabumi, Trent Fowler, Linus Chui, Meor Amer, Lucas Fayoux, Kyle Lastovica, Billy Trend, Acyr Locatelli, Morgan Norman, Florian Strub, Jon Ander Campos, Nick Frosst, Phil Blunsom, Aidan Gomez, Ivan Zhang.
Special thanks to Hugging Face for helping make this come together: Yoni Gozlan, Arthur Zucker, Pedro Cuenca, Aritra Roy Gosthipaty, Merve Noyan, Vaibhav Srivastav.
References
[1] Aya Expanse: Combining Research Breakthroughs for a New Multilingual Frontier
[2] Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
[3] WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
[4] SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
[5] What matters when building vision-language models?
[6] Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
[7] How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites








