EAGLE: Exploring the Design Space for Multimodal Large Language Models with a Mixture of Encoders

The ability to accurately interpret complex visual information is a key focus of multimodal large language models (MLLMs). Recent work shows that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. Several recent MLLMs achieve this by using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This article provides an in-depth look at Eagle, a framework that explores the design space for MLLMs using a mixture of vision encoders and resolutions. The findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. Eagle finds that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. Moreover, Eagle introduces Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.

Eagle’s work is related to the general architecture design of multimodal large language models (MLLMs). Besides the representative lines of open-source research, other notable families of MLLMs include, but are not limited to, MiniGPT-4, Lynx, Otter, QwenVL, CogVLM, VILA, GPT-4V, Gemini, and Llama 3.1. Depending on how vision signals are integrated into the language model, MLLMs can be broadly categorized into “cross-modal attention” models and “prefix-tuning” models. The former injects visual information into different layers of the LLM using cross-modal attention, whereas the latter treats the visual tokens as part of the language token sequence and directly appends them to the text embeddings. Eagle’s model belongs to the prefix-tuning family, following a LLaVA-style multimodal architecture. Because MLLMs are a fast-growing field, the Eagle authors recommend referring to more detailed studies and surveys for further insights.

Eagle’s work is closely related to research focused on improving vision encoder designs for MLLMs. Early works often adopted vision encoders pre-trained on vision-language alignment tasks, such as CLIP and EVA-CLIP. Stronger vision encoders, such as SigLIP and InternVL, have been proposed to improve vision-language tasks with better designs, larger model sizes, and simpler training recipes. Since models are often pre-trained on low-resolution images and may lack the ability to encode fine-grained details, higher-resolution adaptation is frequently performed to increase the MLLM input resolution. In addition to higher-resolution adaptation, models like LLaVA-NeXT, LLaVA-UHD, Monkey, InternLM-XComposer, and InternVL use tiling or adaptive tiling to handle high-resolution input, where images are divided into lower-resolution patches and processed individually. Eagle instead gains the ability to handle higher resolution by introducing additional vision experts. This approach differs slightly from tiling techniques, though the two are compatible and can be combined.

The success of large language models (LLMs) has sparked significant interest in enabling their visual perception capabilities, allowing them to see, understand, and reason in the real world. At the core of these multimodal large language models (MLLMs) is a common design where images are converted into a series of visual tokens by the vision encoders and appended to the text embeddings. CLIP is commonly chosen as the vision encoder because its visual representation is aligned with the text space by pre-training on image-text pairs. Depending on the architectures, training recipes, and the way vision tokens are injected into the language model, notable families of MLLMs include Flamingo, BLIP, PaLI, PaLM-E, and LLaVA. Most of these models maintain relatively low input resolutions because of limitations in pre-trained vision encoders and LLM sequence length. Eagle’s work is closely aligned with models that use multiple vision encoders for improved perception. Mini-Gemini and LLaVA-HR propose fusing high-resolution visual features into low-resolution visual tokens. Beyond resolution issues, these pre-trained vision encoders may lack specific capabilities such as reading text or localizing objects. To address this, various models integrate vision encoders pre-trained on different vision tasks to enhance the vision encoder’s capabilities.

For example, models like Mousi and Brave fuse visual tokens from different vision encoders by concatenating along the channel or token direction. RADIO introduces a multi-teacher distillation method to unify the abilities of various vision encoders into a single model. MoAI, IVE, and Prismer further use the output of vision experts, such as OCR, detection, or depth estimation, to supply additional information for MLLMs to generate answers. MoVA devises a routing network to assign an optimal vision model based on the given image and instructions.

Recent studies have shown that stronger vision encoder designs are important for reducing MLLM hallucinations and improving performance on resolution-sensitive tasks like optical character recognition (OCR). Several works concentrate on enhancing the capability of the vision encoder, either by scaling up the pre-training data and parameters or by dividing images into low-resolution patches. However, these approaches often introduce large training resource demands. An efficient yet powerful strategy is mixing visual encoders pre-trained with different tasks and input resolutions, either by fusing higher-resolution encoders with the CLIP encoder, sequentially appending features from different encoders, or adopting more complex fusion and routing strategies to maximize the benefits of various encoders. This “mixture-of-vision-experts” approach has proven effective, though an in-depth study of its design space with rigorous ablation is still lacking, motivating Eagle to revisit this area. Key questions remain: which vision encoder combinations to choose, how to fuse different experts, and how to adjust training strategies with more vision encoders.

To address these questions, Eagle systematically investigates the mixture-of-vision-encoders design space for improved MLLM perception. The exploration of this design space involves the following steps: 1) Benchmarking various vision encoders and searching for higher-resolution adaptation; 2) Conducting an “apples to apples” comparison between vision encoder fusion strategies; 3) Progressively identifying the optimal combination of multiple vision encoders; 4) Improving vision expert pre-alignment and data mixture. The exploration steps are illustrated in the following image.

Eagle’s study covers the performance of vision encoders pre-trained on different tasks and resolutions, such as vision-language alignment, self-supervised learning, detection, segmentation, and OCR. Using a round-robin approach, Eagle begins with the basic CLIP encoder and adds one additional expert at a time, choosing the expert that gives the best improvement in each round.
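The round-robin search described above can be sketched as a greedy loop. This is a minimal illustration, not Eagle's actual code: the function name `greedy_expert_selection`, the `score_fn` callback (which in practice would mean training and evaluating an MLLM with the given encoder set), and the toy gain values are all assumptions made so the example runs.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_expert_selection(
    base: str,
    candidates: Sequence[str],
    score_fn: Callable[[Tuple[str, ...]], float],
    rounds: int,
) -> List[str]:
    """Start from `base` and add one expert per round, keeping the best scorer."""
    selected = [base]
    remaining = list(candidates)
    for _ in range(min(rounds, len(remaining))):
        # Score every "current set + one candidate" combination this round.
        best = max(remaining, key=lambda name: score_fn(tuple(selected) + (name,)))
        selected.append(best)
        remaining.remove(best)
    return selected


# Placeholder gains, invented purely so the sketch runs; not results from Eagle.
TOY_GAIN = {"expert_a": 1.0, "expert_b": 3.0, "expert_c": 2.0}


def toy_score(combo: Tuple[str, ...]) -> float:
    """Hypothetical stand-in for a full train-and-evaluate run."""
    return sum(TOY_GAIN.get(name, 0.0) for name in combo)


print(greedy_expert_selection("clip", list(TOY_GAIN), toy_score, rounds=2))
# prints ['clip', 'expert_b', 'expert_c']
```

Each round costs one evaluation per remaining candidate, which is why a greedy search of this kind is far cheaper than evaluating every possible subset of encoders.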

While Eagle’s work is not the first to leverage multiple vision encoders in MLLMs, the systematic study leads to several key findings under this setting:

  • Unlocking the vision encoders during MLLM training matters. This is in contrast to LLaVA and other works that consider multiple vision encoders or teachers, where freezing the vision encoders has been common practice.
  • Some recently proposed fusion strategies don’t show significant benefits. Instead, straightforward channel concatenation emerges as a simple yet competitive fusion strategy, offering the best efficiency and performance.
  • Incorporating additional vision experts leads to consistent gains. This makes it a promising path for systematically enhancing MLLM perception, aside from scaling up single encoders. The improvement is particularly pronounced when vision encoders are unlocked.
  • A pre-alignment stage is important. Eagle introduces a pre-alignment stage where non-text-aligned vision experts are individually fine-tuned with a frozen LLM before being trained together. This stage significantly enhances MLLM performance under the mixture-of-vision-encoder design.

Eagle: Methodology and Architecture

Unlike previous methods that concentrate on novel fusion strategies or architectures among vision encoders, Eagle’s goal is to identify a minimalistic design for fusing different vision encoders, supported by detailed ablations and with any unnecessary components removed. As shown in the following figure, Eagle starts by extending the basic CLIP encoder to a set of vision experts with different architectures, pre-training tasks, and resolutions. With these experts, Eagle then compares different fusion architectures and methods and explores how to optimize pre-training strategies with multiple encoders.

Finally, Eagle combines all of the findings and extends the approach to multiple expert vision encoders with various resolutions and domain knowledge. Eagle uses the same pre-training data as LLaVA-1.5, which consists of 595k image-text pairs. For the supervised fine-tuning stage, Eagle collects data from a series of tasks, including LLaVA-1.5, Laion-GPT4V, ShareGPT-4V, DocVQA, synDog-EN, ChartQA, DVQA, and AI2D, and converts them into multimodal conversations, resulting in 934k samples.

The model is first pre-trained with image-text pairs for one epoch with a batch size of 256, where the entire model is frozen and only the projector layer is updated. In the second stage, the model is fine-tuned on the supervised fine-tuning data for one epoch with a batch size of 128. For this exploration, Eagle employs Vicuna-7B as the underlying language model. The learning rates are set to 1e-3 for the first stage and 2e-5 for the second stage.
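The two-stage recipe above can be written down as a small configuration sketch. The `StageConfig` class and its field names are hypothetical, the sample counts are the rounded figures quoted in the text (595k and 934k), and the list of trainable modules for the second stage is an assumption, since the text only says the model is fine-tuned.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one training stage of the recipe."""
    name: str
    num_samples: int          # rounded figure quoted in the article
    epochs: int
    batch_size: int
    learning_rate: float
    trainable: Tuple[str, ...]

    def optimizer_steps(self) -> int:
        # Ceiling division: a final partial batch still counts as one step.
        steps_per_epoch = -(-self.num_samples // self.batch_size)
        return steps_per_epoch * self.epochs


PRETRAIN = StageConfig(
    name="pre-training (image-text pairs)",
    num_samples=595_000,
    epochs=1,
    batch_size=256,
    learning_rate=1e-3,
    trainable=("projector",),  # everything else stays frozen
)

FINETUNE = StageConfig(
    name="supervised fine-tuning",
    num_samples=934_000,
    epochs=1,
    batch_size=128,
    learning_rate=2e-5,
    # Assumption: all modules are updated; the article does not list them here.
    trainable=("vision_encoders", "projector", "llm"),
)

for stage in (PRETRAIN, FINETUNE):
    print(f"{stage.name}: ~{stage.optimizer_steps()} steps at lr={stage.learning_rate}")
```

With these rounded sample counts, the first stage comes to roughly 2.3k optimizer steps and the second to roughly 7.3k, so most of the compute is spent in fine-tuning.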

Stronger CLIP Encoder

Eagle begins the exploration with the CLIP model, since it has become the primary choice for many MLLMs. While CLIP models are known to benefit multimodal tasks, their limitations have also been well-documented. For instance, many existing MLLMs tend to use the pre-trained CLIP resolutions (such as 224 × 224 or 336 × 336) as their input resolutions. In these cases, the encoders often struggle to capture fine-grained details important for resolution-sensitive tasks like OCR and document understanding.

To handle increased input resolution, a common approach is tiling, where input images are divided into tiles and encoded individually. Another, simpler method is to directly scale up the input resolution and interpolate the position embeddings of the vision transformer model if necessary. Eagle compares these two approaches with frozen and unfrozen vision encoders across different resolutions, with the results reported in the table above. The findings can be summarized as follows:

  • Unfreezing the CLIP encoder leads to significant improvement when interpolating to a higher MLLM input resolution that differs from the CLIP pre-training resolution, without performance degradation when the resolutions remain the same.
  • Freezing the CLIP encoder and directly adapting it to a higher MLLM input resolution significantly harms performance.
  • Among the strategies compared, directly interpolating to 448 × 448 with an unfrozen CLIP encoder proves to be both effective and efficient in terms of performance and cost.
  • The best CLIP encoder achieves performance close to InternVL, despite being a much smaller model (300M vs. 6B) with less pre-training data.

It is worth noting that CLIP-448 allows Eagle to match the setting with LLaVA-HR and InternVL, where the CLIP encoders are similarly adapted to take 448 × 448 input and output 1024 patch tokens. For further investigation, Eagle follows this simple strategy of scaling up the input resolution and unlocking the vision encoder during training.
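The relationship between input resolution and token count can be checked with a short calculation. This sketch assumes a ViT-style encoder with a patch size of 14 pixels (as in CLIP ViT-L/14) and ignores any class token; the function name `num_patch_tokens` is illustrative only.

```python
def num_patch_tokens(image_size: int, patch_size: int = 14) -> int:
    """Patch tokens a ViT emits for a square image (class token excluded)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size  # patches along one edge of the image
    return side * side


for size in (224, 336, 448):
    print(f"{size} x {size} -> {num_patch_tokens(size)} patch tokens")
# 224 x 224 -> 256 patch tokens
# 336 x 336 -> 576 patch tokens
# 448 x 448 -> 1024 patch tokens
```

Under this assumption, moving from 336 × 336 to 448 × 448 grows the position grid from 24 × 24 to 32 × 32, which is why the position embeddings need to be interpolated when the input resolution is scaled up directly.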

Eagle observes that existing popular fusion strategies, despite their design variations, can be broadly categorized as follows:

  1. Sequence Append: Directly appending the visual tokens from different backbones as a longer sequence.
  2. Channel Concatenation: Concatenating the visual tokens along the channel dimension without increasing the sequence length.
  3. LLaVA-HR: Injecting high-resolution features into low-resolution vision encoders using a mixture-of-resolution adapter.
  4. Mini-Gemini: Using the CLIP tokens as low-resolution queries to cross-attend to another high-resolution vision encoder in co-located local windows.
  5. Deformable Attention: A new baseline introduced on top of Mini-Gemini, where the vanilla window attention is replaced with deformable attention.
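The difference between the first two strategies in the list is easiest to see in terms of tensor shapes. The following is a minimal pure-Python sketch using nested lists in place of real tensors; the helper names are hypothetical, and it assumes the experts' outputs have already been brought to a compatible number of tokens or channels.

```python
from typing import List

Tokens = List[List[float]]  # shape: [num_tokens][channels]


def make_tokens(num_tokens: int, channels: int, fill: float) -> Tokens:
    """Hypothetical stand-in for one vision encoder's output."""
    return [[fill] * channels for _ in range(num_tokens)]


def sequence_append(experts: List[Tokens]) -> Tokens:
    """Stack tokens end to end: sequence length grows, channels stay fixed."""
    if len({len(tokens[0]) for tokens in experts}) != 1:
        raise ValueError("sequence append needs matching channel sizes")
    return [row for tokens in experts for row in tokens]


def channel_concat(experts: List[Tokens]) -> Tokens:
    """Join features token by token: channels grow, sequence length stays fixed."""
    if len({len(tokens) for tokens in experts}) != 1:
        raise ValueError("channel concatenation needs matching token counts")
    num_tokens = len(experts[0])
    return [[v for tokens in experts for v in tokens[i]] for i in range(num_tokens)]


a = make_tokens(4, 3, 0.0)
b = make_tokens(4, 2, 1.0)
c = make_tokens(4, 3, 1.0)

appended = sequence_append([a, c])
merged = channel_concat([a, b])
print(len(appended), len(appended[0]))  # 8 3
print(len(merged), len(merged[0]))      # 4 5
```

Because channel concatenation keeps the token count constant, adding another expert does not lengthen the sequence passed to the language model; only the projector's input width changes.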

Instead of training a projector to simultaneously align multiple vision experts as in LLaVA’s original pre-training strategy, Eagle first aligns the representation of each individual expert with a smaller language model (Vicuna-7B in practice) using next-token-prediction supervision. As shown in the figure below, with pre-alignment, the entire training process consists of three steps: 1) training each pre-trained vision expert with its own projector on SFT data, while keeping the language model frozen; 2) combining all of the vision experts from the first step and training only the projector with image-text pair data; 3) training the entire model on the SFT data.
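The three steps can be summarized as a schedule of which modules are trainable at each point. This is only a sketch of the procedure as described above: the function name, the dictionary layout, and the example expert list are assumptions, and no actual training is performed.

```python
from typing import Dict, List, Sequence


def pre_alignment_schedule(expert_names: Sequence[str]) -> List[Dict[str, object]]:
    """List the training runs implied by the three-step recipe, in order."""
    schedule: List[Dict[str, object]] = []

    # Step 1: one run per expert, each with its own projector; the LLM is frozen.
    for name in expert_names:
        schedule.append({
            "step": f"1: pre-align {name}",
            "data": "SFT data",
            "trainable": {"vision_experts": True, "projector": True, "llm": False},
        })

    # Step 2: combine the pre-aligned experts and train only the projector.
    schedule.append({
        "step": "2: joint projector training",
        "data": "image-text pairs",
        "trainable": {"vision_experts": False, "projector": True, "llm": False},
    })

    # Step 3: train the whole model.
    schedule.append({
        "step": "3: supervised fine-tuning",
        "data": "SFT data",
        "trainable": {"vision_experts": True, "projector": True, "llm": True},
    })
    return schedule


# Illustrative expert list; which experts are pre-aligned is a choice made here.
for run in pre_alignment_schedule(["ConvNeXt", "SAM"]):
    print(run["step"], "|", run["data"], "|", run["trainable"])
```

The point of the first step is that each expert learns to produce language-compatible features on its own before the experts are combined, so the shared projector in the second step starts from representations that are already close to the text space.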

Eagle: Experiments and Results

After meticulously developing its strategies, Eagle has established the following principles for the model: (1) integrating more vision experts with an optimized training recipe; (2) combining multiple vision experts through direct channel concatenation; (3) pre-training the vision experts individually via pre-alignment. In this section, to further demonstrate the advantages of the Eagle models, additional training data is incorporated, and Eagle is compared against the current state-of-the-art MLLMs across various tasks. Eagle uses Vicuna-v1.5-7B, Llama3-8B, and Vicuna-v1.5-13B as the language models. For the vision encoders, based on the results in Section 2.6, Eagle models are denoted as Eagle-X4, which includes four vision encoders (CLIP, ConvNeXt, Pix2Struct, and EVA-02), and Eagle-X5, which includes an additional SAM vision encoder.

Visual Question Answering Tasks

Eagle compares the model series across three Visual Question Answering (VQA) benchmarks: GQA, VQAv2, and VizWiz. As shown in the following table, Eagle-X5 achieves state-of-the-art performance on GQA and VQAv2, highlighting the advantages of incorporating additional vision experts.

OCR and Chart Understanding Tasks

To evaluate the OCR, document, and chart understanding capabilities of Eagle, the model is benchmarked on OCRBench, TextVQA, and ChartQA. As shown in the table above, Eagle significantly surpasses competitors on TextVQA, benefiting from its high-resolution architecture and integration of various vision encoders. Notably, Eagle maintains a simple design, supporting up to 1024 tokens without requiring complex tile decomposition of images.

The figure below presents examples of OCR and document understanding cases. With high-resolution adaptation and the inclusion of more vision experts, Eagle can identify small text inside images and accurately extract information based on user instructions.

To better understand the advantages of introducing experts pre-trained on other vision tasks, the following figure visualizes results from a model with only the ConvNeXt and CLIP vision encoders, compared with the results of Eagle-X5. With the full set of vision encoders, the model successfully corrects mistakes, demonstrating that even when equipped with high-resolution vision encoders pre-trained on vision-language alignment, Eagle’s capabilities are further enhanced by integrating additional vision experts pre-trained on diverse vision tasks.

Multimodal Benchmark Evaluation

Eagle is evaluated on seven benchmarks for MLLMs to demonstrate its capabilities from different perspectives, including MME, MMBench, SEED, MathVista, MMMU, ScienceQA, and POPE. Specifically, MME, MMBench, and SEED assess the overall performance on various real-world tasks involving reasoning, recognition, knowledge, and OCR. MMMU focuses on challenging problems from diverse domains that require college-level knowledge. POPE evaluates the visual hallucinations of MLLMs. The metrics used in this evaluation adhere to the default settings of these benchmarks. Eagle reports the perception score for MME, the en_dev split for MMBench, the image split of SEED, the test-mini split of MathVista, the val split of MMMU, the F1-score of POPE, and the image score for ScienceQA, ensuring alignment with the reported scores from other models.
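Since POPE results are reported as an F1-score, a short sketch of that metric may help. It assumes binary yes/no answers with "yes" treated as the positive class; the function name and the toy answers are illustrative and are not taken from the benchmark's official evaluation script.

```python
from typing import Sequence


def f1_score(predictions: Sequence[str], labels: Sequence[str], positive: str = "yes") -> float:
    """Harmonic mean of precision and recall for the `positive` answer."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p == positive and l == positive)
    fp = sum(1 for p, l in pairs if p == positive and l != positive)
    fn = sum(1 for p, l in pairs if p != positive and l == positive)
    if tp == 0:
        return 0.0  # no correct positive answers: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy answers, invented for illustration only.
preds = ["yes", "yes", "no", "no"]
truth = ["yes", "no", "yes", "no"]
print(f1_score(preds, truth))  # 0.5
```

A model that hallucinates objects tends to answer "yes" too often, which lowers precision, so F1 penalizes that behavior more than plain accuracy on a balanced set would suggest.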

Final Thoughts

In this article, we have discussed Eagle, an in-depth analysis of the design space for integrating vision encoders into multimodal large language models. Unlike previous works that concentrate on designing novel fusion paradigms, Eagle finds that systematic design decisions matter and discovers a series of useful techniques. Step by step, Eagle optimizes the training recipe of individual vision encoders, identifies an extendable and efficient fusion method, and gradually combines vision encoders with different domain knowledge. The results highlight the critical importance of basic design space considerations.
