Multi-modal LLM development has been advancing fast lately.
Although proprietary multi-modal models like GPT-4V, GPT-4o, Gemini, and Claude 3.5 Sonnet are probably the most eye-catching performers today, open-source models such as LLaVA, Llama 3-V, and Qwen-VL have been steadily catching up in terms of performance on public benchmarks.
Just last month, Nvidia released its open-source multi-modal LLM family called NVLM. The family comprises three architectures: a) decoder-based, b) cross-attention-based, and c) hybrid. The decoder-based model feeds both the image and text tokens into a pre-trained LLM, similar to the LLaVA model. The cross-attention-based model uses the image token embeddings as the keys and values while using the text token embeddings as the queries; since the attention is computed across different sources, it's called "cross-attention," as in the original transformer decoder, rather than the self-attention used in decoder-only models. The hybrid architecture is a novel design that combines the decoder and cross-attention architectures to benefit multi-modal reasoning, reduce the number of training parameters, and handle high-resolution input. The 72B decoder-based NVLM-D model achieved impressive performance, beating state-of-the-art open-source and proprietary models on tasks like natural image understanding and OCR.
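To make the distinction between the decoder-based and cross-attention-based designs concrete, here is a minimal PyTorch sketch of the core wiring described above. It is not NVLM's actual implementation: the dimensions, the linear projector, and the plain cross-attention layer are simplified assumptions (the real models use more elaborate projectors and, in the cross-attention case, gated layers interleaved with the LLM blocks).

```python
import torch
import torch.nn as nn

# ----- (a) Decoder-based (LLaVA-style) sketch -----
# Image patch embeddings are projected into the LLM's embedding space and simply
# concatenated with the text token embeddings; the LLM then runs ordinary
# self-attention over the combined sequence. The projector here is a toy stand-in.
def decoder_style_inputs(image_embeds: torch.Tensor,
                         text_embeds: torch.Tensor,
                         projector: nn.Module) -> torch.Tensor:
    projected = projector(image_embeds)                 # (B, N_img, d_model)
    return torch.cat([projected, text_embeds], dim=1)   # (B, N_img + N_txt, d_model)


# ----- (b) Cross-attention-based (Flamingo-style) sketch -----
# The text hidden states provide the queries; the image embeddings provide the
# keys and values, so attention mixes information from two different sources.
class CrossAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden: torch.Tensor, image_embeds: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(
            query=self.norm(text_hidden),   # queries from the text stream
            key=image_embeds,               # keys from the image tokens
            value=image_embeds,             # values from the image tokens
        )
        return text_hidden + attended       # residual back into the text stream


# Toy usage with made-up sizes: 256 image patches, 16 text tokens, width 1024.
image_embeds = torch.randn(1, 256, 1024)
text_embeds = torch.randn(1, 16, 1024)

fused_seq = decoder_style_inputs(image_embeds, text_embeds, projector=nn.Linear(1024, 1024))
print(fused_seq.shape)   # torch.Size([1, 272, 1024])

xattn_out = CrossAttentionBlock()(text_embeds, image_embeds)
print(xattn_out.shape)   # torch.Size([1, 16, 1024])
```

The key design trade-off this sketch highlights: the decoder-based path lengthens the LLM's input sequence with every image token, while the cross-attention path keeps the text sequence short and injects visual information through extra layers instead.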
In this article, I'm going to walk through the following:
- the dynamic high-resolution (DHR) vision encoder, which all of the NVLM models adopt
- the decoder-based model, NVLM-D, in comparison with LLaVA
- the gated cross-attention model, NVLM-X, in comparison with Flamingo
- the hybrid model, NVLM-H
At the end, I'll show the performance of the NVLM-D 72B model. Compared with state-of-the-art open-source and proprietary models, NVLM-D shows stable performance on text-based tasks and superior performance on natural image understanding and OCR tasks.