Alibaba has introduced the new open-source Qwen3.5 series, built for native multimodal agents. The flagship model in this series is a ~400B-parameter native vision-language model (VLM) with reasoning, built on a hybrid architecture of mixture of experts (MoE) and Gated Delta Networks. Qwen3.5 can understand and navigate user interfaces, improving on the previous generation of VLMs.
Qwen3.5 suits a wide range of use cases, including:
- Coding, including web development
- Visual reasoning, including mobile and web interfaces
- Chat applications
- Complex search
| Qwen3.5 | |
| --- | --- |
| Modalities | Vision, language |
| Total parameters | 397B |
| Active parameters | 17B |
| Activation rate | 4.28% |
| Input context length | 256K, extensible to 1M tokens |
| Languages supported | 200+ |
| **Additional configuration information** | |
| Experts | 512 |
| Shared experts | 1 |
| Experts per token | 11 (10 routed + 1 shared) |
| Layers | 60 |
| Vocabulary size | 248,320 |
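The activation rate in the table follows directly from the ratio of active to total parameters, which is what makes an MoE model of this size practical to serve:

```python
total_params = 397e9   # 397B total parameters
active_params = 17e9   # 17B parameters active per token

# Only a small fraction of the network runs for any given token
activation_rate = active_params / total_params
print(f"{activation_rate:.2%}")  # 4.28%
```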
Build with NVIDIA endpoints
You can start building with Qwen3.5 today with free access to GPU-accelerated endpoints on build.nvidia.com, powered by NVIDIA Blackwell GPUs. As part of the NVIDIA Developer Program, you can explore quickly in the browser, experiment with prompts, and even test the model with your own data to evaluate real-world performance.
You can also use the NVIDIA-hosted model through the API, free with registration in the NVIDIA Developer Program.
import os
import requests

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

headers = {
    # Read the API key from the environment rather than hardcoding it
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": ""
        }
    ],
    "model": "qwen/qwen3.5-397b-a17b",
    "chat_template_kwargs": {
        "thinking": True
    },
    "frequency_penalty": 0,
    "max_tokens": 16384,
    "presence_penalty": 0,
    # stream=False returns a single JSON body; set True for server-sent events
    "stream": False,
    "temperature": 1,
    "top_p": 1
}

# Re-use connections across requests
session = requests.Session()

response = session.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
To take advantage of tool calling, simply define an array of OpenAI-compatible tools to add to the chat completions tools parameter.
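As a sketch, a tools array following the OpenAI function-calling schema might look like the following. The get_weather function here is a hypothetical illustration, not a real API:

```python
import json

# Hypothetical tool definition following the OpenAI function-calling schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # illustrative name only
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"]
            }
        }
    }
]

# Attach the array to the chat completions payload
payload = {
    "model": "qwen/qwen3.5-397b-a17b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": tools,
}

print(json.dumps(payload, indent=2))
```

When the model decides a tool is needed, the response contains a tool call with JSON arguments matching this schema, which your application executes and returns in a follow-up message.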
NVIDIA NIM makes it easy to take Qwen3.5 from development into production. Available as optimized, containerized inference microservices, NIM packages the model with the performance tuning, standardized APIs, and deployment flexibility enterprises need. Download and run it anywhere: on-premises, in the cloud, or across hybrid environments.
Customize with NVIDIA NeMo
While Qwen3.5 offers impressive out-of-the-box multimodal capabilities, the NVIDIA NeMo framework provides the tools to adapt it for specialized domain needs. Using the NeMo Automodel library, developers can fine-tune the Qwen3.5 397B-parameter architecture with high-throughput efficiency.
NeMo Automodel is a PyTorch-native training library that provides Day 0 Hugging Face support, enabling direct training on existing checkpoints without tedious model conversions. This facilitates rapid experimentation, whether performing full supervised fine-tuning (SFT) or using memory-efficient methods such as LoRA.
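LoRA's memory efficiency comes from freezing the pretrained weights and training only a low-rank update. A minimal PyTorch sketch of the idea, independent of NeMo Automodel's actual API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Only the two low-rank factors receive gradients, so optimizer state and gradient memory shrink by orders of magnitude relative to full fine-tuning, which is what makes adapting a model of Qwen3.5's scale tractable.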
As a reference implementation guide, developers can leverage the technical tutorial on Medical Visual QA, which details how to fine-tune Qwen3.5 on radiological datasets. At large scale, NeMo supports multinode Slurm and Kubernetes deployments, ensuring that even the largest MoE models are optimized for domain-specific reasoning and complex agentic workflows with minimal latency.
Get started with Qwen3.5
From data center deployments on NVIDIA Blackwell to NVIDIA NIM microservices for containerized deployment anywhere, NVIDIA offers solutions for integrating Qwen3.5. To get started, visit the Qwen3.5 model page on Hugging Face and test Qwen3.5 on build.nvidia.com.
