Alibaba has introduced the new open-source Qwen3.5 series, built for native multimodal agents. The flagship model in this series is a ~400B-parameter native vision-language model (VLM) with reasoning, built on a hybrid architecture of mixture of experts (MoE) and Gated Delta Networks. Qwen3.5 can understand and navigate user interfaces, improving on the previous generation of VLMs.
Qwen3.5 suits a wide range of use cases, including:
- Coding, including web development
- Visual reasoning, including mobile and web interfaces
- Chat applications
- Complex search
| Qwen3.5 | |
| --- | --- |
| Modalities | Vision, language |
| Total parameters | 397B |
| Active parameters | 17B |
| Activation rate | 4.28% |
| Input context length | 256K, extensible to 1M tokens |
| Languages supported | 200+ |
| **Additional configuration information** | |
| Experts | 512 |
| Shared experts | 1 |
| Experts per token | 11 (10 routed + 1 shared) |
| Layers | 60 |
| Vocabulary size | 248,320 |
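The activation rate in the table follows directly from the ratio of active to total parameters, which is what makes an MoE model of this size practical to serve:

```python
total_params = 397e9   # 397B total parameters
active_params = 17e9   # 17B parameters active per token

# Only a small fraction of the network runs for any given token
activation_rate = active_params / total_params
print(f"{activation_rate:.2%}")  # 4.28%
```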
Build with NVIDIA endpoints
You can start building with Qwen3.5 today with free access to GPU-accelerated endpoints on build.nvidia.com, powered by NVIDIA Blackwell GPUs. As part of the NVIDIA Developer Program, you can explore quickly in the browser, experiment with prompts, and even test the model with your own data to evaluate real-world performance.
You can also use the NVIDIA-hosted model through the API, free with registration in the NVIDIA Developer Program.
import os
import requests

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

headers = {
    # Read the API key from the environment rather than hardcoding it
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": ""
        }
    ],
    "model": "qwen/qwen3.5-397b-a17b",
    "chat_template_kwargs": {
        "thinking": True
    },
    "frequency_penalty": 0,
    "max_tokens": 16384,
    "presence_penalty": 0,
    # stream=False returns a single JSON body; set True for server-sent events
    "stream": False,
    "temperature": 1,
    "top_p": 1
}

# Re-use connections across requests
session = requests.Session()

response = session.post(invoke_url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
To take advantage of tool calling, simply define an array of OpenAI-compatible tools to add to the chat completions tools parameter.
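As a sketch, a tools array following the OpenAI function-calling schema might look like the following. The get_weather function here is a hypothetical illustration, not a real API:

```python
import json

# Hypothetical tool definition following the OpenAI function-calling schema
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # illustrative name only
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"]
            }
        }
    }
]

# Attach the array to the chat completions payload
payload = {
    "model": "qwen/qwen3.5-397b-a17b",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": tools,
}

print(json.dumps(payload, indent=2))
```

When the model decides a tool is needed, the response contains a tool call with JSON arguments matching this schema, which your application executes and returns in a follow-up message.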
NVIDIA NIM makes it easy to take Qwen3.5 from development into production. Available as optimized, containerized inference microservices, NIM packages the model with the performance tuning, standardized APIs, and deployment flexibility enterprises need. Download and run it anywhere: on-premises, in the cloud, or across hybrid environments.
Customize with NVIDIA NeMo
While Qwen3.5 offers impressive out-of-the-box multimodal capabilities, the NVIDIA NeMo framework provides the tools to adapt it for specialized domain needs. Using the NeMo Automodel library, developers can fine-tune the Qwen3.5 397B-parameter architecture with high-throughput efficiency.
NeMo Automodel is a PyTorch-native training library that provides Day 0 Hugging Face support, enabling direct training on existing checkpoints without tedious model conversions. This facilitates rapid experimentation, whether performing full supervised fine-tuning (SFT) or using memory-efficient methods such as LoRA.
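LoRA's memory efficiency comes from freezing the pretrained weights and training only a low-rank update. A minimal PyTorch sketch of the idea, independent of NeMo Automodel's actual API:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} of {total}")
```

Only the two low-rank factors receive gradients, so optimizer state and gradient memory shrink by orders of magnitude relative to full fine-tuning, which is what makes adapting a model of Qwen3.5's scale tractable.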
As a reference implementation guide, developers can leverage the technical tutorial on Medical Visual QA, which details how to fine-tune Qwen3.5 on radiological datasets. At large scale, NeMo supports multinode Slurm and Kubernetes deployments, ensuring that even the largest MoE models are optimized for domain-specific reasoning and complex agentic workflows with minimal latency.
Get started with Qwen3.5
From data center deployments on NVIDIA Blackwell to NVIDIA NIM microservices for containerized deployment anywhere, NVIDIA offers solutions for integrating Qwen3.5. To get started, visit the Qwen3.5 model page on Hugging Face and test Qwen3.5 on build.nvidia.com.
