Build with Kimi K2.5 Multimodal VLM Using NVIDIA GPU-Accelerated Endpoints



Kimi K2.5 is the latest open vision language model (VLM) in the Kimi family of models. It is a general-purpose multimodal model that excels at high-demand tasks such as agentic AI workflows, chat, reasoning, coding, mathematics, and more.

The model was trained using the open source Megatron-LM framework. Megatron-LM provides GPU-optimized, accelerated computing at scale through several forms of parallelism (tensor, data, and sequence) for training massive transformer-based models.

The model architecture builds on leading state-of-the-art large open models for efficiency and capability. It consists of 384 experts with a single dense layer and one shared expert, which allows for smaller experts and specialized routing across modalities. Kimi K2.5 activates only about 32.86B of its 1T total parameters per token, a 3.2% activation rate.

| Specification | Kimi K2.5 |
|---|---|
| Modalities | Text, image, video |
| Total parameters | 1T |
| Active parameters | 32.86B |
| Activation rate | 3.2% |
| Input context length | 262K |
| Additional configuration information | |
| # experts | 384 |
| # shared experts | 1 |
| # experts per token | 8 |
| # layers | 61 (1 dense, 60 MoE) |
| # attention heads | 64 |
| Vocab size | ~164K |

Table 1. Specifications and configuration details for the Kimi K2.5 model

For vision capability, the large training vocabulary of ~164K tokens incorporates vision-specific tokens. Kimi created the MoonViT3d Vision Tower for the visual processing component of this model, which converts images and video frames into embeddings.

Illustration of the Kimi K2.5 vision pipeline, which consists of a Vision Tower (MoonViT3d) (left), a Visual and Text Embedding Merger (center), and a Language Model (right).
Figure 1. Kimi K2.5 vision pipeline 

Build with NVIDIA GPU-accelerated endpoints

You can start building with Kimi K2.5 with free access to GPU-accelerated endpoints for prototyping on build.nvidia.com as part of the NVIDIA Developer Program. You can use your own data within the browser experience. NVIDIA NIM microservices, containers for production inference, are coming soon.

Video 1. Learn how you can test Kimi K2.5 on NVIDIA GPU-accelerated endpoints

You can also use the NVIDIA-hosted model through the API, free with registration in the NVIDIA Developer Program.

import os

import requests

invoke_url = "https://integrate.api.nvidia.com/v1/chat/completions"

# Read the API key from the environment rather than hardcoding it
headers = {
    "Authorization": f"Bearer {os.environ['NVIDIA_API_KEY']}",
    "Accept": "application/json",
}

payload = {
    "messages": [
        {
            "role": "user",
            "content": ""  # add your prompt here
        }
    ],
    "model": "moonshotai/kimi-k2.5",
    "chat_template_kwargs": {
        "thinking": True
    },
    "frequency_penalty": 0,
    "max_tokens": 16384,
    "presence_penalty": 0,
    "stream": True,
    "temperature": 1,
    "top_p": 1
}

# Re-use connections
session = requests.Session()

response = session.post(invoke_url, headers=headers, json=payload, stream=True)
response.raise_for_status()

# With stream=True, the endpoint returns server-sent events; read them line by line
for line in response.iter_lines():
    if line:
        print(line.decode("utf-8"))
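Because Kimi K2.5 also accepts image inputs (see Table 1), the same endpoint can be called with multimodal messages. The following is a minimal sketch using the OpenAI-style content-parts format; the placeholder image URL and the exact fields accepted are assumptions, so check the model page on build.nvidia.com for the supported request format.

# Minimal sketch of a multimodal request: one text part plus one image part.
# The content-parts structure follows the OpenAI chat completions convention;
# the image URL below is a placeholder.
multimodal_payload = dict(payload)
multimodal_payload["messages"] = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/sample.png"}},
        ],
    }
]

response = session.post(invoke_url, headers=headers, json=multimodal_payload, stream=True)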

To take advantage of tool calling, define an array of OpenAI-compatible tools to add to the chat completions tools parameter, as sketched below.
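For example, here is a hedged sketch of adding a single tool to the payload above; the tool name, description, and parameter schema are illustrative assumptions, not fields defined by Kimi K2.5 or the NVIDIA API.

# Illustrative OpenAI-compatible tool definition; the function name and schema
# below are examples only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    }
]

payload["tools"] = tools
payload["tool_choice"] = "auto"  # let the model decide when to call a tool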

Deploying with vLLM 

When deploying models with the vLLM serving framework, use the following commands. For more information, see the vLLM recipe for Kimi K2.5.

$ uv venv
$ source .venv/bin/activate
$ uv pip install -U vllm --pre \
    --extra-index-url https://wheels.vllm.ai/nightly/cu129 \
    --extra-index-url https://download.pytorch.org/whl/cu129 \
    --index-strategy unsafe-best-match
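After the environment is set up, serving generally comes down to a single vllm serve command. The command below is only a sketch: the model ID and parallelism settings are assumptions for a multi-GPU node, so follow the vLLM recipe linked above for the validated configuration.

$ vllm serve moonshotai/kimi-k2.5 \
    --tensor-parallel-size 8 \
    --trust-remote-code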

Fine-tuning with NVIDIA NeMo Framework

Kimi K2.5 can be customized and fine-tuned with the open source NeMo Framework using the NeMo AutoModel library to adapt the model for domain-specific multimodal tasks, agentic workflows, and enterprise reasoning use cases.

NeMo Framework is a collection of open libraries enabling scalable model pretraining and post-training, including supervised fine-tuning, parameter-efficient methods, and reinforcement learning for models of all sizes and modalities. 

NeMo AutoModel is a PyTorch Distributed-native training library within NeMo Framework that provides high-throughput training directly on the Hugging Face checkpoint, with no need for conversion. This gives developers and researchers a lightweight, flexible tool for rapid experimentation on the latest frontier models.

Try fine-tuning Kimi K2.5 with the NeMo AutoModel recipe. 

Get started with Kimi K2.5

From data center deployments on NVIDIA Blackwell to the fully managed enterprise NVIDIA NIM microservice, NVIDIA offers solutions for your integration of Kimi K2.5. To get started, visit the Kimi K2.5 model page on Hugging Face and the Kimi API Platform, and test Kimi K2.5 in the build.nvidia.com playground.


