Generative AI is opening new possibilities for analyzing existing video streams. Video analytics is evolving from counting objects to turning raw video footage into real-time understanding, enabling more actionable insights.
The NVIDIA AI Blueprint for video search and summarization (VSS) brings together vision language models (VLMs), large language models (LLMs), and retrieval-augmented generation (RAG) with optimized ingestion, retrieval, and storage pipelines. Part of NVIDIA Metropolis, it supports both stored and real-time video understanding.
In previous releases, the VSS Blueprint introduced capabilities such as efficient video ingestion, context-aware RAG, a computer vision (CV) pipeline, and audio transcription. To learn more about these foundational features, see Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization and Build a Video Search and Summarization Agent with NVIDIA AI Blueprint.
This post explains the new features in the latest VSS Blueprint 2.4 release, which delivers four major upgrades that enable developers to:
- Improve physical world understanding: VSS is now integrated with NVIDIA Cosmos Reason, a state-of-the-art reasoning VLM that delivers advanced physical AI reasoning and scene understanding for richer video analytics and insights.
- Enhance Q&A: New knowledge graph features and cross-camera support include multi-stream Q&A, improved knowledge graph generation, agent-based graph traversal, and Neo4j and ArangoDB backends with cuGraph acceleration.
- Unlock generative AI at the edge with Event Reviewer: Review events of interest found by CV pipelines and provide contextual insights with generative AI. New endpoints enable VSS to be configured as an intelligent add-on to CV pipelines, which is ideal for low-latency edge deployments.
- Deploy with expanded hardware support: VSS is now available on multiple platforms built with NVIDIA Blackwell, including NVIDIA Jetson Thor, NVIDIA DGX Spark, and NVIDIA RTX Pro 6000 workstation and server editions.
Improve physical world understanding with Cosmos Reason
Cosmos Reason is an open, customizable, 7-billion-parameter state-of-the-art reasoning VLM for physical AI. It allows vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. Cosmos Reason enables developers to build AI agents that can see, analyze, and act in the physical world by analyzing petabytes of recorded videos or millions of live streams. The Cosmos Reason NIM is now also available, delivering a production-ready VLM endpoint for building intelligent visual AI agents with fast, scalable reasoning.
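As a rough illustration of what calling such an endpoint can look like, the sketch below sends a single frame to an OpenAI-compatible chat endpoint. The base URL, model name, and prompt are placeholders; check the Cosmos Reason NIM documentation for the values used in your deployment.

```python
import base64
from openai import OpenAI

# Placeholder endpoint and model id; consult the Cosmos Reason NIM docs for
# the values used in your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/cosmos-reason1-7b",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe any safety risks visible in this frame."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```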
Video analytics AI agents built with the VSS Blueprint 2.4 can use Cosmos Reason to extract accurate and rich dense captions, enumerate objects of interest with set-of-mark prompting, provide valuable insights, and perform root-cause analysis on footage from multiple industries, including manufacturing lines, logistics warehouses, retail stores, and transportation networks.
VSS 2.4 supports native integration with Cosmos Reason. This support tightly couples the video ingestion process with the VLM to allow for efficient batching and speedups not possible with REST API-based VLM interfaces. Cosmos Reason's small 7B-parameter footprint makes it easy to use for edge deployments as well as in the cloud. Cosmos Reason is fully customizable and can be fine-tuned with proprietary data.
Enhance Q&A with knowledge graph and cross-camera support
Ingesting large amounts of video is challenging because the data is unstructured, continuous, and very high-volume, which makes it difficult to search, index, or summarize efficiently. A single video can span hours of footage, include multiple events happening at once, and require heavy compute just to decode and analyze. Standard computer vision pipelines often can't keep up at scale, producing isolated detections without the broader context needed to understand what's actually happening.
VSS solves this problem by using a GPU-accelerated video ingestion pipeline. As a video file or live stream comes in, it's broken into smaller chunks and the Cosmos Reason VLM generates a rich description or caption of each chunk. An LLM then extracts the important information from the VLM-generated captions to build a knowledge graph that captures the key details of the video. Once the knowledge graph is built, an LLM traverses the graph to answer user questions about the videos.
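The sketch below is a highly simplified, hypothetical version of that flow, with `caption_chunk` and `extract_triples` standing in for the Cosmos Reason VLM and the LLM extraction step. It only illustrates how per-chunk captions become graph facts that an LLM can answer from, not the actual VSS pipeline.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    relation: str
    obj: str

def ingest(video_chunks, caption_chunk, extract_triples):
    """Caption each chunk, then fold the extracted facts into one graph."""
    graph: list[Triple] = []
    for chunk in video_chunks:
        caption = caption_chunk(chunk)          # VLM: dense caption of the chunk
        graph.extend(extract_triples(caption))  # LLM: entities and relationships
    return graph

def answer(question, graph, llm):
    """Have an LLM answer the question from the collected graph facts."""
    facts = "\n".join(f"{t.subject} {t.relation} {t.obj}" for t in graph)
    return llm(f"Facts:\n{facts}\n\nQuestion: {question}")
```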


VSS 2.4 enhances Q&A accuracy and cross-camera understanding with:
- Entity deduplication within the knowledge graph
- Agent-based graph traversal
- CUDA-accelerated graph database
In previous releases of the VSS Blueprint, building the knowledge graph could result in duplicate nodes and edges. In VSS Blueprint 2.4, a knowledge graph post-processing step has been added to remove duplicate entries and merge nodes and edges that are common across videos. This means that common entities, such as the same vehicle moving across multiple cameras, are now merged into a single entity, which improves the ability of VSS to track unique objects as they move throughout a video and across cameras.
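A minimal sketch of what such a deduplication pass could look like is shown below. The node format and the `signature` function (which decides when two nodes refer to the same real-world entity) are assumptions for illustration, not the VSS implementation.

```python
from collections import defaultdict

def dedupe_nodes(nodes, signature):
    """Group nodes by an entity signature and merge their attributes."""
    merged, members = {}, defaultdict(list)
    for node in nodes:
        key = signature(node)  # e.g. derived from appearance or re-ID features
        members[key].append(node["id"])
        merged.setdefault(key, {}).update(node.get("attrs", {}))
    # Map every original node id to its canonical entity so edges can be rewritten.
    remap = {nid: key for key, ids in members.items() for nid in ids}
    return merged, remap

def rewrite_edges(edges, remap):
    """Point edges at canonical entities so duplicate nodes collapse."""
    return [(remap.get(src, src), rel, remap.get(dst, dst)) for src, rel, dst in edges]
```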
Once the knowledge graph has been generated and post-processed, an LLM is used to traverse the graph and gather the information needed to answer the user's questions about the videos.
In VSS 2.4, agent-based reasoning has been introduced for advanced knowledge graph retrieval. If enabled, an LLM-based agent intelligently decomposes the query and then uses a set of tools to search the graph, find relevant metadata, reinspect sampled frames from the video, and iterate if necessary to accurately answer the user's question.
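The loop below sketches this style of agentic retrieval under the assumption of a simple plain-text tool protocol; the tool names (graph search, metadata lookup, frame reinspection) and the prompt format are hypothetical and not the actual VSS agent.

```python
def agentic_qa(question, llm, tools, max_steps=5):
    """Let the LLM pick a tool each step until it decides it can answer."""
    observations = []
    for _ in range(max_steps):
        decision = llm(
            f"Question: {question}\n"
            f"Observations so far: {observations}\n"
            f"Available tools: {list(tools)}\n"
            "Reply with 'tool: <name> <args>' or 'answer: <final answer>'."
        )
        if decision.startswith("answer:"):
            return decision.removeprefix("answer:").strip()
        name, _, args = decision.removeprefix("tool:").strip().partition(" ")
        observations.append(tools[name](args))  # e.g. search_graph, get_metadata, reinspect_frames
    return llm(f"Answer the question '{question}' using: {observations}")
```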


| Benchmark | VSS 2.3.1 accuracy | VSS 2.4 accuracy | Accuracy change |
| --- | --- | --- | --- |
| LongVideoBench | 48.17% | 68.32% | +20.15% |
| MLVU | 61.24% | 71.44% | +10.2% |
With the knowledge graph post-processing that merges entities and relationships, combined with the advanced agent-based retrieval, it's now possible to answer questions that span multiple camera streams.


To provide developers with the latest tools, the supported graph database backends have been expanded to include ArangoDB. Users can now configure VSS to use either the Neo4j or ArangoDB graph database backend. ArangoDB brings a set of enhancements, including CUDA-accelerated graph functions to speed up knowledge graph generation. For more details on the ArangoDB integration with VSS, see Generate a Video Knowledge Graph: NVIDIA VSS Blueprint with GraphRAG on ArangoDB.
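For illustration only, the snippet below shows how an application might open a connection to either backend with the standard Python drivers. VSS itself selects its backend through its own deployment configuration, so the hosts and credentials here are placeholders.

```python
from arango import ArangoClient   # pip install python-arango
from neo4j import GraphDatabase   # pip install neo4j

def connect_graph_backend(backend: str):
    """Return a database handle for the chosen backend (placeholder credentials)."""
    if backend == "neo4j":
        return GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    if backend == "arangodb":
        return ArangoClient(hosts="http://localhost:8529").db(
            "vss", username="root", password="password"
        )
    raise ValueError(f"unknown graph backend: {backend}")
```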
These new knowledge graph generation and agentic Q&A features are best suited for multi-GPU deployments that can handle large LLMs and multiple concurrent VLM requests.
Augment CV pipelines with VSS Event Reviewer
For smaller-scale and edge deployments, the new VSS Event Reviewer feature introduces API endpoints that make it easy to integrate VSS into existing computer vision pipelines for low-latency alerts and direct VLM Q&A on video segments.
Instead of running VSS continuously on all files or streams, Event Reviewer allows VSS to act as an intelligent add-on that delivers VLM insights only for key moments. This approach greatly reduces compute costs, making VSS well-suited for lightweight deployments and edge platforms.
While standard CV pipelines excel at detecting objects and people or applying analytics to identify events, such as possible vehicle collisions, they often generate false positives and lack deeper scene understanding.
VSS can be used to augment these CV pipelines by analyzing short video clips flagged by the CV system, reviewing the detected events, and uncovering additional insights that traditional methods may miss.
Figure 4 shows how VSS can augment an existing pipeline. The computer vision pipeline represents any proprietary system that can take in video files or streams and output short clips of interest. The Event Reviewer endpoints can then be called to pass these short clips to VSS to generate alerts and follow-up Q&A with a VLM.


To demonstrate this feature, a sample DeepStream detection pipeline using GroundingDINO is provided in the VSS GitHub repository. This example pipeline ingests a video, runs detection, and then outputs clips when the number of detected objects is greater than a set threshold. The goal of this pipeline is to find the most important events in the video that should be inspected by VSS with a VLM.
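A simplified, framework-agnostic stand-in for that logic is sketched below; `detect` is a hypothetical per-frame detector, whereas the real sample uses DeepStream with GroundingDINO.

```python
def find_event_clips(frames, detect, fps=30.0, threshold=3, min_gap_s=2.0):
    """Return (start_s, end_s) ranges where per-frame detections exceed the threshold."""
    clips, start = [], None
    for i, frame in enumerate(frames):
        busy = len(detect(frame)) >= threshold
        t = i / fps
        if busy and start is None:
            start = t                      # event begins
        elif not busy and start is not None:
            clips.append((start, t))       # event ends
            start = None
    if start is not None:
        clips.append((start, len(frames) / fps))
    # Merge clips separated by less than min_gap_s so one event yields one clip.
    merged = []
    for s, e in clips:
        if merged and s - merged[-1][1] < min_gap_s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```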
VSS then processes each short clip with the VLM by answering a set of yes/no questions defined by the user. These responses are converted to true/false states for each question and can be used to generate low-latency alerts. Once the short clip has been processed by VSS, you can ask more detailed follow-up questions.
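The snippet below sketches that conversion with a made-up question set and naive answer parsing; it is not the Event Reviewer schema, just an illustration of mapping yes/no responses to alert flags.

```python
ALERT_QUESTIONS = [
    "Is a person inside the restricted area?",
    "Is there a vehicle collision?",
]

def to_alert_states(vlm_answers):
    """Map each yes/no answer to a true/false alert flag, keyed by question."""
    return {
        question: answer.strip().lower().startswith("yes")
        for question, answer in zip(ALERT_QUESTIONS, vlm_answers)
    }

# Example: to_alert_states(["Yes, a worker entered the zone.", "No."])
# -> {"Is a person inside the restricted area?": True,
#     "Is there a vehicle collision?": False}
```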
This approach selectively uses the VLM only on clips of interest, as determined by a lightweight detection pipeline. It can drastically reduce compute cost by reducing VLM calls and freeing up the GPU for other workloads.
Deploy flexibly with expanded hardware support
VSS Blueprint 2.4 fully supports several NVIDIA Blackwell platforms, including NVIDIA RTX Pro 6000 server and workstation editions and NVIDIA Jetson Thor for edge deployments. Support for NVIDIA DGX Spark is coming soon.
| | 1 NVIDIA Jetson Thor | 1-2 NVIDIA RTX PRO 6000 Blackwell WS/SE | 4-8 NVIDIA RTX PRO 6000 Blackwell WS/SE |
| --- | --- | --- | --- |
| LLM | N/A | Llama 3.1 8B | Llama 3.1 70B |
| VLM | Cosmos Reason 1 | Cosmos Reason 1 | Cosmos Reason 1 |
| Recommended usage | Event Review | Event Review, Video Summarization, Video Q&A (Vector RAG) | Event Review, File Summarization, Video Q&A (Graph RAG) |
For a full list of supported platforms, see the Supported Platforms section of the VSS documentation.
Get started with visual agentic AI
The new VSS Blueprint 2.4 release brings new visual agentic AI capabilities to the edge, improvements that boost Q&A accuracy and cross-camera understanding, and expanded platform support. The enhancements to knowledge graph creation and traversal improve Q&A accuracy and enable cross-camera queries.
For edge deployments and alerting use cases, the Event Reviewer feature is a way to use VSS as an intelligent add-on to CV pipelines for low-latency alerts. Platform support has been extended to include NVIDIA RTX Pro and NVIDIA Jetson Thor.
To quickly get started with the VSS Blueprint, use an NVIDIA Brev Launchable. The launchable provides fast one-click deployment and Jupyter notebooks that walk through how to launch VSS, access the Web UI, and use the VSS REST APIs. Visit the NVIDIA-AI-Blueprints/video-search-and-summarization GitHub repo for more technical resources such as training notebooks and reference code. For more technical questions, visit the NVIDIA Developer Forum.
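As a rough sketch of what driving VSS over REST might look like, the example below uploads a file and requests a summary with the `requests` library. The endpoint paths, port, and field names are placeholders; refer to the VSS API reference in the notebooks and documentation for the actual schema.

```python
import requests

BASE = "http://localhost:8100"  # placeholder VSS API address

# Upload a video file, then request a summary of it.
with open("warehouse.mp4", "rb") as f:
    upload = requests.post(f"{BASE}/files", files={"file": f}).json()

summary = requests.post(
    f"{BASE}/summarize",
    json={"id": upload["id"], "prompt": "Summarize notable forklift activity."},
).json()
print(summary)
```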
For details about production deployments and CSPs, see the Cloud section of the VSS documentation.
