Generative AI is opening new possibilities for analyzing existing video streams. Video analytics is evolving from counting objects to turning raw video footage into real-time understanding, enabling more actionable insights.
The NVIDIA AI Blueprint for video search and summarization (VSS) brings together vision language models (VLMs), large language models (LLMs), and retrieval-augmented generation (RAG) with optimized ingestion, retrieval, and storage pipelines. Part of NVIDIA Metropolis, it supports both stored and real-time video understanding.
In previous releases, the VSS Blueprint introduced capabilities such as efficient video ingestion, context-aware RAG, a computer vision (CV) pipeline, and audio transcription. To learn more about these foundational features, see Advance Video Analytics AI Agents Using the NVIDIA AI Blueprint for Video Search and Summarization and Build a Video Search and Summarization Agent with NVIDIA AI Blueprint.
This post explains the new features in the latest VSS Blueprint 2.4 release, which delivers four major upgrades that enable developers to:
- Improve physical world understanding: VSS is now integrated with NVIDIA Cosmos Reason, a state-of-the-art reasoning VLM that delivers advanced physical AI reasoning and scene understanding for richer video analytics and insights.
- Enhance Q&A: New knowledge graph features and cross-camera support include multi-stream Q&A, improved knowledge graph generation, agent-based graph traversal, and Neo4j and ArangoDB backends with cuGraph acceleration.
- Unlock generative AI at the edge with Event Reviewer: Review events of interest found by CV pipelines and provide contextual insights with generative AI. New endpoints enable VSS to be configured as an intelligent add-on to CV pipelines, which is ideal for low-latency edge deployments.
- Deploy with expanded hardware support: VSS is now available on multiple platforms built with NVIDIA Blackwell, including NVIDIA Jetson Thor, NVIDIA DGX Spark, and NVIDIA RTX Pro 6000 workstation and server editions.
Improve physical world understanding with Cosmos Reason
Cosmos Reason is an open, customizable, 7-billion-parameter state-of-the-art reasoning VLM for physical AI. It allows vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. Cosmos Reason enables developers to build AI agents that can see, analyze, and act in the physical world by analyzing petabytes of recorded videos or millions of live streams. The Cosmos Reason NIM is now also available, delivering a production-ready VLM endpoint for building intelligent visual AI agents with fast, scalable reasoning.
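As a rough illustration of what calling such an endpoint can look like, the sketch below sends a single frame to an OpenAI-compatible chat endpoint. The base URL, model name, and prompt are placeholders; check the Cosmos Reason NIM documentation for the values used in your deployment.

```python
import base64
from openai import OpenAI

# Placeholder endpoint and model id; consult the Cosmos Reason NIM docs for
# the values used in your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

with open("frame.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/cosmos-reason1-7b",  # placeholder model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe any safety risks visible in this frame."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```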
Video analytics AI agents built with the VSS Blueprint 2.4 can use Cosmos Reason to extract accurate and rich dense captions, enumerate objects of interest with set-of-mark prompting, provide valuable insights, and perform root-cause analysis on footage from multiple industries, including manufacturing lines, logistics warehouses, retail stores, and transportation networks.
VSS 2.4 supports native integration with Cosmos Reason. This support tightly couples the video ingestion process with the VLM to allow for efficient batching and speedups not possible with REST API-based VLM interfaces. Cosmos Reason's small 7B-parameter footprint makes it easy to use for edge deployments as well as in the cloud. Cosmos Reason is fully customizable and can be fine-tuned with proprietary data.
Enhance Q&A with knowledge graph and cross-camera support
Ingesting large amounts of video is challenging because the data is unstructured, continuous, and very high-volume, which makes it difficult to search, index, or summarize efficiently. A single video can span hours of footage, include multiple events happening at once, and require heavy compute just to decode and analyze. Standard computer vision pipelines often can't keep up at scale, producing isolated detections without the broader context needed to understand what's actually happening.
VSS solves this problem by using a GPU-accelerated video ingestion pipeline. As a video file or live stream comes in, it's broken into smaller chunks and the Cosmos Reason VLM generates a rich description or caption of each chunk. An LLM then extracts the important information from the VLM-generated captions to build a knowledge graph that captures the key details of the video. Once the knowledge graph is built, an LLM traverses the graph to answer user questions about the videos.
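The sketch below is a highly simplified, hypothetical version of that flow, with `caption_chunk` and `extract_triples` standing in for the Cosmos Reason VLM and the LLM extraction step. It only illustrates how per-chunk captions become graph facts that an LLM can answer from, not the actual VSS pipeline.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    relation: str
    obj: str

def ingest(video_chunks, caption_chunk, extract_triples):
    """Caption each chunk, then fold the extracted facts into one graph."""
    graph: list[Triple] = []
    for chunk in video_chunks:
        caption = caption_chunk(chunk)          # VLM: dense caption of the chunk
        graph.extend(extract_triples(caption))  # LLM: entities and relationships
    return graph

def answer(question, graph, llm):
    """Have an LLM answer the question from the collected graph facts."""
    facts = "\n".join(f"{t.subject} {t.relation} {t.obj}" for t in graph)
    return llm(f"Facts:\n{facts}\n\nQuestion: {question}")
```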


VSS 2.4 enhances Q&A accuracy and cross-camera understanding with:
- Entity deduplication within the knowledge graph
- Agent-based graph traversal
- CUDA-accelerated graph database
In previous releases of the VSS Blueprint, building the knowledge graph could result in duplicate nodes and edges. In VSS Blueprint 2.4, a knowledge graph post-processing step has been added to remove duplicate entries and merge nodes and edges that are common across videos. This means that common entities, such as the same vehicle moving across multiple cameras, are now merged into a single entity, which improves the ability of VSS to track unique objects as they move throughout a video and across cameras.
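A minimal sketch of what such a deduplication pass could look like is shown below. The node format and the `signature` function (which decides when two nodes refer to the same real-world entity) are assumptions for illustration, not the VSS implementation.

```python
from collections import defaultdict

def dedupe_nodes(nodes, signature):
    """Group nodes by an entity signature and merge their attributes."""
    merged, members = {}, defaultdict(list)
    for node in nodes:
        key = signature(node)  # e.g. derived from appearance or re-ID features
        members[key].append(node["id"])
        merged.setdefault(key, {}).update(node.get("attrs", {}))
    # Map every original node id to its canonical entity so edges can be rewritten.
    remap = {nid: key for key, ids in members.items() for nid in ids}
    return merged, remap

def rewrite_edges(edges, remap):
    """Point edges at canonical entities so duplicate nodes collapse."""
    return [(remap.get(src, src), rel, remap.get(dst, dst)) for src, rel, dst in edges]
```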
Once the knowledge graph has been generated and post-processed, an LLM is used to traverse the graph and gather the information needed to answer the user's questions about the videos.
In VSS 2.4, agent-based reasoning has been introduced for advanced knowledge graph retrieval. If enabled, an LLM-based agent intelligently decomposes the query and then uses a set of tools to search the graph, find relevant metadata, reinspect sampled frames from the video, and iterate if necessary to accurately answer the user's question.
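The loop below sketches this style of agentic retrieval under the assumption of a simple plain-text tool protocol; the tool names (graph search, metadata lookup, frame reinspection) and the prompt format are hypothetical and not the actual VSS agent.

```python
def agentic_qa(question, llm, tools, max_steps=5):
    """Let the LLM pick a tool each step until it decides it can answer."""
    observations = []
    for _ in range(max_steps):
        decision = llm(
            f"Question: {question}\n"
            f"Observations so far: {observations}\n"
            f"Available tools: {list(tools)}\n"
            "Reply with 'tool: <name> <args>' or 'answer: <final answer>'."
        )
        if decision.startswith("answer:"):
            return decision.removeprefix("answer:").strip()
        name, _, args = decision.removeprefix("tool:").strip().partition(" ")
        observations.append(tools[name](args))  # e.g. search_graph, get_metadata, reinspect_frames
    return llm(f"Answer the question '{question}' using: {observations}")
```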


| Benchmark | VSS 2.3.1 accuracy | VSS 2.4 accuracy | Accuracy change |
| --- | --- | --- | --- |
| LongVideoBench | 48.17% | 68.32% | +20.15% |
| MLVU | 61.24% | 71.44% | +10.2% |
With the knowledge graph post-processing that merges entities and relationships, combined with the advanced agent-based retrieval, it's now possible to answer questions that span multiple camera streams.


To provide developers with the latest tools, the supported graph database backends have been expanded to include ArangoDB. Users can now configure VSS to use either the Neo4j or ArangoDB graph database backend. ArangoDB brings a set of enhancements, including CUDA-accelerated graph functions to speed up knowledge graph generation. For more details on the ArangoDB integration with VSS, see Generate a Video Knowledge Graph: NVIDIA VSS Blueprint with GraphRAG on ArangoDB.
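For illustration only, the snippet below shows how an application might open a connection to either backend with the standard Python drivers. VSS itself selects its backend through its own deployment configuration, so the hosts and credentials here are placeholders.

```python
from arango import ArangoClient   # pip install python-arango
from neo4j import GraphDatabase   # pip install neo4j

def connect_graph_backend(backend: str):
    """Return a database handle for the chosen backend (placeholder credentials)."""
    if backend == "neo4j":
        return GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    if backend == "arangodb":
        return ArangoClient(hosts="http://localhost:8529").db(
            "vss", username="root", password="password"
        )
    raise ValueError(f"unknown graph backend: {backend}")
```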
These new knowledge graph generation and agentic Q&A features are best suited for multi-GPU deployments that can handle large LLMs and multiple concurrent VLM requests.
Augment CV pipelines with VSS Event Reviewer
For smaller-scale and edge deployments, the new VSS Event Reviewer feature introduces API endpoints that make it easy to integrate VSS into existing computer vision pipelines for low-latency alerts and direct VLM Q&A on video segments.
Instead of running VSS continuously on all files or streams, Event Reviewer allows VSS to act as an intelligent add-on that delivers VLM insights only for key moments. This approach greatly reduces compute costs, making VSS well-suited for lightweight deployments and edge platforms.
While standard CV pipelines excel at detecting objects and people or applying analytics to identify events, such as possible vehicle collisions, they often generate false positives and lack deeper scene understanding.
VSS can be used to augment these CV pipelines by analyzing short video clips flagged by the CV system, reviewing the detected events, and uncovering additional insights that traditional methods may miss.
Figure 4 shows how VSS can augment an existing pipeline. The computer vision pipeline represents any proprietary system that can take in video files or streams and output short clips of interest. The Event Reviewer endpoints can then be called to pass these short clips to VSS to generate alerts and follow-up Q&A with a VLM.


To demonstrate this feature, a sample DeepStream detection pipeline using GroundingDINO is provided in the VSS GitHub repository. This example pipeline ingests a video, runs detection, and then outputs clips when the number of detected objects is greater than a set threshold. The goal of this pipeline is to find the most important events in the video that should be inspected by VSS with a VLM.
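A simplified, framework-agnostic stand-in for that logic is sketched below; `detect` is a hypothetical per-frame detector, whereas the real sample uses DeepStream with GroundingDINO.

```python
def find_event_clips(frames, detect, fps=30.0, threshold=3, min_gap_s=2.0):
    """Return (start_s, end_s) ranges where per-frame detections exceed the threshold."""
    clips, start = [], None
    for i, frame in enumerate(frames):
        busy = len(detect(frame)) >= threshold
        t = i / fps
        if busy and start is None:
            start = t                      # event begins
        elif not busy and start is not None:
            clips.append((start, t))       # event ends
            start = None
    if start is not None:
        clips.append((start, len(frames) / fps))
    # Merge clips separated by less than min_gap_s so one event yields one clip.
    merged = []
    for s, e in clips:
        if merged and s - merged[-1][1] < min_gap_s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```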
VSS then processes each short clip with the VLM by answering a set of yes/no questions defined by the user. These responses are converted to true/false states for each question and can be used to generate low-latency alerts. Once the short clip has been processed by VSS, you can ask more detailed follow-up questions.
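The snippet below sketches that conversion with a made-up question set and naive answer parsing; it is not the Event Reviewer schema, just an illustration of mapping yes/no responses to alert flags.

```python
ALERT_QUESTIONS = [
    "Is a person inside the restricted area?",
    "Is there a vehicle collision?",
]

def to_alert_states(vlm_answers):
    """Map each yes/no answer to a true/false alert flag, keyed by question."""
    return {
        question: answer.strip().lower().startswith("yes")
        for question, answer in zip(ALERT_QUESTIONS, vlm_answers)
    }

# Example: to_alert_states(["Yes, a worker entered the zone.", "No."])
# -> {"Is a person inside the restricted area?": True,
#     "Is there a vehicle collision?": False}
```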
This approach selectively uses the VLM only on clips of interest, as determined by a lightweight detection pipeline. It can drastically reduce compute cost by reducing VLM calls and freeing up the GPU for other workloads.
Deploy flexibly with expanded hardware support
VSS Blueprint 2.4 fully supports several NVIDIA Blackwell platforms, including NVIDIA RTX Pro 6000 server and workstation editions and NVIDIA Jetson Thor for edge deployments. Support for NVIDIA DGX Spark is coming soon.
| | 1 NVIDIA Jetson Thor | 1-2 NVIDIA RTX PRO 6000 Blackwell WS/SE | 4-8 NVIDIA RTX PRO 6000 Blackwell WS/SE |
| --- | --- | --- | --- |
| LLM | N/A | Llama 3.1 8B | Llama 3.1 70B |
| VLM | Cosmos Reason 1 | Cosmos Reason 1 | Cosmos Reason 1 |
| Recommended usage | Event Review | Event Review, Video Summarization, Video Q&A (Vector RAG) | Event Review, File Summarization, Video Q&A (Graph RAG) |
For a full list of supported platforms, see the Supported Platforms section of the VSS documentation.
Get started with visual agentic AI
The new VSS Blueprint 2.4 release brings new visual agentic AI capabilities to the edge, improvements that boost Q&A accuracy and cross-camera understanding, and expanded platform support. The enhancements to knowledge graph creation and traversal improve Q&A accuracy and enable cross-camera queries.
For edge deployments and alerting use cases, the Event Reviewer feature is a way to use VSS as an intelligent add-on to CV pipelines for low-latency alerts. Platform support has been extended to include NVIDIA RTX Pro and NVIDIA Jetson Thor.
To quickly get started with the VSS Blueprint, use an NVIDIA Brev Launchable. The launchable provides fast one-click deployment and Jupyter notebooks that walk through how to launch VSS, access the Web UI, and use the VSS REST APIs. Visit the NVIDIA-AI-Blueprints/video-search-and-summarization GitHub repo for more technical resources such as training notebooks and reference code. For more technical questions, visit the NVIDIA Developer Forum.
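As a rough sketch of what driving VSS over REST might look like, the example below uploads a file and requests a summary with the `requests` library. The endpoint paths, port, and field names are placeholders; refer to the VSS API reference in the notebooks and documentation for the actual schema.

```python
import requests

BASE = "http://localhost:8100"  # placeholder VSS API address

# Upload a video file, then request a summary of it.
with open("warehouse.mp4", "rb") as f:
    upload = requests.post(f"{BASE}/files", files={"file": f}).json()

summary = requests.post(
    f"{BASE}/summarize",
    json={"id": upload["id"], "prompt": "Summarize notable forklift activity."},
).json()
print(summary)
```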
For details about production deployments and CSPs, see the Cloud section of the VSS documentation.
