Organizations are increasingly in search of ways to extract insights from video, audio, and other complex data sources. Retrieval-augmented generation (RAG) enables generative AI systems to make use of proprietary enterprise data. However, incorporating video content into these workflows introduces new technical hurdles, such as efficient ingestion, indexing, and maintaining compliance across diverse sources.
This blog post introduces an integrated approach for enriching video analysis and summarization using the NVIDIA AI Blueprint for video search and summarization (VSS) and the NVIDIA AI Blueprint for retrieval-augmented generation (RAG). By composing these workflows, developers can complement video understanding with trusted, context-rich enterprise data, unlocking deeper insights for business-critical applications.
In this post, you'll learn how to:
- Integrate VSS and RAG Blueprints for multimodal search and summarization.
- Enrich video analytics with contextual enterprise knowledge.
- Architect scalable, modular workflows for real-time video Q&A and summarization.
- Apply these solutions to real-world use cases across industries.
Following up on our earlier post about the VSS Blueprint, we'll now explain how combining VSS with RAG improves video analysis. This combination provides more accurate, context-aware insights for enterprise AI applications.
What are NVIDIA AI Blueprints?
NVIDIA AI Blueprints are customizable reference workflows for building generative AI pipelines. Developers can use NVIDIA AI Blueprints to build multimodal RAG pipelines. The RAG Blueprint is built on NVIDIA NeMo Retriever models for continuously indexing multimodal documents for fast and accurate semantic search at enterprise scale. The VSS Blueprint ingests massive volumes of streaming or archival video for search, summarization, interactive Q&A, and event-triggered actions such as alerting.
A real-world application: Building AI-powered health insights with RAG and VSS Blueprints
The following is an example comparing raw VSS Blueprint output to context-enriched insights produced with the RAG Blueprint. The input video shows someone making breakfast. This use case illustrates how AI can analyze what a person is eating for breakfast and comment on the health of their eating habits. In the first example, the AI generates a video summary without any additional RAG information; in the second example, the AI uses data from RAG, resulting in a more detailed and informative summary. The first screen capture shows the VSS Blueprint's default video event summarization of a breakfast preparation routine. The output clusters key actions under categories like ingredient selection, cooking techniques, dietary insights, hygiene practices, and presentation suggestions. The default VSS output is factual and descriptive, but it doesn't connect observed activities to nutritional value or healthy habits.


The following figure shows a summarization enriched by the Wiki page for a healthy diet. After integrating with the RAG Blueprint, VSS draws on these dietary guidelines and best practices to add context. The enriched summary describes the actions and highlights the benefits of choosing whole grains, the importance of fiber, the nutritional value of dairy, and the role of hygiene in food safety.


Figure 2. VSS summary enriched with RAG, connecting observed actions to nutritional value and healthy habits
By connecting video understanding to external knowledge, the enriched summary helps viewers make informed decisions about food choices and healthy habits. It translates video content into practical insights that support everyday well-being, making nutrition information accessible and actionable for all.
Deployment steps
To deploy this solution, follow these steps.
NOTE: This example assumes that the RAG Blueprint is already installed and accessible via a remote endpoint.
- Download and deploy the RAG Blueprint from https://github.com/NVIDIA-AI-Blueprints/rag.
- Clone the video-search-and-summarization repo:
$ git clone https://github.com/NVIDIA-AI-Blueprints/video-search-and-summarization.git
- Edit the src/vss-engine/docker/Dockerfile file to apply the integration patches:
diff --git a/src/vss-engine/docker/Dockerfile b/src/vss-engine/docker/Dockerfile
index 58b25e3..e1df783 100644
--- a/src/vss-engine/docker/Dockerfile
+++ b/src/vss-engine/docker/Dockerfile
@@ -17,7 +17,7 @@ RUN --mount=type=bind,source=binaries/gradio_videotimeline-1.0.2-py3-none-any.wh
pip install --no-deps /tmp/gradio_videotimeline-1.0.2-py3-none-any.whl
-RUN git clone https://github.com/NVIDIA/context-aware-rag.git -b v1.0.0 /tmp/vss-ctx-rag
+RUN git clone https://github.com/NVIDIA/context-aware-rag.git -b dev/vss-external-rag-support-v2 /tmp/vss-ctx-rag
ARG TARGETARCH
RUN pip install /tmp/vss-ctx-rag --no-deps && \
if [ "$TARGETARCH" = "amd64" ]; then \
- Proceed with the VSS deployment steps in src/vss-engine/README.md to deploy the patched VSS Blueprint.
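Because this example assumes the RAG Blueprint is already reachable at a remote endpoint, it can help to verify connectivity before deploying the patched VSS Blueprint. The sketch below is a minimal check under that assumption; the `/v1/health` path is a placeholder for whatever health route your RAG deployment actually exposes.

```python
import urllib.request
import urllib.error

def check_rag_endpoint(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if the RAG server answers at its health route.

    The "/v1/health" path is a placeholder; substitute the health route
    exposed by your RAG Blueprint deployment.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

If the check fails, confirm the RAG server address before wiring it into the VSS environment variables.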
Test the integration
The following code snippet shows the kubectl exec syntax for running the decorated prompt in the VSS pod in Kubernetes. It analyzes a meal preparation video and enriches the result with relevant dietary guidelines.
import subprocess, textwrap

deployment_id = "vss-vss-deployment-595d5b4ccb-8678v"
vid_id = "6482b573-3aa6-4231-b981-a3e75806826b"

def run_in_vss(pod, cmd):
    # Execute a shell command inside the vss container of the given pod
    subprocess.run(
        ["kubectl", "exec", pod, "-c", "vss", "--",
         "/bin/bash", "-c", cmd],
        check=True, text=True)

prompt = textwrap.dedent("""
    Summarize key events only.
    Breakfast nutritional guidelines?
    """)

cmd = f"""python3 via_client_cli.py summarize \
    --id {vid_id} --model vila-1.5 --enable-chat \
    --chunk-duration 10 \
    --caption-summarization-prompt "{prompt}"
"""

run_in_vss(deployment_id, cmd)
Everything inside the tags is sent to the external RAG server as a retrieval sub-prompt. The returned context is inserted into the enrichment prompt, set in the tunable VECTOR_RAG_ENRICHMENT_PROMPT, before LLM generation.
The tunable enrichment prompt used in the dietary example is shown below.
Here is the summary generated about the meal preparation video:
{original_response}
Here is additional dietary and food safety information:
{external_context}
Please enrich the summary by naturally incorporating relevant dietary facts, food safety guidelines, and practical advice from the external context. Connect observed actions in the video to their health benefits, such as highlighting the value of specific ingredients, cooking methods, or hygiene practices. Ensure the enrichment is contextual, informative, and supports everyday healthy choices.
Do not include any introductory phrases, notes, explanations, or comments about how the inputs were combined. Do not reference the original summary or external context. Only provide the enriched summary itself, organized as bullet points under the categories: Ingredient Selection, Cooking Techniques, Dietary Insights, Hygiene Practices, and Presentation Suggestions.
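As a sketch of how such a template can be applied, the snippet below fills the `{original_response}` and `{external_context}` placeholders with Python's `str.format`. The abbreviated template text and the `build_enrichment_prompt` helper are illustrative only, not the Blueprint's actual implementation.

```python
# Abbreviated stand-in for the tunable enrichment prompt shown above.
ENRICHMENT_PROMPT = (
    "Here is the summary generated about the meal preparation video:\n"
    "{original_response}\n\n"
    "Here is additional dietary and food safety information:\n"
    "{external_context}\n\n"
    "Please enrich the summary by naturally incorporating relevant "
    "dietary facts, food safety guidelines, and practical advice."
)

def build_enrichment_prompt(original_response: str, external_context: str) -> str:
    # Substitute both placeholders before sending the prompt to the LLM
    return ENRICHMENT_PROMPT.format(
        original_response=original_response,
        external_context=external_context,
    )
```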
How it works
- Ingestion
  - VSS ingests video streams, creates caption chunks, and indexes the visual metadata.
  - RAG ingests proprietary documents such as manuals, historical event statistics, and media guides into a GPU-accelerated vector store.
- Query flow
  - A user asks, "Am I maintaining a healthy diet today?"
  - VSS surfaces candidate segments of the user's meal.
  - VSS also queries the RAG server to fetch the relevant knowledge indexed from various health guidelines.
- Knowledge fusion
  - The RAG Blueprint retrieves relevant enterprise health knowledge and feeds it to the VSS LLM, which crafts a grounded answer together with the candidate segment from the video.
- Response
  - The final response is anchored in the video data, enriched with relevant external knowledge, and delivered to the user in real time with proper citations.
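The stages above can be sketched as a single function. The three callables are hypothetical stand-ins for the VSS search, RAG retrieval, and LLM fusion steps; they are not actual Blueprint APIs.

```python
def answer_query(user_query, search_video, retrieve_context, generate):
    """Fuse video segments and retrieved knowledge into one grounded answer.

    search_video, retrieve_context, and generate are illustrative stand-ins
    for the VSS, RAG, and LLM components described above.
    """
    segments = search_video(user_query)       # VSS surfaces candidate segments
    context = retrieve_context(user_query)    # RAG fetches relevant knowledge
    return generate(user_query, segments, context)  # LLM fuses both sources
```

In a real deployment, each callable would wrap a network request to the corresponding service.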
VSS and RAG Blueprints integration architecture
Figure 3 shows the modular integration architecture that produces these results.
- VSS ingests video streams, generates captions and metadata, and supports question-answering and summarization over video content.
- The RAG Blueprint is deployed as a stand-alone microservice. It indexes, searches, and retrieves knowledge from enterprise-wide data sources such as text documents, PDFs, tables, and policy manuals.
- VSS and RAG Blueprints communicate over defined APIs. Whenever a prompt includes text inside … tags, the VSS Blueprint sends that sub-prompt to the external RAG server.
- The RAG Blueprint receives the sub-prompt and returns relevant context.
- The VSS Blueprint uses a customizable enrichment prompt to fuse the retrieved context into its final summarization or chat Q&A response.
This modular, API-based integration enables the blueprints to be used together or individually, and to scale independently based on user demand.
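One way to implement the sub-prompt hand-off is a simple regex extraction over the incoming prompt. The tag name is parameterized here because the actual delimiter used by the VSS Blueprint is not shown in this post; `rag` in the test is only an example.

```python
import re

def extract_subprompts(prompt: str, tag: str) -> list[str]:
    """Return the text spans wrapped in <tag>...</tag> inside a prompt.

    The tag name is configurable; the real VSS delimiter may differ.
    """
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    return [match.strip() for match in pattern.findall(prompt)]
```

Each extracted span would be forwarded to the RAG server, and the rest of the prompt handled by VSS as usual.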


Connecting workflows: How composable AI Blueprints support collaboration
By composing multiple NVIDIA AI Blueprints, developers can integrate specialized pipelines, such as video analytics and enterprise retrieval, to solve cross-functional challenges. This modular composability accelerates development while extending functionality beyond what any single blueprint can achieve.
Let’s break down how composability delivers flexible integration, cross-team collaboration, and context-rich results:
- Flexible integration: Combine specialized blueprints, such as VSS for video processing and RAG for knowledge retrieval, to build tailored, scalable solutions.
- Cross-functional collaboration: Distinct blueprints enable cooperation between video engineers, data scientists, and subject-matter experts, enriching video analytics with enterprise knowledge.
- Context-aware results: User queries in VSS Blueprints can use RAG Blueprints to complement video summaries with relevant information from organizational documents, for precise, actionable insights.
The VSS Blueprint processes video streams for detection and captioning, while the RAG Blueprint retrieves relevant information from text and structured data sources. User queries to VSS Blueprints can be forwarded to RAG Blueprints for additional context, and the combined response incorporates both video analysis and enterprise knowledge.
Optimizing for enterprise workflows: The case for dedicated RAG
A key architectural decision, keeping the RAG Blueprint as a separate, standalone server instead of merging all sources such as video and documents into a single pipeline, was driven by several real-world factors:
- Multi-workstream support: The RAG Blueprint serves multiple workflows (search portals, chatbots, dashboards, compliance tools) as a unified knowledge layer. The VSS Blueprint acts as one of many clients accessing this backend.
- Decoupled scaling: The blueprints can be scaled and optimized independently, allowing targeted resource allocation for video and document workloads.
- Rapid innovation and security: Centralized RAG management simplifies updates, patching, and security improvements without affecting VSS deployments.
- Minimal integration overhead: VSS integration requires only the RAG server endpoint and environment variables; there is no need to rebuild or re-index video data for new use cases.
It is important to note that the VSS Blueprint also includes RAG capability. Although the VSS Blueprint can also retrieve enterprise documents, its pipeline is highly tuned for accurate video search and retrieval. Similarly, the RAG Blueprint supports many of the same modalities as the VSS Blueprint, but it is optimized to search and retrieve multilingual, multimodal business documents such as PDFs that include text, tables, and charts. Loosely coupling the pipelines via API calls gives developers a "best of both worlds" experience across two highly specialized pipelines.
Latency impact
We also assessed the performance impact of combining the blueprints for video summarization and Q&A. The total latency includes time spent in VSS operations, time spent in RAG operations, and time spent integrating the results.
The system latency for each use case is shown in Table 1.
In the chat Q&A use case, the addition of RAG input accounts for about 10% of the overall latency. Enriching the video summarization with RAG data incurs about 1% of the overall pipeline latency.


| Pipeline stage | VSS summarization latency (seconds) | VSS chat Q&A latency (seconds) |
|---|---|---|
| RAG retrieval | 1.69 | 1.81 |
| LLM fusion | 1.24 | 1.35 |
| End-to-end | 250.00 | 29.77 |
| VSS summarization / chat Q&A (main task) | 247.07 | 26.61 |

Table 1. Latency breakdown for VSS summarization and chat Q&A with RAG enrichment
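The overhead percentages quoted above can be rechecked directly from the numbers in Table 1:

```python
def rag_overhead_pct(retrieval_s: float, fusion_s: float, end_to_end_s: float) -> float:
    """Percentage of end-to-end latency spent on RAG retrieval plus LLM fusion."""
    return 100 * (retrieval_s + fusion_s) / end_to_end_s

# Values taken from Table 1
summarization_overhead = rag_overhead_pct(1.69, 1.24, 250.00)   # roughly 1%
chat_qa_overhead = rag_overhead_pct(1.81, 1.35, 29.77)          # roughly 10%
```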
How industries are using blueprints to make smarter, faster decisions
From construction sites to forests to stadiums, the integration of VSS and RAG Blueprints through prompt fusion converts raw video into valuable, context-rich insights with minimal additional latency. The following examples highlight how the integration helps address real-world challenges:
- Shimizu implements the technology on construction sites to stream job-site footage, monitor development progress, prevent unsafe behaviors, and improve safety and compliance.
- Cloudian’s HyperScale AIDP forestry management demo deploys VSS and RAG Blueprints for detecting overgrowth and invasive species, immediately retrieving relevant policy documents, and generating actionable reports for fire insurance and compliance.
- Monks uses the solution to quickly generate personalized sports highlights, turning large content libraries into tailored, engaging clips for social and broadcast platforms.


Visit https://build.nvidia.com/blueprints to begin developing your own complex, accelerated pipelines.
