Building Scalable AI on Enterprise Data with NVIDIA Nemotron RAG and Microsoft SQL Server 2025

At Microsoft Ignite 2025, the vision for an AI-ready enterprise database becomes a reality with the announcement of Microsoft SQL Server 2025, giving developers powerful new tools like built-in vector search and native SQL APIs for calling external AI models. NVIDIA has partnered with Microsoft to seamlessly connect SQL Server 2025 with the NVIDIA Nemotron RAG collection of open models. This enables you to build high-performance, secure AI applications on your data in the cloud or on-premises.

Retrieval-augmented generation (RAG) is one of the most effective ways for enterprises to put their data to use. RAG grounds AI in live, proprietary data without the immense cost and complexity of retraining a model from scratch. Yet the effectiveness of RAG relies on compute-intensive steps, one of which is vector embedding generation. This creates a significant performance bottleneck on traditional CPU infrastructure.

This challenge is compounded by the complexity of deployment at scale and the need for model flexibility. Enterprises require a portfolio of embedding models to balance accuracy, speed, and cost for different tasks.

This post details the new NVIDIA reference architecture that solves this problem. It’s built on SQL Server 2025 and Llama Nemotron Embed 1B v2, part of the Nemotron RAG family. It explains how this integration lets you call the Nemotron RAG model directly from your SQL Server database, turning it into a high-performance AI application engine. The implementation is based on Azure Cloud and Azure Local to cover the key SQL Server usage scenarios in the cloud and on-premises.

Solving enterprise AI RAG challenges with Nemotron RAG and SQL Server 2025

Connecting SQL Server 2025 to the flexible, accelerated NVIDIA AI engine with Nemotron RAG solves the core enterprise AI RAG challenges: performance, deployment, flexibility, and security.

Resolve RAG performance bottlenecks

This architecture solves the primary RAG performance bottleneck by offloading embedding generation from CPUs to NVIDIA GPUs using Llama Nemotron Embed 1B v2. This is a state-of-the-art open model for creating highly accurate embeddings optimized for retrieval tasks. It offers multilingual and cross-lingual text question-answering retrieval with long context support and optimized data storage.

Llama Nemotron Embed 1B v2 is part of Nemotron RAG, a collection of extraction, embedding, and reranking models fine-tuned with the Nemotron RAG datasets and scripts to achieve the best accuracy.

On the database side, SQL Server 2025 delivers seamless, high-performance data retrieval with vector search, powered by native vector distance functions. When hosting embedding models locally, you eliminate network overhead and cut latency, two key factors in delivering performance improvements.
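As a minimal, hypothetical sketch of what native vector search looks like, the following query ranks rows by cosine distance to a query vector. The dbo.Documents table, its columns, and the tiny three-dimensional vector are illustrative only; real embeddings have hundreds or thousands of dimensions.

```sql
-- Minimal sketch: rank rows by cosine distance to a query vector.
-- dbo.Documents and its Embedding VECTOR(3) column are illustrative.
DECLARE @query VECTOR(3) = CAST('[0.1, 0.7, 0.2]' AS VECTOR(3));

SELECT TOP (3)
       DocID,
       Title,
       VECTOR_DISTANCE('cosine', Embedding, @query) AS Distance
FROM dbo.Documents
ORDER BY Distance;
```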

Deploy AI models as simple, containerized endpoints

Deployment is where NVIDIA NIM microservices come in. NIM microservices are prebuilt, production-ready containers designed to streamline the deployment of the latest optimized AI models, like NVIDIA Nemotron RAG, across any NVIDIA-accelerated infrastructure, whether in the cloud or on-premises. With NIM, you can deploy AI models as simple, containerized endpoints without having to manage complex libraries or dependencies.

Also, data residency and compliance are addressed through locally hosted models powered by NIM microservices. Ease of use is another key advantage. The prebuilt nature of NIM combined with native SQL REST APIs significantly reduces the learning curve, making it easier to bring AI closer to the data customers already have.

Maintain security and flexibility

This architecture provides a portfolio of state-of-the-art Nemotron RAG models while keeping proprietary data secure inside your SQL Server database. NIM microservices are designed for enterprise-grade security and backed by NVIDIA enterprise support. All communication between NIM microservices and SQL Server is further secured using end-to-end HTTPS encryption.

Nemotron RAG and Microsoft SQL Server 2025 reference architecture

The Nemotron RAG and SQL Server 2025 reference architecture details the implementation of the solution using the Llama Nemotron Embed 1B v2 embedding model, delivered as a NIM microservice. This enables enterprise-grade, secure, GPU-accelerated RAG workflows directly from SQL Server on Azure Cloud or Azure Local.

For the complete code, deployment scripts, and detailed walkthroughs for this solution, see NVIDIA NIM with SQL Server 2025 AI on Azure Cloud and Azure Local.

Core architecture components

Figure 1 shows the three core architecture components and the flow between them, which are also described in detail below.

Pipeline image showing NVIDIA NIM and SQL Server 2025 with three main areas (left to right): SQL Server 2025 AI, ACA On-premises, NIM Repository. Arrows indicate the flow of HTTPS requests/responses and image pulls between these areas.
Figure 1. This architecture consists of three core components that work together

SQL Server 2025: The AI-ready database

The foundation of this solution is SQL Server 2025, which introduces transformative capabilities that act as the engine for in-database AI (a minimal T-SQL sketch follows the list):

  • Native vector data type: This feature allows you to securely store vector embeddings directly alongside structured data. It eliminates the need for a separate vector database, simplifying your architecture, reducing data movement, and enabling hybrid searches such as finding products that are both “trainers” (vector search) and “in stock” (structured filter).
  • Vector distance search: You can now perform similarity searches directly inside SQL Server 2025 using built-in functions. This lets you rank results by closeness in embedding space, enabling use cases like semantic search, recommendation systems, and personalization, all without leaving the database.
  • Create external model: Register and manage external AI models (for example, NIM microservices) as first-class entities in SQL Server 2025. This provides a seamless way to orchestrate inference workflows while keeping governance and security centralized.
  • Generate embeddings: Use the AI_GENERATE_EMBEDDINGS function to create embeddings for text or other data directly from T-SQL. This function calls external REST APIs under the hood, enabling real-time embedding generation without complex integration steps.
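To make these capabilities concrete, here is a minimal T-SQL sketch that ties them together. The endpoint URL, the table, and the 1,024-dimension configuration are illustrative assumptions rather than values from the reference architecture; the vector dimension must match whatever your embedding model is configured to return.

```sql
-- Minimal sketch: register a NIM embedding endpoint, store vectors,
-- and run a hybrid search. URL, names, and dimensions are illustrative.

-- 1. Register the NIM endpoint as an external model (OpenAI-compatible API).
CREATE EXTERNAL MODEL NemotronEmbed
WITH (
    LOCATION = 'https://nim.contoso.local/v1/embeddings',
    API_FORMAT = 'OpenAI',
    MODEL_TYPE = EMBEDDINGS,
    MODEL = 'nvidia/llama-3.2-nv-embedqa-1b-v2'
);

-- 2. Store embeddings alongside structured data with the native vector type.
CREATE TABLE dbo.ProductCatalog (
    ProductID   INT PRIMARY KEY,
    Description NVARCHAR(MAX),
    InStock     BIT,
    Embedding   VECTOR(1024)  -- must match the dimension the model returns
);

-- 3. Embed a search phrase, then run a hybrid query: rank by cosine
--    distance (vector search) while filtering on stock (structured filter).
DECLARE @q VECTOR(1024) =
    AI_GENERATE_EMBEDDINGS(N'lightweight running trainers' USE MODEL NemotronEmbed);

SELECT TOP (5)
       ProductID,
       Description,
       VECTOR_DISTANCE('cosine', Embedding, @q) AS Distance
FROM dbo.ProductCatalog
WHERE InStock = 1
ORDER BY Distance;
```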

NVIDIA NIM microservices: The accelerated AI engine

The Nemotron RAG family of open models, including the Llama Nemotron Embed 1B v2 model used in this reference architecture, is delivered as production-ready NVIDIA NIM microservices that run in standard Docker containers.

This approach simplifies deployment and ensures compatibility across cloud and local Windows or Linux environments with NVIDIA GPUs. The models can be deployed on Azure Container Apps or on-premises with Azure Local. This containerized delivery supports both automatic and manual scaling strategies and provides the “ground-to-cloud” flexibility needed for use with SQL Server 2025.

  • Cloud scale: You can deploy NIM microservices to Azure Container Apps (ACA) with serverless NVIDIA GPUs. This approach abstracts away all infrastructure management. You get on-demand, GPU-accelerated inference that scales to zero with per-second billing, optimizing costs and simplifying operations.
  • On-premises: For maximum data sovereignty and low latency, you can run the same NIM container on-premises using Azure Local with NVIDIA GPUs. Azure Local extends Azure’s management plane to your own hardware, enabling you to run AI directly against your local data while meeting strict compliance or performance needs.

The link between SQL Server and NIM microservices

The communication bridge between SQL Server and the NIM microservice is simple and robust, built on standard, secure web protocols (a sketch of an equivalent request follows the list).

  • OpenAI-compatible API: NVIDIA NIM exposes an OpenAI-compatible API endpoint. This allows SQL Server 2025 to use its native functions to call the NIM service just as it would call an OpenAI service, ensuring seamless, out-of-the-box integration.
  • Standard POST requests: SQL Server 2025 issues standard HTTPS POST requests to retrieve results such as embeddings.
  • Secure and flexible communication: The design uses TLS certificates for end-to-end encryption, establishing mutual trust and ensuring all responses are secure, performant, and standards-compliant for both cloud and on-premises deployments. This provides a significant advantage over a remote-only model, as you keep full control, and proprietary data never leaves your secure environment.
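For illustration only, the following sketch issues the same kind of OpenAI-style POST manually. It assumes sp_invoke_external_rest_endpoint is available and enabled on your instance; the URL and model name are placeholders, and input_type is an NVIDIA extension to the embeddings request used by the retrieval NIMs.

```sql
-- Hypothetical sketch of the OpenAI-style request SQL Server sends to NIM.
-- Assumes sp_invoke_external_rest_endpoint is available and enabled;
-- the URL and model name are placeholders.
DECLARE @payload NVARCHAR(MAX) = N'{
    "input": ["How do I return a product?"],
    "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
    "input_type": "query"
}';
DECLARE @response NVARCHAR(MAX);

EXEC sp_invoke_external_rest_endpoint
    @url      = N'https://nim.contoso.local/v1/embeddings',
    @method   = N'POST',
    @payload  = @payload,
    @response = @response OUTPUT;

-- The response is OpenAI-style JSON: {"data": [{"embedding": [...]}], ...}
SELECT @response;
```

In normal use you never issue this call yourself; registering the endpoint with CREATE EXTERNAL MODEL and calling AI_GENERATE_EMBEDDINGS handles it under the hood.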

While this reference architecture features the state-of-the-art Nemotron RAG models, it can be extended to enable SQL Server 2025 to call any NIM microservice to power a broad range of AI applications, such as text summarization, content classification, or predictive analysis, all performed directly on your data in SQL Server 2025.

Two methods of deployment

This post covers the two primary deployment patterns for this solution: on-premises (using Azure Local) and cloud (using Azure Container Apps). Both patterns rely on the same core mechanism: SQL Server 2025 calling the NVIDIA NIM microservice endpoint using the standard OpenAI-compatible protocol.

On-premises implementation with Azure Local

The on-premises implementation ensures maximum flexibility, supporting practical combinations of Windows and Linux systems running on NVIDIA GPU-enabled servers, such as:

  • Windows/Ubuntu Server or Windows/Ubuntu on-premises virtual machine running both SQL Server and NVIDIA NIM
  • Windows running SQL Server and Ubuntu running NVIDIA NIM, or vice versa

To deploy, leverage Azure Local, the new Microsoft offering that extends the Azure Cloud platform directly into your on-premises environment. For full installation instructions for establishing secure communication, including NIM deployment details, visit NVIDIA/GenerativeAIExamples on GitHub. Note that this solution was validated using SQL Server 2025 (RC 17.0.950.3).

Cloud implementation

The cloud deployment leverages NVIDIA Llama Nemotron Embedding NIM hosted on Azure Container Apps (ACA), Microsoft Azure’s fully managed serverless container platform. ACA fully supports and extends the benefits of the proposed architecture. To learn more, see NVIDIA NIM with Microsoft SQL Server 2025 AI on Azure Cloud and Azure Local on the NVIDIA/GenerativeAIExamples GitHub repo.

This serverless approach provides several key benefits for deploying your AI applications with data stored in SQL Server 2025.

To speed up NIM replica startup, we recommend using ACA volumes backed by Azure File Share or ephemeral storage to persist the local NIM cache. The number of replicas is managed automatically through ACA HTTP scaling, allowing you to scale to zero.

ACA applications can host multiple versions and types of NIM microservices in parallel, each accessible through a distinct URL configured in SQL Server, as in the sketch below.
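As a hypothetical illustration, two embedding NIM deployments could be registered side by side, each behind its own ACA URL (the URLs below are placeholders):

```sql
-- Hypothetical sketch: two NIM deployments registered in parallel,
-- each reachable through its own ACA URL. URLs are placeholders.
CREATE EXTERNAL MODEL NemotronEmbedV2
WITH (
    LOCATION = 'https://nim-nemotron.example.azurecontainerapps.io/v1/embeddings',
    API_FORMAT = 'OpenAI',
    MODEL_TYPE = EMBEDDINGS,
    MODEL = 'nvidia/llama-3.2-nv-embedqa-1b-v2'
);

CREATE EXTERNAL MODEL RetrievalQAE5
WITH (
    LOCATION = 'https://nim-e5.example.azurecontainerapps.io/v1/embeddings',
    API_FORMAT = 'OpenAI',
    MODEL_TYPE = EMBEDDINGS,
    MODEL = 'nvidia/nv-embedqa-e5-v5'
);
```

A query can then target either deployment simply by changing the USE MODEL clause in AI_GENERATE_EMBEDDINGS.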

Solution demo 

For complete instructions for running the full end-to-end workflow, check out the demo, SQL Server 2025 AI functionality with NVIDIA Retrieval QA using E5 Embedding v5.

Specifically, the demo SQL scripts guide you through the following steps (a condensed sketch of the core steps follows the list):

  • Create the AdventureWorks sample database
  • Create the ProductDescriptionEmbeddings demo table
  • Execute demo scripts to populate embeddings through the NVIDIA NIM integration
  • Confirm and visualize stored embeddings using Select_Embeddings.sql
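A condensed, hypothetical sketch of the population and verification steps might look like the following. The demo table’s exact schema and the registered model name (here reusing NemotronEmbed from the earlier sketch) are assumptions; Production.ProductDescription comes from the AdventureWorks sample database.

```sql
-- Condensed sketch of the demo flow: embed AdventureWorks product
-- descriptions through the registered NIM model, then inspect the results.
-- The demo table schema and model name are illustrative assumptions.
INSERT INTO dbo.ProductDescriptionEmbeddings (ProductDescriptionID, Embedding)
SELECT pd.ProductDescriptionID,
       AI_GENERATE_EMBEDDINGS(pd.Description USE MODEL NemotronEmbed)
FROM Production.ProductDescription AS pd;

-- Verify and visualize stored embeddings (the role of Select_Embeddings.sql).
SELECT TOP (10) ProductDescriptionID, Embedding
FROM dbo.ProductDescriptionEmbeddings;
```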

This workflow demonstrates the new SQL Server 2025 AI capabilities, using the built-in T-SQL AI features VECTOR_DISTANCE, AI_GENERATE_EMBEDDINGS, and CREATE EXTERNAL MODEL, which form the foundation of the new AI integration in SQL Server 2025.

Get started with SQL Server 2025 and NVIDIA Nemotron RAG

The integration of Microsoft SQL Server 2025 with NVIDIA Nemotron RAG, delivered as production-grade NVIDIA NIM microservices, offers a seamless “ground-to-cloud” path for building high-performance AI applications. By combining the SQL Server 2025 built-in AI capabilities with the NVIDIA GPU-optimized inference stack, you can now solve the primary RAG performance bottleneck, bringing AI directly to your data: securely, efficiently, and without the operational complexity of managing data pipelines.

This joint reference architecture demonstrates how you can build RAG applications that generate embeddings, perform semantic search, and invoke inference services directly inside SQL Server 2025. This approach delivers the flexibility to deploy state-of-the-art models such as NVIDIA Nemotron wherever the data lives, on Azure Cloud or on-premises with Azure Local, while preserving full data sovereignty.

Ready to start building? Get all deployment scripts, code samples, and detailed walkthroughs for both cloud and on-premises scenarios through NVIDIA NIM with Microsoft SQL Server 2025 AI on Azure Cloud and Azure Local on the NVIDIA/GenerativeAIExamples GitHub repo.


