Speed up Token Production in AI Factories Using Unified Services and Real-Time AI



In today’s AI factory environment, performance is not theoretical. It’s economic, competitive, and existential. A 1% drop in usable GPU time can mean tens of millions of tokens lost per hour. Minutes of congestion can cascade into hours of recovery. A rack-level power oversubscription can result in stranded power and reduced tokens per watt, silently eroding factory output at scale. As AI factories scale to hundreds of GPUs running diverse mission-critical workloads, the cost of unpredictable congestion, power constraints, long-tail latency, and limited visibility grows exponentially.

Operations teams and administrators need more than dashboards. They need flexibility and foresight.

NVIDIA launched NVIDIA Mission Control as an integrated software stack for AI factories built on NVIDIA reference architectures, codifying NVIDIA best practices with a unified control plane. Mission Control version 3.0 expands further, introducing architectural flexibility, multi-org isolation, intelligent power orchestration, and predictive AIOps to detect anomalies in operations and maximize token production.

Flexible software that unlocks velocity

NVIDIA Mission Control 3.0 provides newfound agility through a new layered, API-driven architecture built on modular services, replacing previously tightly coupled stacks that required synchronized releases and complicated validation across hardware platforms. New components, such as automated network management and the domain power service, which provides a new management plane for power optimizations, further extend the Mission Control stack by bringing additional modular services into a single control plane.

By combining open components with a modular design, Mission Control enables rapid support for the latest NVIDIA hardware while allowing OEM system providers and independent software vendors (ISVs) to integrate Mission Control capabilities directly into their own ecosystems. The result is that enterprises now have more flexibility and choice in their own software stacks, making it easier to customize solutions to meet their unique business and technology challenges.

Isolation in a multi-tenant world

One technological challenge many organizations face is supporting multi-org isolation within a centralized AI factory. As AI factories evolve from research and experimentation into production-grade, mission-critical environments, shared infrastructure across multiple teams requires strong organizational isolation and secure multi-tenancy.

The improved Mission Control control plane transforms the AI factory management stack into a software-defined, virtualized architecture. Mission Control services are decoupled from physical management nodes and deployed on virtual machine (KVM)-based platforms using NVIDIA-provided automation. While compute racks and management nodes are dedicated per org, network switches are shared and require additional isolation for multi-tenancy. The shared fabric architecture of NVIDIA Spectrum-X Ethernet is logically segmented using VXLAN, and NVIDIA Quantum InfiniBand is segmented using PKeys.

This architecture reduces the physical management infrastructure footprint, establishes hard tenant isolation, and creates a secure foundation for multi-organization AI factories. This in turn lowers the total cost of ownership: operators gain the flexibility to onboard multiple orgs onto shared infrastructure, reducing the need to buy and operate multiple clusters and lowering the physical footprint, while still providing each org with strong isolation and self-service.
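The per-org segmentation described above can be illustrated with a small sketch. This is a hypothetical helper, not a Mission Control API: the org names, VNI base, and PKey base are illustrative assumptions.

```python
# Hypothetical sketch: assigning per-org network segments for hard
# isolation. VNI and PKey ranges are illustrative, not product defaults.

def assign_segments(orgs, vni_base=10_000, pkey_base=0x1000):
    """Give each org a dedicated VXLAN VNI (for the Spectrum-X Ethernet
    fabric) and a dedicated InfiniBand partition key (PKey)."""
    segments = {}
    for i, org in enumerate(orgs):
        segments[org] = {
            "vxlan_vni": vni_base + i,      # logical Ethernet segment
            "ib_pkey": hex(pkey_base + i),  # InfiniBand partition
        }
    return segments

plan = assign_segments(["research", "prod-inference", "finetuning"])
for org, seg in plan.items():
    print(org, seg)
```

Because each org gets a unique VNI and PKey, traffic from one org's segment is never visible to another's, which is the "hard tenant isolation" property on the shared switches.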

Power: The invisible constraint

Another growing concern for AI factory token production is fixed power envelopes due to economic constraints such as fixed utilities and regulatory compliance. Each GPU generation delivers more performance, but facility power is naturally limited by a combination of the existing data center infrastructure and the available power grid. The challenge is clear: How do you increase token output and rack density without exceeding power limits?

Power management in previous iterations of Mission Control helped organizations responsibly manage complex power considerations, but it was reactive. Jobs were scheduled first; power policies were enforced afterward. While this was a significant step toward balancing power and performance, more dynamic solutions were needed to manage this at scale, especially across mixed Slurm and Kubernetes environments. This is where Mission Control evolves with version 3.0.

By incorporating the domain power service directly into Mission Control, power becomes a first-class scheduling primitive that helps organizations optimize token production within their power policies. This power management service enables power-aware workload placement across traditional Slurm workloads and Kubernetes-native workloads orchestrated by NVIDIA Run:ai, which is integrated into the Mission Control stack. The domain power service also supports MAX-P and MAX-Q profiles for training and inference, and provides rack- and topology-aware reservation steering by leveraging Mission Control integration with facility building management systems.
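The idea of power as a scheduling primitive can be sketched as a simple admission check: a job is only placed on a rack whose remaining power budget covers the job's estimated draw under its chosen profile. This is a minimal illustration, not the domain power service API; rack capacities and per-GPU draw figures are made-up assumptions.

```python
# Illustrative power-aware placement: admit a job only where the rack's
# remaining power budget covers its estimated draw. Numbers are assumed.

RACKS = {"rack-1": {"cap_w": 120_000, "used_w": 95_000},
         "rack-2": {"cap_w": 120_000, "used_w": 60_000}}

# Assumed per-GPU draw under each profile (illustrative values only).
PROFILE_DRAW_W = {"MAX-P": 1_000, "MAX-Q": 850}

def place(job_gpus, profile, racks=RACKS):
    """Greedy placement: try the least-loaded rack first."""
    need = job_gpus * PROFILE_DRAW_W[profile]
    for name, r in sorted(racks.items(), key=lambda kv: kv[1]["used_w"]):
        if r["cap_w"] - r["used_w"] >= need:
            r["used_w"] += need  # reserve the power before the job starts
            return name
    return None  # defer: no rack can take the job without breaching its cap

print(place(32, "MAX-Q"))  # prints rack-2, the least-loaded rack
```

The key difference from the reactive model described above: the power check happens before scheduling, so a job that would breach a rack cap is deferred rather than throttled after the fact.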

In one example where NVIDIA ran the MAX-Q profile, the domain power service allowed the data center to run at 85% power with only 7% throughput loss, achieved by dynamically leveraging the power profiles integrated by Mission Control.
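The efficiency gain implied by those numbers can be checked directly: 93% of baseline throughput at 85% of baseline power improves tokens per watt by roughly 9%.

```python
# Back-of-envelope check of the MAX-Q example: 85% power, 7% throughput loss.
power_fraction = 0.85          # facility running at 85% of baseline power
throughput_fraction = 1.0 - 0.07  # 7% throughput loss

tokens_per_watt_gain = throughput_fraction / power_fraction
print(f"tokens/watt vs. baseline: {tokens_per_watt_gain:.3f}x")  # ~1.094x
```

So under a fixed power envelope, trading 7% of throughput for a 15% power reduction is a net win of about 9% in tokens per watt.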

This integration empowers data center operators to define facility constraints while AI practitioners confidently select performance or efficiency modes aligned to their workload priorities. Governance remains centralized while flexibility ensures AI factories can be tuned for the best performance per watt and performance per dollar.

From dashboards to real-time decisions

In addition to providing new services for dynamic power management, Mission Control version 3.0 enhances existing anomaly detection capabilities by integrating with NVIDIA AIOps Collector and Platform Stacks (NACPS) for AI-powered predictive anomaly detection. At the core of NACPS is the AI cluster model, a graph-based representation of infrastructure and workloads that creates a topology-aware view across GPUs, NVIDIA NVLink scale-up, NVIDIA Spectrum-X Ethernet or NVIDIA Quantum InfiniBand East-West scale-out, and NVIDIA BlueField DPU North-South networking. This view is combined with job topology in the cluster model.

NACPS combines unsupervised online machine learning on metrics, natural language processing (NLP)-based analysis of logs to detect unknown issues, supervised learning trained on labeled incidents, and deterministic rule-based guardrails.
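To make the unsupervised-metrics piece concrete, here is a minimal sketch of online anomaly scoring on a single telemetry stream, such as per-GPU power draw. NACPS combines far richer models across the whole cluster graph; this only illustrates the idea with a rolling z-score, and the window and threshold values are assumptions.

```python
# Minimal sketch of online anomaly scoring on one telemetry stream
# (e.g., GPU power draw in watts) using a rolling z-score.
from collections import deque
import math

class RollingZScore:
    def __init__(self, window=60, threshold=4.0):
        self.buf = deque(maxlen=window)  # recent samples
        self.threshold = threshold       # z-score that counts as anomalous

    def observe(self, value):
        """Return True if the sample deviates sharply from recent history."""
        anomalous = False
        if len(self.buf) >= 10:  # need some history before scoring
            mean = sum(self.buf) / len(self.buf)
            var = sum((x - mean) ** 2 for x in self.buf) / len(self.buf)
            std = math.sqrt(var) or 1e-9  # avoid division by zero
            anomalous = abs(value - mean) / std > self.threshold
        self.buf.append(value)
        return anomalous

det = RollingZScore()
readings = [700.0 + i % 5 for i in range(30)] + [1250.0]  # spike at the end
flags = [det.observe(v) for v in readings]
print(flags[-1])  # True: the power spike is flagged
```

A per-stream detector like this only produces raw anomalies; the value described in this section comes from correlating those anomalies across the topology so one faulty switch does not surface as hundreds of unrelated alerts.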

Telemetry streams continuously from GPUs, switches, hosts, network interface cards (NICs), and schedulers into NACPS. Events and anomalies are automatically correlated across layers, enabling context-driven root-cause analysis while reducing alert noise. Instead of isolated metrics, the system understands relationships.

When anomalies are detected, Mission Control can trigger automated remediation workflows, such as automated hardware recovery that works in concert with Slurm integration in NVIDIA Base Command Manager or with NVIDIA Run:ai for Kubernetes workloads.

The system doesn’t just monitor infrastructure. It understands it and acts on it.

Operators no longer must chase symptoms. They gain foresight.

A different kind of KPI: Utilization vs. token production

As AI factory operations continue to evolve, operations teams need to consider a different kind of KPI. Traditional data centers were optimized for utilization, but AI factories must be optimized for token production.

For AI factories to be optimized for token production, enterprises need to consider metrics such as token production per GPU and per rack, as well as token production per watt and per megawatt. Every inefficiency directly reduces overall token output. If congestion in the network fabric isn’t detected and mitigated, a single rack unexpectedly exceeds its power constraint, or a compute node experiences an anomaly mid-job, the AI factory loses out on token generation and potential revenue.
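The metrics above derive straightforwardly from a handful of cluster counters. This sketch uses invented illustrative numbers, not measured factory data, and the function name is hypothetical.

```python
# Hedged sketch: deriving token-production KPIs from assumed cluster
# counters. All input numbers are illustrative, not measured data.

def factory_kpis(tokens_per_hour, gpus, racks, avg_gpu_power_w):
    """Compute token production per GPU, per rack, per watt, per MW."""
    total_power_w = avg_gpu_power_w * gpus
    return {
        "tokens_per_gpu_hour": tokens_per_hour / gpus,
        "tokens_per_rack_hour": tokens_per_hour / racks,
        "tokens_per_watt_hour": tokens_per_hour / total_power_w,
        "tokens_per_mw_hour": tokens_per_hour / (total_power_w / 1e6),
    }

kpis = factory_kpis(tokens_per_hour=5_000_000_000, gpus=1024,
                    racks=16, avg_gpu_power_w=1_000)
for name, value in kpis.items():
    print(f"{name}: {value:,.1f}")
```

Tracking these ratios over time, rather than raw utilization, surfaces exactly the losses this section describes: a congested fabric or a throttled rack shows up as a drop in tokens per rack or per megawatt even when utilization still looks healthy.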

However, when the AI factory is working intelligently, it converts every megawatt into tokens with precision, maximizing output.

Start with Mission Control

Mission Control 3.0 is designed around minimizing inefficiencies and increasing token output for AI factory operators. By correlating telemetry across domains, orchestrating power intelligently, modularizing the architecture for agility, and enhancing autonomous remediation with AI, it transforms infrastructure from a passive platform into an active participant in performance optimization.

Resources:

Stay tuned for our latest release notes and implementation guides for NVIDIA Mission Control 3.0.

You can also check out the on-demand replay of the NVIDIA GTC 2026 session with Eli Lilly & Company to hear firsthand insights into architecting and deploying high-performance AI infrastructure with powerful, intelligent software.


