Architecting GPUaaS for Enterprise AI On-Prem


AI is evolving rapidly, and software engineers no longer have to memorize syntax. However, thinking like an architect and understanding the technology that allows systems to run securely at scale is becoming increasingly valuable.

This post also marks about a year in my role as an AI Solutions Engineer at Cisco. I work with customers every day across different verticals — healthcare, financial services, manufacturing, law firms — and they are all trying to answer largely the same set of questions: what should our AI strategy be, which use cases are worth pursuing, should we run in the cloud or on-prem, what will it cost, and how do we keep it secure?

These are the real, practical constraints that show up immediately once you try to operationalize AI beyond a POC.

Recently, we added a Cisco UCS C845A to one of our labs. It has 2x NVIDIA RTX PRO 6000 Blackwell GPUs, 3.1TB of NVMe, ~127 allocatable CPU cores, and 754GB of RAM. I decided to build a shared internal platform on top of it — giving teams a consistent, self-service environment to run experiments, validate ideas, and build hands-on GPU experience.

I deployed the platform as a Single Node OpenShift (SNO) cluster and layered a multi-tenant GPUaaS experience on top. Users reserve capacity through a calendar UI, and the system provisions an isolated ML environment prebuilt with PyTorch/CUDA, JupyterLab, VS Code, and more. Inside that environment, users can run on-demand inference, iterate on model training and fine-tuning, and prototype production-grade microservices.

This post walks through the architecture — how scheduling decisions are made, how tenants are isolated, and how the platform manages itself. The decisions that went into this lab platform are the same ones any organization faces when it gets serious about AI in production.

Multi-agent architectures, self-service experimentation, secure multi-tenancy, cost-predictable GPU compute: all of it starts with getting the platform layer right.

High-level platform architecture diagram. Image created by author.

Initial Setup

Before there’s a platform, there’s a bare metal server and a blank screen.

Bootstrapping the Node

The node ships with no operating system. When you power it on, you're dropped into a UEFI shell. For OpenShift, installation typically starts in the Red Hat Hybrid Cloud Console via the Assisted Installer. The Assisted Installer handles cluster configuration through a guided setup flow and, once complete, generates a discovery ISO — a bootable RHEL CoreOS image preconfigured for your environment. Map the ISO to the server as virtual media through the Cisco IMC, set the boot order, and power on. The node phones home to the console, and you can kick off the installation. The node writes RHCOS to NVMe and bootstraps. Within a couple of hours you have a running cluster.

This workflow assumes internet connectivity, pulling images from Red Hat's registries during the install. That's not always an option. Many of the customers I work with operate in air-gapped environments where nothing touches the public internet. The process there is different: generate ignition configs locally, download the OpenShift release images and operator bundles ahead of time, mirror everything into a local Quay registry, and point the install at that. Both paths get you to the same place. The assisted install is far easier; the air-gapped path is what production looks like in regulated industries.
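For reference, the mirroring step is typically driven by the oc-mirror plugin and an ImageSetConfiguration file. The sketch below is illustrative only: the registry URL, OpenShift channel, catalog, and package names are assumptions you would adjust for your own release and environment.

apiVersion: mirror.openshift.io/v1alpha2
kind: ImageSetConfiguration
storageConfig:
  registry:
    imageURL: quay.internal.example.com/mirror/oc-mirror-metadata   # local Quay registry (hypothetical)
mirror:
  platform:
    channels:
      - name: stable-4.17                                           # illustrative OpenShift release channel
        type: ocp
  operators:
    - catalog: registry.redhat.io/redhat/certified-operator-index:v4.17
      packages:
        - name: gpu-operator-certified                              # NVIDIA GPU Operator; verify the package name for your catalog

You feed this to oc-mirror, push the result into the local Quay registry, and point the installation at that mirror as described above.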

Configuring GPUs with the NVIDIA GPU Operator

Once the GPU Operator is installed (this happens automatically with the Assisted Installer), I configured how the two RTX PRO 6000 Blackwell GPUs are presented to workloads through two ConfigMaps in the nvidia-gpu-operator namespace.

The first — custom-mig-config — defines physical partitioning. In this case it's a mixed strategy: GPU 0 is partitioned into 4x 1g.24gb MIG slices (~24GB of dedicated memory each), while GPU 1 stays whole for workloads that need the full ~96GB. MIG partitioning is real hardware isolation. You get dedicated memory, compute units, and L2 cache per slice. Workloads see MIG instances as separate physical devices.
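For a sense of what that looks like, here is a sketch of a mixed MIG configuration in the MIG Manager's config format. Treat it as illustrative rather than a copy of the lab's exact manifest; the profile name and layout are assumptions consistent with the description above.

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-config
  namespace: nvidia-gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # Mixed strategy: slice GPU 0, leave GPU 1 whole
      custom-mixed:
        - devices: [0]
          mig-enabled: true
          mig-devices:
            "1g.24gb": 4        # four ~24GB slices on GPU 0
        - devices: [1]
          mig-enabled: false    # GPU 1 keeps the full ~96GB

The ClusterPolicy's MIG manager is pointed at this ConfigMap, and labeling the node (for example nvidia.com/mig.config=custom-mixed) selects which profile is applied.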

The second — device-plugin-config — configures time-slicing, which allows multiple pods to share the same GPU or MIG slice through rapid context switching. I set 4 replicas per whole GPU and 2 per MIG slice. This is what enables running multiple inference containers side by side inside a single session.
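The time-slicing side follows the device plugin's sharing configuration format. Again a hedged sketch: the profile key is arbitrary (it is referenced from the ClusterPolicy), and the replica counts simply mirror the numbers above.

apiVersion: v1
kind: ConfigMap
metadata:
  name: device-plugin-config
  namespace: nvidia-gpu-operator
data:
  rtx-pro-6000: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4                  # one whole GPU advertises 4 schedulable slots
          - name: nvidia.com/mig-1g.24gb
            replicas: 2                  # each MIG slice advertises 2 slots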

Foundational Storage

The 3.1TB of NVMe is managed by the LVM Storage Operator (lvms-vg1 StorageClass). I created two PVCs as part of the initial provisioning process — a volume backing PostgreSQL and persistent storage for OpenShift's internal image registry.
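Provisioning against that StorageClass is a standard PVC. A minimal sketch for the PostgreSQL volume; the name, namespace, and size are placeholders rather than the lab's actual values.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data               # hypothetical name
  namespace: gpuaas-platform        # hypothetical management namespace
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: lvms-vg1
  resources:
    requests:
      storage: 50Gi                 # illustrative size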

With the OS installed, network prerequisites met (DNS, IP allocation, all required A records — not covered in this article), GPUs partitioned, and storage provisioned, the cluster is ready for the application layer.

System Architecture

This brings us to the main topic: the system architecture. The platform follows a declarative, desired-state design, with the PostgreSQL database as the single source of truth.

In the platform management namespace, there are four always-on deployments:

  • Portal app: a single container running the React UI and FastAPI backend
  • Reconciler (controller): the control loop that continuously converges cluster state to match the database
  • PostgreSQL: persistent state for users, reservations, tokens, and audit history
  • Cache daemon: a node-local service that pre-stages large model artifacts and inference engines so users can start quickly (pulling a 20GB vLLM image over a corporate proxy can take hours)

A quick note on the development lifecycle, because it's easy to overcomplicate shipping Kubernetes systems. I write and test code locally, but the images are built in the cluster using OpenShift builds (BuildConfigs) and pushed to the internal registry. The deployments themselves just point at those images.

The first time a component is introduced, I apply the manifests to create the Deployment/Service/RBAC. After that, most changes are just building a new image in-cluster, then triggering a restart so the Deployment pulls the updated image and rolls forward:

oc rollout restart deployment/<deployment-name> -n <namespace>

That's the loop: commit → in-cluster build → internal registry → restart/rollout.

The Scheduling Plane

This is the user-facing entry point. Users see the resource pool (GPUs, CPU, memory), pick a time window, select their GPU allocation mode (more on this later), and submit a reservation.

GPUs are expensive hardware with a real cost per hour whether they're in use or not. The reservation system treats calendar time and physical capacity as a combined constraint. It's the same way you'd book a conference room, except this room has 96GB of VRAM and costs considerably more per hour.

Under the hood, the system checks overlapping reservations against pool capacity, using advisory locks to prevent double booking. Essentially it's just adding up reserved capacity and subtracting it from total capacity. Each reservation moves through a lifecycle: APPROVED → ACTIVE → COMPLETED, with CANCELED and FAILED as terminal states.

The FastAPI server itself is intentionally thin. It validates input, persists the reservation, and returns. It never talks to the Kubernetes API.

The Control Plane

At the heart of the platform is the controller. It's Python-based and runs in a continuous loop on a 30-second cadence. You can think of it like a cron job in terms of timing, but architecturally it's a Kubernetes-style controller responsible for driving the system toward a desired state.

The database holds the desired state (reservations with time windows and resource requirements). The reconciler reads that state, compares it against what actually exists in the Kubernetes cluster, and converges the two. There are no concurrent API calls racing to mutate cluster state; just one deterministic loop making the minimum set of changes needed to reach the desired state. If the reconciler crashes, it restarts and continues exactly where it left off, because the source of truth (the desired state) stays intact in the database.

Each reconciliation cycle evaluates four concerns in order:

  1. Stop expired or canceled sessions and delete the namespace (which cascades cleanup of all resources inside it).
  2. Repair failed sessions and remove orphaned resources left behind by partially completed provisioning.
  3. Start eligible sessions when their reservation window arrives — provision, configure, and hand the workspace to the user.
  4. Maintain the database by expiring old tokens and enforcing audit log retention.

Starting a session is a multi-step provisioning sequence, and each step is idempotent, meaning it’s designed to be safely re-run if interrupted midway:

Controller in depth. Image created by author.

The reconciler is the only component that talks to the Kubernetes API.

Garbage collection is also baked into the same loop. At a slower cadence (~5 minutes), the reconciler sweeps for cross-namespace orphans such as stale RBAC bindings, leftover OpenShift security context entries, namespaces stuck in Terminating, or namespaces that exist in the cluster but have no matching database record.

This self-healing isn't theoretical. For instance, we had a power supply failure on the node that took the cluster down mid-session; when it came back, the reconciler resumed its loop, detected the state discrepancies, and self-healed without manual intervention.

The Runtime Plane

When a reservation window starts, the user opens a browser and lands in a full VS Code workspace (code-server) pre-loaded with the complete AI/ML stack, plus kubectl access scoped to their session namespace.

Workspace screenshot. Image taken by author.

Popular inference engines such as vLLM, Ollama, TGI, and Triton are already cached on the node, so deploying a model server is a one-liner that starts in seconds. There's 600GB of persistent NVMe-backed storage allocated to the session, including a 20GB home directory for notebooks and scripts and a 300GB model cache.
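To make that concrete, here is roughly what the manifest behind such a one-liner looks like inside a session. The image tag, model, and resource request are assumptions; with a whole-GPU reservation you would request nvidia.com/gpu instead of a MIG slice.

apiVersion: v1
kind: Pod
metadata:
  name: vllm-demo
  labels:
    app: vllm-demo
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest                    # pre-cached on the node
      args: ["--model", "Qwen/Qwen2.5-7B-Instruct"]     # illustrative model
      ports:
        - containerPort: 8000                           # OpenAI-compatible API
      resources:
        limits:
          nvidia.com/mig-1g.24gb: 1                     # one time-sliced MIG slot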

Each session is a fully isolated Kubernetes namespace — its own blast-radius boundary with dedicated resources and zero visibility into any other tenant's environment. The reconciler provisions namespace-scoped RBAC granting full admin rights inside that boundary, so users can create and delete pods, deployments, services, routes, secrets — whatever the workload requires. But there's no cluster-level access. Users can read their own ResourceQuota to see their remaining budget, but they can't modify it.
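That namespace-scoped RBAC is the standard pattern of binding Kubernetes' built-in admin ClusterRole inside the session namespace only. A sketch with a hypothetical user and namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: session-admin
  namespace: session-alice            # hypothetical session namespace
subjects:
  - kind: User
    name: alice                       # hypothetical portal user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: admin                         # built-in role, scoped here to a single namespace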

ResourceQuota enforces a hard ceiling on everything. A runaway training job can't OOM the node. A rogue container can't fill the NVMe. LimitRange injects sane defaults into every container automatically, so users can kubectl run without specifying resource requests. A proxy ConfigMap is also injected into the namespace so user-deployed containers get corporate network egress without manual configuration.
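A LimitRange along these lines is what makes a bare kubectl run work without resource flags. The default values below are assumptions, not the platform's actual numbers.

apiVersion: v1
kind: LimitRange
metadata:
  name: session-defaults
  namespace: session-alice            # hypothetical session namespace
spec:
  limits:
    - type: Container
      defaultRequest:                 # injected when a container omits requests
        cpu: 500m
        memory: 1Gi
      default:                        # injected when a container omits limits
        cpu: "2"
        memory: 8Gi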

Users deploy what they need (inference servers, databases, custom services), and the platform handles the guardrails.

When the reservation window ends, the reconciler deletes the namespace and everything inside it.

GPU Scheduling

Node multi-tenancy diagram. Image created by author.

Now the fun part — GPU scheduling and actually running hardware-accelerated workloads in a multi-tenant environment.

MIG & Time-slicing

We covered the MIG configuration in the initial setup, but it's worth revisiting from a scheduling perspective. GPU 0 is partitioned into 4x 1g.24gb MIG slices — each with ~24GB of dedicated memory, enough for most 7B–14B parameter models. GPU 1 stays whole for workloads that need the full ~96GB of VRAM: model training, full-precision inference on 70B+ models, or anything that simply doesn't fit in a slice.

The reservation system tracks these as distinct resource types. Users book either nvidia.com/gpu (whole) or nvidia.com/mig-1g.24gb (up to 4 slices). The ResourceQuota for each session hard-denies the opposite type. If you reserved a MIG slice, you physically cannot request a whole GPU, even if one is sitting idle. In a mixed MIG environment, letting a session accidentally consume the wrong resource type would break the capacity math for every other reservation on the calendar.
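In quota terms, the hard denial is just a zero for the resource type the user did not reserve. A sketch for a single MIG-slice reservation, with illustrative CPU and memory ceilings:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: session-quota
  namespace: session-alice                 # hypothetical session namespace
spec:
  hard:
    requests.nvidia.com/mig-1g.24gb: "1"   # the reserved slice
    requests.nvidia.com/gpu: "0"           # whole GPUs explicitly denied
    requests.cpu: "16"
    requests.memory: 64Gi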

In our configuration, 1 whole GPU appears as 4 schedulable resources. Each MIG slice appears as 2.

What that means is a user can reserve one physical GPU and run up to 4 concurrent GPU-accelerated containers inside their session — a vLLM instance serving gpt-oss, an Ollama instance with Mistral, a TGI server running a reranker, and a custom service orchestrating across all three.

Two Allocation Modes

At reservation time, users choose how their GPU budget is initially distributed between the workspace and user-deployed containers.

Interactive ML — The workspace pod gets a GPU (or MIG slice) attached directly. The user opens Jupyter, imports PyTorch, and has immediate CUDA access for training, fine-tuning, or debugging. Additional GPU pods can still be spawned via time-slicing, but the workspace consumes one of the virtual slots.

Inference Containers — The workspace is lightweight with no GPU attached. All time-sliced capacity is available for user-deployed containers. With a whole-GPU reservation, that's 4 full slots for inference workloads.

There is a real throughput trade-off with time-slicing: workloads share VRAM and compute bandwidth. For development, testing, and validating multi-service architectures, which is exactly what this platform is for, it's the right trade-off. For production latency-sensitive inference where every millisecond of p99 matters, you'd use dedicated slices 1:1 or whole GPUs.

GPU “Tokenomics”

One of the first questions in the introduction was about cost. To answer that, you have to start with what the workload actually looks like in production.

What Real Deployments Look Like

When I work with customers on their inference architecture, nobody is running a single model behind a single endpoint. The pattern that keeps emerging is a fleet of models sized to the task. You might have a 7B parameter model handling simple classification and extraction, which runs comfortably on a MIG slice; a 14B model doing summarization and general-purpose chat; a 70B model for complex reasoning and multi-step tasks; and maybe a 400B model for the hardest problems where quality is non-negotiable. Requests get routed to the appropriate model based on complexity, latency requirements, or cost constraints. You're not paying 70B-class compute for a task a 7B can handle.

In multi-agent systems, this gets more interesting. Agents subscribe to a message bus and sit idle until called upon — a pub-sub pattern where context is shared with the agent at invocation time and the pod is already warm. There's no cold-start penalty because the model is loaded and the container is running. An orchestrator agent evaluates the inbound request, routes it to a specialist agent (retrieval, code generation, summarization, validation), collects the results, and synthesizes a response. Four or five models collaborating on a single user request, each running in its own container within the same namespace, communicating over the internal Kubernetes network.

Network policies add another dimension. Not every agent needs access to every tool. Your retrieval agent can talk to the vector database. Your code-execution agent can reach a sandboxed runtime. But the summarization agent has no business touching either; it receives context from the orchestrator and returns text. Network policies enforce these boundaries at the cluster level, so tool access is controlled by infrastructure, not application logic.
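As a sketch of that boundary (labels and namespace are hypothetical), a policy like this admits only the retrieval agent to the vector database; pods that don't match the from selector are denied ingress to it.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vector-db-ingress
  namespace: session-alice             # hypothetical session namespace
spec:
  podSelector:
    matchLabels:
      app: vector-db                   # the protected tool
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: retrieval-agent    # the only agent allowed in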

This is the workload profile the platform was designed for. MIG slicing lets you right-size GPU allocation per model: a 7B doesn't need 96GB of VRAM. Time-slicing lets multiple agents share the same physical device. Namespace isolation keeps tenants separated while agents within a session communicate freely. The architecture directly supports these patterns.

Quantifying It

To move from architecture to business case, I developed a framework that reduces infrastructure cost to a single comparable unit: cost per million tokens. Each token carries its amortized share of hardware capital (including workload mix and redundancy), maintenance, power, and cooling. The numerator is your total annual cost. The denominator is how many tokens you actually process, which is entirely a function of utilization.

Utilization is the most powerful lever on per-token cost. It doesn't reduce what you spend; the hardware and power bills are fixed. What it does is spread those fixed costs across more processed tokens. A platform running at 80% utilization produces tokens at nearly half the unit cost of one running at 40%. Same infrastructure, dramatically different economics. This is why the reservation system, MIG partitioning, and time-slicing matter beyond UX: they exist to keep expensive GPUs processing tokens during as many available hours as possible.
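To make that concrete with purely hypothetical numbers: if the fully loaded annual cost of the node is $150,000 and it processes 300 billion tokens a year, the unit cost is $150,000 ÷ 300,000M ≈ $0.50 per million tokens. Halve the utilization (150 billion tokens against the same fixed bill) and the unit cost doubles to roughly $1.00 per million.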

Because the framework is algebraic, you can also solve in the other direction. Given a known token demand and a budget, solve for the infrastructure required and immediately see whether you're over-provisioned (burning money on idle GPUs), under-provisioned (queuing requests and degrading latency), or right-sized.

For the cloud comparison, providers have already baked their utilization, redundancy, and overhead into per-token API pricing, so the two numbers are directly comparable. For consistent enterprise GPU demand, the kind of steady-state inference traffic these multi-agent architectures generate, on-prem wins.

However, for testing, demos, and POCs, cloud is cheaper.

Engineering teams often have to justify spend to finance with clear, defensible numbers. The tokenomics framework bridges that gap.

Conclusion

At the start of this post I listed the questions I hear from customers consistently — AI strategy, use cases, cloud vs. on-prem, cost, security. All of them eventually require the same thing: a platform layer that can schedule GPU resources, isolate tenants, and give teams a self-service path from experiment to production without waiting on infrastructure.

That's what this post walked through. Not a product and not a managed service, but an architecture built on Kubernetes, PostgreSQL, Python, and the NVIDIA GPU Operator — running on a single Cisco UCS C845A with two NVIDIA RTX PRO 6000 Blackwell GPUs in our lab. It's a practical starting point that addresses scheduling, multi-tenancy, cost modeling, and the day-2 operational realities of keeping GPU infrastructure reliable.

Scale this out to multiple Cisco AI Pods and the scheduling plane, reconciler pattern, and isolation model carry over directly. The foundation is the same.

If you're working through these same decisions — how to schedule GPUs, how to isolate tenants, how to build the business case for on-prem AI infrastructure — I'd welcome the conversation.
