QoS and Fairness for Shared On-Premises GPU Inference Clusters
How to prioritize workloads, prevent noisy-neighbor effects, and align batch policies when multiple teams share the same on-premises GPU fleet without turning operations into a constant negotiation.
Why shared GPUs need explicit quality-of-service design
Once an organization centralizes on-premises inference, the easy part is installing servers. The hard part is sustaining trust between teams when a traffic spike on one workload starves another, or when a batch analytics job consumes so much memory that interactive chat sessions fail. Unlike CPU throttling in Kubernetes, which mostly slows a workload down, GPU sharing exposes sharp failure modes: out-of-memory kills, head-of-line blocking inside inference servers, and opaque queue depths that look fine until demo day.
Quality of service for GPU inference is therefore a product of scheduling policy, server configuration, and organizational agreements expressed as code. Without that trio, “shared cluster” quickly becomes “shared blame.”
Partition strategies: from hard isolation to weighted sharing
The strongest isolation is dedicated hardware per critical service class. That conflicts with cost targets, so platforms often combine physical pools with logical separation. NVIDIA Multi-Instance GPU (MIG), when available on supported accelerators, splits a card into smaller profiles with bounded memory and compute. Where MIG is not an option, separate inference server instances per tier on the same host still reduce cross-talk compared to one monolithic process with a single queue.
Weaker but common patterns include multiple deployments behind a load balancer, each with its own Kubernetes priority class and resource quota. Those controls govern pod scheduling and placement, but they do not automatically translate into fair GPU time inside a long-running inference worker. You still need application-level request classes (interactive, standard batch, and opportunistic) with explicit limits on how many concurrent streams each class may occupy on a worker.
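As a minimal sketch of application-level request classes, the limiter below caps in-flight requests per class inside a single worker process. The class names come from the text above; the numeric caps, `CLASS_LIMITS`, and `ClassLimiter` are hypothetical and would need to be tuned per model and GPU.

```python
import threading
from contextlib import contextmanager

# Hypothetical per-class caps on concurrent in-flight requests for one worker.
CLASS_LIMITS = {"interactive": 8, "standard_batch": 4, "opportunistic": 1}


class ClassLimiter:
    """Caps in-flight requests per request class inside one inference worker."""

    def __init__(self, limits):
        self._limits = dict(limits)
        self._in_flight = {name: 0 for name in limits}
        self._lock = threading.Lock()

    def try_acquire(self, request_class):
        """Return True and take a slot if the class is under its cap."""
        with self._lock:
            if self._in_flight[request_class] >= self._limits[request_class]:
                return False
            self._in_flight[request_class] += 1
            return True

    def release(self, request_class):
        """Give back a slot previously taken with try_acquire."""
        with self._lock:
            self._in_flight[request_class] -= 1

    @contextmanager
    def slot(self, request_class):
        """Yield True if admitted, False otherwise; release only what was taken."""
        admitted = self.try_acquire(request_class)
        try:
            yield admitted
        finally:
            if admitted:
                self.release(request_class)
```

A request handler would wrap the model call in `with limiter.slot("opportunistic") as admitted:` and return an explicit rejection when `admitted` is false, so a saturated class never borrows capacity from another.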
Queueing, backpressure, and user-visible behavior
Fairness is not only about average throughput; it is about what happens when demand exceeds capacity. Define behaviors up front: does interactive traffic preempt in-flight batch work? Are clients told to retry with exponential backoff, or do they sit in a bounded queue with deadlines? HTTP 429 responses and structured error bodies beat silent multi-second stalls.
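One way to make overload visible is a bounded queue that rejects immediately when full and discards requests whose deadline has passed. The sketch below illustrates that idea only; `BoundedDeadlineQueue`, the error body fields, and the fixed one-second retry hint are assumptions, not the behavior of any particular inference server.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Pending:
    request_id: str
    deadline: float  # monotonic-clock time after which the request is useless


class BoundedDeadlineQueue:
    """FIFO queue that rejects when full and skips requests past their deadline."""

    def __init__(self, max_depth):
        self.max_depth = max_depth
        self._items = deque()

    def offer(self, request_id, timeout_s, now=None):
        """Return (status, body): 202 if queued, 429 with a retry hint if full."""
        now = time.monotonic() if now is None else now
        if len(self._items) >= self.max_depth:
            # Hypothetical structured error body; the client should back off.
            return 429, {"error": "queue_full", "retry_after_s": 1}
        self._items.append(Pending(request_id, now + timeout_s))
        return 202, {"queued": request_id}

    def take(self, now=None):
        """Pop the next request still within its deadline.

        Returns (item_or_None, expired), where expired lists the requests
        that were skipped because their deadline had already passed.
        """
        now = time.monotonic() if now is None else now
        expired = []
        while self._items:
            item = self._items.popleft()
            if item.deadline >= now:
                return item, expired
            expired.append(item)
        return None, expired
```

The caller is expected to answer each expired request with an explicit timeout error rather than dropping it silently, which keeps the failure visible to the client.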
Inference servers that support continuous batching need tuning parameters for maximum batch size, waiting time to form a batch, and eviction rules when sequences grow unexpectedly. Those knobs should differ by service class. An internal playbook that lists default values per tier—and who may change them—prevents midnight edits that help one team and surprise another.
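The playbook described above could live in version control as a small table of per-tier defaults plus a validation step. The following is a sketch under stated assumptions: the field names, the numbers, and the owner labels are illustrative and do not correspond to the configuration keys of any specific inference server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchPolicy:
    max_batch_size: int       # upper bound on sequences batched together
    max_wait_ms: int          # how long to wait for a batch to fill
    max_sequence_tokens: int  # sequences growing past this become eviction candidates
    change_owner: str         # who may approve edits to this tier


# Illustrative defaults only; real values depend on model, GPU, and server.
TIER_DEFAULTS = {
    "interactive": BatchPolicy(8, 5, 4096, "platform-oncall"),
    "standard_batch": BatchPolicy(32, 50, 8192, "platform-team"),
    "opportunistic": BatchPolicy(64, 200, 8192, "platform-team"),
}


def validate(policies):
    """Return a list of problems; an empty list means the table is consistent."""
    problems = []
    for tier, p in policies.items():
        if p.max_batch_size < 1:
            problems.append(f"{tier}: max_batch_size must be at least 1")
        if p.max_wait_ms < 0:
            problems.append(f"{tier}: max_wait_ms must not be negative")
        if not p.change_owner:
            problems.append(f"{tier}: change_owner is required")
    return problems
```

Running `validate(TIER_DEFAULTS)` in a pre-merge check is one way to ensure every tier has a named owner before a tuning change lands.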
Observability that engineering and finance can share
GPU utilization percentage alone misleads: a busy GPU may still deliver terrible tail latency if batches are dominated by one tenant. Track per-tenant or per-application signals such as queue time before first token, batch formation time, preemption counts, and out-of-memory events attributed to namespaces or API keys. Export those metrics to your existing Prometheus-style stack and connect them to chargeback or showback dashboards where your organization already runs GPU quotas.
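To illustrate why tenant-level signals matter, the sketch below summarizes queue time and out-of-memory counts per tenant from a list of request events. The event keys (`tenant`, `queue_ms`, `oom`) are hypothetical; in practice these values would come from your metrics pipeline rather than an in-memory list.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def per_tenant_report(events, pct=95):
    """Summarize queue time and OOM counts per tenant.

    events: iterable of dicts with the hypothetical keys
    'tenant', 'queue_ms', and 'oom' (bool).
    """
    queue_ms = defaultdict(list)
    ooms = defaultdict(int)
    for e in events:
        queue_ms[e["tenant"]].append(e["queue_ms"])
        if e["oom"]:
            ooms[e["tenant"]] += 1
    return {
        tenant: {
            "requests": len(samples),
            f"queue_ms_p{pct}": percentile(samples, pct),
            "oom_events": ooms[tenant],
        }
        for tenant, samples in queue_ms.items()
    }
```

A cluster-wide average over the same events would hide the case where one tenant waits far longer than the rest, which is exactly the case a shared platform needs to see.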
Pair technical metrics with periodic capacity reviews that compare growth curves to procurement lead times. QoS policies buy time by smoothing contention, but they cannot create silicon. When sustained contention appears, the right outcome is often a budget conversation, not endless tuning.
When fairness breaks: incident response
When tail latency spikes or out-of-memory events cluster around a single namespace or API key, treat the situation as a capacity and tenancy incident, not only an application defect. Runbooks should record who may temporarily throttle which tenant class, how to drain a node without dropping long-lived sessions where possible, and how internal consumers are notified of degraded service classes. Post-incident reviews should ask whether the failure mode was visible in metrics you already collect; if not, add a signal instead of layering another informal rule.
Correlate inference metrics with upstream dependencies—retrieval latency, authentication, and feature stores—so GPU queues are not blamed for problems rooted elsewhere. The goal is repeatable recovery: the same on-call engineer at three in the morning applies the same documented levers as during business hours.
Governance: SLOs, contracts, and escalation
Document service-level objectives per tier: for example, interactive workloads target bounded latency percentiles during business hours, while batch jobs accept overnight windows. Make those SLOs visible to product owners who sign up consumers. When conflicts arise—two “interactive” teams during the same launch—escalation paths should route through a platform governance forum with authority to reprioritize or temporarily carve out hardware.
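Per-tier SLOs with time windows can be written down as data rather than prose. The sketch below assumes a p95 latency target and a daily window per tier; `TierSLO`, its fields, and the example thresholds are hypothetical and stand in for whatever your governance forum actually agrees on.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class TierSLO:
    tier: str
    latency_ms_p95: float  # target for 95th-percentile latency
    window_start: time     # SLO applies from this local time...
    window_end: time       # ...until this local time (may wrap past midnight)

    def applies_at(self, t):
        """True if local time t falls inside the SLO window."""
        if self.window_start <= self.window_end:
            return self.window_start <= t < self.window_end
        # Overnight window, for example 22:00 to 06:00.
        return t >= self.window_start or t < self.window_end

    def is_met(self, observed_p95_ms, t):
        """Outside the window the SLO does not apply, so it counts as met."""
        return (not self.applies_at(t)) or observed_p95_ms <= self.latency_ms_p95


# Illustrative targets only.
INTERACTIVE = TierSLO("interactive", 800, time(9), time(18))
BATCH = TierSLO("batch", 60_000, time(22), time(6))
```

For example, `INTERACTIVE.is_met(1200, time(12))` evaluates to false because the observation falls inside business hours and exceeds the target, while the same observation at `time(20)` is outside the window and does not count against the tier.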
Automation helps. Keeping rate limits, namespace quotas, and ingress priorities as Git-managed policies makes every change reviewable. Avoid ad hoc kubectl tweaks that fix a demo but leave no audit trail.
Putting it together
Shared on-premises GPU clusters fail socially before they fail technically. Invest early in isolation mechanisms that match your risk, queueing policies that fail visibly, metrics that expose tenant-level experience, and governance that connects SLOs to funding decisions. The result is a platform where teams argue with data instead of anecdotes, and where inference capacity feels predictable even when it is fully subscribed.
Featured image by Avi Waxman on Unsplash.