Rate Limiting and Backpressure for On-Premises AI APIs
Practical patterns for protecting on-premises AI services from overload using rate limiting, backpressure, and load shedding strategies tailored to GPU-bound inference workloads.
Why AI APIs Need Different Protection Strategies
Traditional API rate limiting is well understood for web services: set a requests-per-second cap, return HTTP 429 when exceeded, and let clients retry. But on-premises AI inference APIs have characteristics that make standard rate limiting insufficient and sometimes counterproductive.
AI inference requests are not uniform in cost. A request that generates 50 tokens consumes a fraction of the GPU time and memory that a 4,000-token generation requires. Rate limiting by request count alone allows a small number of expensive requests to starve your GPU while the rate limiter reports headroom. Conversely, a burst of cheap classification requests might trip rate limits despite the GPU being underutilized.
GPU-bound services also fail differently from CPU-bound ones. When a web server is overloaded, response times increase gradually. When a GPU runs out of memory during inference, the process crashes or the request fails outright with an out-of-memory error. There is no graceful degradation curve, only a cliff. This makes proactive protection through backpressure and load shedding essential rather than optional.
Token-Aware Rate Limiting
The most effective rate limiting strategy for AI APIs uses token budgets rather than request counts. Each consumer (team, application, or API key) receives an allocation defined in tokens per minute or tokens per hour, covering both input and output tokens.
Implementing token-aware rate limiting requires estimating the cost of a request before it executes. For input tokens, this is straightforward: tokenize the input prompt and count. Output length is not known until generation finishes, so reserve the request's max_tokens value as the worst-case output cost. When the response completes, reconcile the reservation against actual usage and return unused tokens to the budget.
A practical implementation uses a sliding window token counter per consumer. When a request arrives, check whether the consumer's remaining token budget can accommodate the estimated cost. If yes, deduct the estimated tokens and process the request. If no, return a rate limit error that includes the reset time and remaining budget so the client can make intelligent retry decisions.
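The reserve-then-reconcile flow described above can be sketched as follows. This is a minimal, single-process illustration, not a production limiter: the class and method names (TokenBudget, try_reserve, reconcile) are hypothetical, and a real gateway would likely keep the counters in a shared store.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass
class Reservation:
    timestamp: float
    tokens: int
    expired: bool = False


class TokenBudget:
    """Sliding-window token budget for one consumer (illustrative sketch)."""

    def __init__(self, tokens_per_window, window_seconds=60.0, clock=time.monotonic):
        self.limit = tokens_per_window
        self.window = window_seconds
        self.clock = clock
        self.entries = deque()  # reservations inside the window, oldest first
        self.used = 0

    def _expire(self, now):
        # Drop reservations that have aged out of the window.
        while self.entries and now - self.entries[0].timestamp >= self.window:
            old = self.entries.popleft()
            old.expired = True
            self.used -= old.tokens

    def try_reserve(self, input_tokens, max_tokens):
        """Reserve input_tokens + max_tokens. Returns (reservation or None, info)."""
        now = self.clock()
        self._expire(now)
        cost = input_tokens + max_tokens
        if self.used + cost > self.limit:
            # Seconds until the oldest reservation leaves the window and budget
            # starts to free up; None if nothing is reserved (request too large).
            reset = None
            if self.entries:
                reset = self.window - (now - self.entries[0].timestamp)
            return None, {"remaining": self.limit - self.used, "reset_seconds": reset}
        res = Reservation(now, cost)
        self.entries.append(res)
        self.used += cost
        return res, {"remaining": self.limit - self.used, "reset_seconds": 0.0}

    def reconcile(self, res, actual_tokens):
        """Return the unused part of a reservation once the response completes."""
        self._expire(self.clock())
        if res.expired:
            return  # already removed from the window, nothing to refund
        refund = max(res.tokens - actual_tokens, 0)
        res.tokens -= refund
        self.used -= refund
```

A rejected call returns the remaining budget and a reset hint, which map naturally onto the fields of a rate limit error response.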
Set token budgets based on your GPU capacity. If your inference cluster can sustain 50,000 tokens per minute across all consumers, allocate budgets that sum to 70-80% of that capacity (35,000 to 40,000 tokens per minute) to leave headroom. Use the workload profiling data from your cluster to determine sustainable throughput at acceptable latency levels.
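As a worked example of that arithmetic, the sketch below splits 75% of a 50,000 tokens-per-minute cluster across consumers in proportion to integer weights. The consumer names and weights are invented for illustration.

```python
def allocate_budgets(capacity_tpm, weights, utilization_target=0.75):
    """Split a share of cluster capacity across consumers by integer weight."""
    allocatable = int(capacity_tpm * utilization_target)
    total_weight = sum(weights.values())
    # Integer division keeps budgets whole; any remainder stays as extra headroom.
    return {name: allocatable * w // total_weight for name, w in weights.items()}


# Hypothetical consumers and weights.
budgets = allocate_budgets(50_000, {"chatbot": 3, "search": 2, "batch": 1})
```

Here the three budgets come to 18,750, 12,500, and 6,250 tokens per minute, which sum to 37,500 and leave 12,500 tokens per minute unallocated.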
Implementing Backpressure Mechanisms
Backpressure is the practice of signaling upstream callers to slow down when the system approaches capacity. Unlike rate limiting, which enforces hard boundaries, backpressure provides graduated signals that allow callers to adjust their behavior before hitting limits.
The simplest backpressure mechanism is queue depth monitoring. Most inference servers (vLLM, TGI, Triton) maintain an internal request queue. Expose the current queue depth as a metric and include it in API response headers. Clients that observe increasing queue depth can proactively reduce their request rate without waiting for rate limit errors.
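One way to express this, as a rough sketch: the gateway attaches queue depth headers, and the client turns them into a pacing delay. The header names X-Queue-Depth and X-Queue-Capacity are assumptions made for this example, not a standard, and the linear ramp is just one possible client policy.

```python
def backpressure_headers(queue_depth, queue_capacity):
    """Server side: headers to attach to every API response (names are hypothetical)."""
    return {
        "X-Queue-Depth": str(queue_depth),
        "X-Queue-Capacity": str(queue_capacity),
    }


def client_delay(headers, max_delay=2.0, start_ratio=0.5):
    """Client side: seconds to wait before the next request, based on queue fill."""
    depth = int(headers.get("X-Queue-Depth", 0))
    capacity = int(headers.get("X-Queue-Capacity", 0))
    if capacity <= 0:
        return 0.0  # no signal available, do not slow down
    ratio = min(depth / capacity, 1.0)
    if ratio <= start_ratio:
        return 0.0
    # Linear ramp from 0 at start_ratio up to max_delay at a full queue.
    return max_delay * (ratio - start_ratio) / (1.0 - start_ratio)
```

With these defaults a client sees no delay until the queue is half full, then slows progressively as it fills.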
A more sophisticated approach uses adaptive concurrency limits. Instead of a fixed rate limit, dynamically adjust the number of concurrent requests allowed based on observed latency. The TCP Vegas algorithm provides a good model: when latency rises above the lowest latency observed so far, reduce the concurrency limit; when latency stays close to that baseline, gradually increase it. Libraries like Netflix's concurrency-limits implement this pattern and can be adapted for AI inference workloads.
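The following is a simplified, Vegas-inspired sketch of that idea. It is not the API of Netflix's concurrency-limits library; the class name, parameters, and the alpha and beta thresholds are illustrative assumptions.

```python
class AdaptiveLimit:
    """Vegas-style concurrency limit driven by observed request latency."""

    def __init__(self, initial=8, min_limit=1, max_limit=64, alpha=2, beta=4):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.alpha = alpha  # below this estimated queue size, grow the limit
        self.beta = beta    # above this estimated queue size, shrink the limit
        self.min_latency = None  # lowest latency seen, used as the no-load baseline

    def on_sample(self, latency):
        """Feed one completed request's latency (seconds); returns the new limit."""
        if latency <= 0:
            return self.limit
        if self.min_latency is None or latency < self.min_latency:
            self.min_latency = latency
        # Estimated number of in-flight requests that are waiting rather than running.
        queued = self.limit * (1 - self.min_latency / latency)
        if queued < self.alpha:
            self.limit = min(self.limit + 1, self.max_limit)
        elif queued > self.beta:
            self.limit = max(self.limit - 1, self.min_limit)
        return self.limit
```

Because generation latency grows with output length, one option for inference workloads is to feed this a normalized signal such as time per output token rather than raw request latency.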
For multi-model deployments, implement per-model backpressure. A single overloaded model should not cause backpressure on unrelated models sharing the same API gateway. Track queue depth and latency per model endpoint, and apply backpressure signals independently. This prevents a runaway request pattern on one model from degrading service for all consumers.
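A minimal sketch of independent per-model tracking might look like this. The thresholds and the model names used in the usage lines are placeholders, and the smoothed latency here is a plain exponentially weighted moving average.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    queue_depth: int = 0
    ewma_latency: float = 0.0


class PerModelBackpressure:
    """Tracks queue depth and smoothed latency separately for each model."""

    def __init__(self, max_queue_depth=32, max_latency=5.0, smoothing=0.2):
        self.max_queue_depth = max_queue_depth
        self.max_latency = max_latency
        self.smoothing = smoothing
        self.models = {}

    def _state(self, model):
        return self.models.setdefault(model, ModelState())

    def on_enqueue(self, model):
        self._state(model).queue_depth += 1

    def on_complete(self, model, latency):
        s = self._state(model)
        s.queue_depth = max(s.queue_depth - 1, 0)
        if s.ewma_latency == 0.0:
            s.ewma_latency = latency  # first sample seeds the average
        else:
            a = self.smoothing
            s.ewma_latency = (1 - a) * s.ewma_latency + a * latency

    def should_slow_down(self, model):
        """True only when this particular model is near its own limits."""
        s = self._state(model)
        return s.queue_depth >= self.max_queue_depth or s.ewma_latency >= self.max_latency


# Hypothetical model names: a backlog on one does not affect the other.
bp = PerModelBackpressure(max_queue_depth=3)
for _ in range(3):
    bp.on_enqueue("llm-large")
bp.on_enqueue("classifier")
```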
Load Shedding for GPU Protection
Load shedding is the deliberate rejection of requests to protect system stability when backpressure alone is insufficient. For GPU-bound AI workloads, load shedding is your last line of defense against out-of-memory crashes and cascading failures.
Design a tiered priority system for your AI requests. Not all requests carry equal business value. A customer-facing chatbot query is typically higher priority than an internal batch analysis job. Assign priority levels at the API key or consumer level, and when GPU memory or compute utilization exceeds a threshold (typically 85-90%), begin shedding lower-priority requests first.
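Tiered shedding can be as simple as a per-priority utilization threshold, as in this sketch. The tier names and the 0.97 ceiling for the top tier are assumptions for illustration; only the 85-90% range comes from the guidance above.

```python
# Utilization at which each tier starts being shed (tier names are hypothetical).
SHED_AT = {
    "batch": 0.85,
    "internal": 0.90,
    "customer_facing": 0.97,
}


def admit(priority, gpu_utilization, shed_at=SHED_AT):
    """Return True if a request of this priority should be accepted right now."""
    # Unknown priorities are treated like the lowest tier.
    threshold = shed_at.get(priority, min(shed_at.values()))
    return gpu_utilization < threshold
```

At 87% utilization this rejects batch traffic while still admitting internal and customer-facing requests.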
Implement GPU memory-aware admission control. Before accepting a new inference request, check the current GPU memory utilization. If the estimated memory requirement for the new request (model activation memory plus KV cache for the expected sequence length) would push utilization above your safety threshold, reject the request immediately rather than risking an OOM failure that could affect all in-flight requests.
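The admission check can be sketched with the usual KV cache size estimate for a standard transformer: two tensors (keys and values) per layer, per KV head, per token. The model dimensions below are illustrative values typical of a 7B-parameter model in 16-bit precision, and the function names are hypothetical. How you obtain current memory usage depends on your serving stack.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Estimated KV cache size for one sequence of seq_len tokens."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


def admit_request(used_bytes, total_bytes, input_tokens, max_tokens, model,
                  activation_bytes=0, safety_threshold=0.9):
    """Accept only if the worst-case footprint stays under the safety threshold."""
    seq_len = input_tokens + max_tokens  # worst case: generation runs to max_tokens
    needed = activation_bytes + kv_cache_bytes(seq_len, **model)
    return used_bytes + needed <= safety_threshold * total_bytes


GIB = 2**30
# Illustrative dimensions, not read from any real deployment.
model = {"num_layers": 32, "num_kv_heads": 32, "head_dim": 128}
ok = admit_request(18 * GIB, 24 * GIB, input_tokens=1024, max_tokens=3072, model=model)
rejected = not admit_request(20 * GIB, 24 * GIB, input_tokens=1024, max_tokens=3072, model=model)
```

With these dimensions each token costs 512 KiB of cache, so a 4,096-token sequence needs about 2 GiB. Servers that preallocate a KV cache pool would need the same check expressed against free cache capacity rather than raw device memory.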
When shedding load, provide actionable error responses. Include the reason (memory pressure, compute saturation, or queue overflow), the estimated time until capacity is available, and whether the request should be retried or redirected. For environments with multiple inference endpoints, include routing hints that direct the client to a less loaded instance.
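A possible shape for such a response is sketched below. HTTP 503 and the Retry-After header are standard; the JSON field names and the example endpoint URL are assumptions made for this illustration.

```python
import json
import math

SHED_REASONS = {"memory_pressure", "compute_saturation", "queue_overflow"}


def shed_response(reason, retry_after_seconds, retryable=True, alternate_endpoint=None):
    """Build (status, headers, body) for a request rejected by load shedding."""
    if reason not in SHED_REASONS:
        raise ValueError(f"unknown shed reason: {reason}")
    body = {
        "error": "overloaded",
        "reason": reason,
        "retryable": retryable,
        "retry_after_seconds": retry_after_seconds,
    }
    if alternate_endpoint is not None:
        body["alternate_endpoint"] = alternate_endpoint  # routing hint for the client
    headers = {"Content-Type": "application/json"}
    if retryable:
        # Retry-After takes whole seconds, so round up.
        headers["Retry-After"] = str(math.ceil(retry_after_seconds))
    return 503, headers, json.dumps(body)


status, headers, body = shed_response(
    "memory_pressure", 2.3, alternate_endpoint="https://inference-b.example.internal/v1"
)
```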
Request Costing and Fair Scheduling
Fair scheduling ensures that no single consumer can monopolize GPU resources at the expense of others. This is especially important in multi-tenant on-premises deployments where multiple teams share the same inference infrastructure.
Implement a weighted fair queue at the inference gateway. Each consumer's requests enter a separate queue, and the scheduler selects the next request to process based on each consumer's weight (their allocated share of total capacity) and recent usage. Consumers who have used less than their fair share get priority; consumers who have exceeded their share are deprioritized but not blocked.
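A compact way to approximate this is to track each consumer's usage divided by its weight and always serve the non-empty queue with the smallest value. The sketch below takes that approach; it is a simplification of weighted fair queueing, and the class and method names are hypothetical.

```python
from collections import deque


class WeightedFairQueue:
    """Serves the consumer with the lowest usage relative to its weight."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.queues = {c: deque() for c in self.weights}
        self.usage = {c: 0.0 for c in self.weights}

    def enqueue(self, consumer, request, cost):
        self.queues[consumer].append((request, cost))

    def dequeue(self):
        """Return (consumer, request) for the next request, or None if all queues are empty."""
        ready = [c for c, q in self.queues.items() if q]
        if not ready:
            return None
        # Lowest weight-normalized usage wins; over-users wait but are never blocked.
        chosen = min(ready, key=lambda c: self.usage[c] / self.weights[c])
        request, cost = self.queues[chosen].popleft()
        self.usage[chosen] += cost
        return chosen, request

    def decay(self, factor=0.5):
        """Call periodically so that only recent usage influences priority."""
        for c in self.usage:
            self.usage[c] *= factor


wfq = WeightedFairQueue({"a": 2, "b": 1})
for i in range(4):
    wfq.enqueue("a", f"a{i}", cost=100)
    wfq.enqueue("b", f"b{i}", cost=100)
first_six = [wfq.dequeue()[0] for _ in range(6)]
```

With weights of 2 and 1 and equal request costs, consumer a receives four of the first six slots and consumer b receives two.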
Account for the actual cost of requests in the fairness calculation. A consumer that sends one request requiring 10,000 output tokens should not be treated the same as a consumer that sends 100 requests of 100 tokens each, even though the total token count is identical. A long-running request holds GPU memory for its entire generation, and its KV cache keeps growing as it runs, so it affects cluster capacity differently than many short requests.
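To put a rough number on that difference, one simple proxy is "token-steps" of KV cache residency: at each decode step the request holds cache for every token in its context so far. The sketch below assumes one decode step per output token and ignores batching effects; the function name is hypothetical.

```python
def kv_token_steps(input_tokens, output_tokens):
    """Sum of context lengths held in cache across all decode steps."""
    # Step i (1-based) holds input_tokens + i tokens; sum over i = 1..output_tokens.
    return output_tokens * input_tokens + output_tokens * (output_tokens + 1) // 2


one_long = kv_token_steps(0, 10_000)      # a single 10,000-token generation
many_short = 100 * kv_token_steps(0, 100)  # one hundred 100-token generations
```

Under these assumptions the single long request accounts for 50,005,000 token-steps against 505,000 for the hundred short ones, roughly a 99x difference for the same total token count.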
Publish a cost model to your API consumers so they can predict and optimize their usage. The cost model should express the GPU-seconds consumed per request as a function of input length, output length, and model size. When teams understand the true resource cost of their requests, they naturally optimize: using shorter prompts, requesting fewer output tokens, or routing simpler tasks to smaller models.
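A published cost model could be as simple as the sketch below, which charges input tokens at a prefill rate and output tokens at a decode rate, with model size entering through per-model rates. The model names and throughput numbers are placeholders; real values would come from profiling your own cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRates:
    prefill_tokens_per_sec: float  # how fast input tokens are processed
    decode_tokens_per_sec: float   # how fast output tokens are generated


# Placeholder rates for two hypothetical models; measure your own.
RATES = {
    "small-7b": ModelRates(prefill_tokens_per_sec=8000.0, decode_tokens_per_sec=80.0),
    "large-70b": ModelRates(prefill_tokens_per_sec=1000.0, decode_tokens_per_sec=20.0),
}


def estimate_gpu_seconds(model, input_tokens, output_tokens, rates=RATES):
    """Estimated GPU-seconds attributed to one request."""
    r = rates[model]
    return input_tokens / r.prefill_tokens_per_sec + output_tokens / r.decode_tokens_per_sec


small = estimate_gpu_seconds("small-7b", input_tokens=800, output_tokens=160)
large = estimate_gpu_seconds("large-70b", input_tokens=800, output_tokens=160)
```

With these placeholder rates the same request costs about 2.1 GPU-seconds on the smaller model and about 8.8 on the larger one, which makes the case for routing simpler tasks to smaller models concrete. Since inference servers batch requests together, figures like these are an attribution of shared GPU time rather than an exact measurement.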
Monitoring and Tuning Your Protection Layer
Rate limiting and backpressure configurations are not set-and-forget. They require ongoing monitoring and adjustment as your workload patterns evolve. Build dashboards that track key protection metrics: rate limit hit rates per consumer, queue depth trends, load shedding frequency, and the gap between allocated and actual token usage.
A high rate limit hit rate for a specific consumer might indicate that their allocation is too low for legitimate use, or it might indicate a runaway process generating excessive requests. Investigate before adjusting limits.
Monitor load shedding events as capacity planning signals. If your system sheds load during predictable peak hours, you may need additional GPU capacity or better traffic shaping. If shedding occurs at unpredictable times, investigate whether specific request patterns are causing GPU memory spikes.
Test your protection mechanisms under controlled conditions. Run load tests that deliberately exceed your rate limits and trigger backpressure to verify that the system behaves as designed. Confirm that load shedding activates at the right thresholds and that high-priority requests continue to be served when lower-priority traffic is shed. These protection layers are only valuable if they work correctly when you actually need them.
Featured image by Daniel Julio on Unsplash.