Why Batching Matters More On-Premises

When you serve large language models through a cloud API, batching happens behind the scenes, and the provider absorbs the cost of idle GPU cycles between requests. On-premises, you own the hardware, so every idle millisecond is capacity you have paid for and not used. A single H100 GPU, rated for up to 700 watts, that sits waiting for the next request is pure waste.

Inference batching groups multiple incoming requests together so the GPU processes them in a single forward pass. The arithmetic is compelling: a model that serves 15 tokens per second for a single request can often serve 8 concurrent requests at 12 tokens per second each. That is 96 tokens per second in aggregate, roughly 6x the single-request throughput, at the cost of each request generating about 20% more slowly. The key is choosing the right batching strategy for your workload profile.
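The arithmetic above can be checked in a few lines. This is only a worked example using the figures quoted in the paragraph; the function name is illustrative, and real numbers depend on the model, hardware, and sequence lengths.

```python
def aggregate_throughput(per_request_tps: float, concurrent_requests: int) -> float:
    """Total tokens per second across all requests sharing the GPU."""
    return per_request_tps * concurrent_requests

single = aggregate_throughput(15.0, 1)     # one request at 15 tokens/s
batched = aggregate_throughput(12.0, 8)    # eight requests at 12 tokens/s each

speedup = batched / single                 # aggregate gain from batching
per_request_slowdown = 1 - 12.0 / 15.0     # fraction of per-request speed given up

print(f"aggregate: {batched:.0f} tok/s, speedup: {speedup:.1f}x, "
      f"per-request slowdown: {per_request_slowdown:.0%}")
```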

Static Batching: Simple but Limiting

Static batching collects a fixed number of requests and processes them together. The server waits until N requests arrive or a timeout expires, whichever comes first. This is the simplest approach and works well when request lengths are uniform and arrival rates are predictable.
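A minimal sketch of the "N requests or a timeout" rule, using only the Python standard library. The function name and parameters are hypothetical, and the timer here starts when collection begins; some servers start it when the first request arrives instead.

```python
import queue
import time

def collect_static_batch(requests: queue.Queue, batch_size: int, timeout_s: float) -> list:
    """Return up to batch_size requests, waiting at most timeout_s in total."""
    batch = []
    deadline = time.monotonic() + timeout_s
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                      # timeout expired: ship a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                      # nothing arrived before the deadline
    return batch

# Usage: three queued requests, batches of two, 50 ms timeout.
pending = queue.Queue()
for request_id in ["a", "b", "c"]:
    pending.put(request_id)
first = collect_static_batch(pending, batch_size=2, timeout_s=0.05)   # full batch
second = collect_static_batch(pending, batch_size=2, timeout_s=0.05)  # partial batch
```

The second call returns a partial batch once the timeout expires, which is the behavior that keeps a quiet queue from stalling forever.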

The problem emerges with variable-length inputs and outputs. In a static batch, every request must wait for the longest sequence to complete before any results are returned. A short summarization task gets held hostage by a long code-generation request in the same batch. For interactive applications where users expect streaming responses, this creates unacceptable tail latency.

Static batching remains appropriate for offline batch processing jobs, such as nightly document classification or embedding generation, where latency is irrelevant and maximizing throughput is the only goal.

Continuous Batching: The Production Standard

Continuous batching, also called iteration-level batching or in-flight batching, solves the variable-length problem. Instead of waiting for an entire batch to finish, the scheduler inserts new requests into the batch at every decode step. When a request in the batch completes (generates its end-of-sequence token), its slot is immediately freed for a waiting request.

Serving frameworks like vLLM, TensorRT-LLM, and text-generation-inference all implement continuous batching. In practice, this means your GPU stays saturated even when request lengths vary wildly. A request that needs 20 tokens of output exits the batch quickly, and its slot is filled within one iteration cycle.
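To make the difference concrete, here is a toy simulation that counts decode steps under both strategies. It is a sketch under strong assumptions: all requests are queued at step 0, every decode step takes the same time regardless of batch size, prefill is ignored, and output lengths are known in advance. It is not how any particular framework implements its scheduler.

```python
from collections import deque

def static_batching(output_lens: list, batch_size: int) -> list:
    """Decode step at which each request's result is returned."""
    finish, step = [], 0
    for i in range(0, len(output_lens), batch_size):
        batch = output_lens[i:i + batch_size]
        step += max(batch)                  # the batch waits for its longest sequence
        finish.extend([step] * len(batch))
    return finish

def continuous_batching(output_lens: list, batch_size: int) -> list:
    """Same measurement, but freed slots are refilled before every decode step."""
    waiting = deque(enumerate(output_lens))
    running = {}                            # request index -> tokens still to generate
    finish = [0] * len(output_lens)
    step = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            idx, remaining = waiting.popleft()
            running[idx] = remaining
        step += 1                           # one decode iteration for the whole batch
        for idx in list(running):
            running[idx] -= 1
            if running[idx] <= 0:           # request emitted its last token
                finish[idx] = step
                del running[idx]
    return finish

lens = [20, 200, 20, 20]                    # one long request among short ones
static_finish = static_batching(lens, batch_size=2)
continuous_finish = continuous_batching(lens, batch_size=2)
```

With these inputs the short requests finish at steps 20, 40, and 60 under continuous batching, while under static batching none of them returns before step 200.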

To configure continuous batching effectively on-premises, focus on three parameters:

Max batch size: The upper bound on concurrent sequences in a batch. Derive it from the GPU memory left over after model weights and activation buffers are loaded, divided by the KV cache footprint of a typical sequence. Overcommitting leads to out-of-memory errors or constant preemption; undercommitting leaves throughput on the table.

Max waiting time: How long a request can sit in the queue before being added to the next batch iteration. For real-time applications, keep this under 100ms. For background tasks, you can increase it to allow larger batches to form.

Preemption policy: When the batch is full and a higher-priority request arrives, should you preempt a running request? Some engines, vLLM among them, can swap a preempted request's KV cache to CPU memory or recompute it when the request resumes, which allows priority-based scheduling without dropping the request. Check which preemption modes your engine version supports.

PagedAttention and Memory-Efficient KV Caching

The biggest constraint on batch size is KV cache memory. Each active sequence maintains a key-value cache that grows with sequence length. Traditional implementations pre-allocate the maximum possible cache for each sequence, wasting memory on sequences that will be short.
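A back-of-the-envelope sizing sketch follows. The per-token formula (keys plus values, for every layer and KV head) is standard for transformer decoders; the model shape and the 40 GiB budget below are illustrative assumptions, not figures from a specific deployment.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def max_sequences(kv_budget_bytes: int, tokens_per_sequence: int,
                  bytes_per_token: int) -> int:
    """How many sequences fit if each one reserves tokens_per_sequence tokens."""
    return kv_budget_bytes // (tokens_per_sequence * bytes_per_token)

# Assumed model shape: 32 layers, 8 KV heads, head dimension 128, fp16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

# Assumed budget: 40 GiB of GPU memory left for the KV cache after weights.
budget = 40 * 1024**3

preallocated = max_sequences(budget, tokens_per_sequence=8192, bytes_per_token=per_token)
typical = max_sequences(budget, tokens_per_sequence=2048, bytes_per_token=per_token)
```

Under these assumptions, reserving the full 8192-token maximum for every sequence fits 40 sequences, while sequences that actually average 2048 tokens would fit 160. That gap is the memory traditional pre-allocation wastes.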

PagedAttention, introduced by the vLLM project, applies the operating system concept of virtual memory paging to KV caches. Instead of one contiguous pre-allocated region per sequence, the cache is divided into fixed-size pages (blocks) that are allocated on demand. This removes external fragmentation and limits internal fragmentation to the last, partly filled page of each sequence, which can increase the effective batch size by 2-4x on the same hardware.

For on-premises deployments, this directly translates to serving more concurrent users per GPU. If your current setup handles 16 concurrent sequences with traditional KV caching, PagedAttention might push that to 40-50 sequences, depending on your average sequence length distribution. The framework handles paging transparently — you get the benefit by simply using a serving engine that supports it.
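The idea of on-demand page allocation can be sketched with a small bookkeeping class. This is a simplified illustration of the concept, not vLLM's implementation: the class and method names are hypothetical, and it tracks only which pages belong to which sequence, not the tensors themselves.

```python
class PagedKVCache:
    """Toy page table: each sequence gets fixed-size pages only as it grows."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # physical page ids
        self.block_tables = {}                       # seq_id -> list of page ids
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> bool:
        """Reserve room for one more token; return False if no page is free."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:            # no page yet, or last page is full
            if not self.free_blocks:
                return False                         # caller must queue or preempt
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def free(self, seq_id: str) -> None:
        """Return all pages of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("a")       # 17 tokens need two 16-token pages
cache.append_token("b")           # a 1-token sequence needs one page
pages_a = len(cache.block_tables["a"])
free_before = len(cache.free_blocks)
cache.free("a")
free_after = len(cache.free_blocks)
```

A short sequence holds one page instead of a full maximum-length reservation, and its pages return to the pool the moment it finishes.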

Priority Queuing and Multi-Tier Serving

Not all inference requests are equal. An executive dashboard generating real-time summaries needs sub-second response times. A nightly batch job processing thousands of support tickets can tolerate minutes of queuing. On-premises, where GPU capacity is fixed, a well-designed priority system is essential.

Implement at least two tiers: a real-time tier with strict latency SLOs and reserved GPU capacity, and a batch tier that absorbs remaining capacity. The batch tier acts as a natural buffer, expanding when real-time load is low and contracting during peak hours. This is functionally similar to how cloud spot instances work, except you control the scheduling logic entirely.

Use request metadata — source application, user tier, task type — to route incoming requests. A reverse proxy like Envoy or a dedicated inference gateway can handle this routing, applying rate limits per tier and ensuring the real-time tier never starves.
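A two-tier queue of this kind can be sketched with the standard library's heapq. The metadata keys and the routing rule are hypothetical placeholders for whatever your gateway actually receives. This version uses strict priority, so a real deployment would likely add rate limits or aging to stop the batch tier from starving under sustained real-time load.

```python
import heapq
import itertools

REALTIME, BATCH = 0, 1            # lower value is dequeued first

def tier_for(metadata: dict) -> int:
    """Hypothetical routing rule based on request metadata."""
    if metadata.get("task_type") == "interactive":
        return REALTIME
    return BATCH

class TieredQueue:
    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tie-breaker: FIFO within a tier

    def submit(self, request_id: str, metadata: dict) -> None:
        entry = (tier_for(metadata), next(self._order), request_id)
        heapq.heappush(self._heap, entry)

    def next_request(self):
        """Pop the highest-priority request, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

q = TieredQueue()
q.submit("nightly-1", {"source": "ticket-job", "task_type": "batch"})
q.submit("dash-1", {"source": "dashboard", "task_type": "interactive"})
q.submit("nightly-2", {"source": "ticket-job", "task_type": "batch"})
served = [q.next_request() for _ in range(4)]
```

The dashboard request jumps ahead of both nightly jobs even though it arrived second, and the nightly jobs keep their arrival order.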

Measuring and Tuning Batch Performance

The metrics that matter for batching are time to first token (TTFT), inter-token latency (ITL), throughput (tokens per second across all requests), and GPU utilization. These metrics often compete: pushing throughput up with larger batches raises TTFT, because requests wait longer to be scheduled, and tends to raise ITL, because each decode step does more work.
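These metrics can be computed from per-token timestamps if your serving layer logs them. The sketch below assumes a hypothetical log format in which each request is recorded as an arrival time plus a list of token emission times, all in seconds.

```python
def ttft(arrival_s: float, token_times_s: list) -> float:
    """Time to first token: delay between arrival and the first emitted token."""
    return token_times_s[0] - arrival_s

def mean_itl(token_times_s: list) -> float:
    """Mean inter-token latency: average gap between consecutive tokens."""
    gaps = [later - earlier for earlier, later in zip(token_times_s, token_times_s[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

def tokens_per_second(requests: list) -> float:
    """Aggregate throughput over the window from first arrival to last token."""
    total_tokens = sum(len(times) for _, times in requests)
    start = min(arrival for arrival, _ in requests)
    end = max(times[-1] for _, times in requests)
    return total_tokens / (end - start)

# Two logged requests: (arrival time, token emission times).
log = [
    (0.0, [0.5, 0.75, 1.0, 1.25]),
    (0.5, [1.0, 1.5, 2.0]),
]
first_ttft = ttft(*log[0])
first_itl = mean_itl(log[0][1])
overall = tokens_per_second(log)
```

In practice you would report percentiles (p50, p95, p99) of TTFT and ITL per tier, not a single mean, since tail latency is what users notice.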

Start by profiling your actual workload. Capture a week of production request logs including input lengths, output lengths, arrival timestamps, and source applications. Replay this traffic against your serving infrastructure with different batching configurations. Tools like the inference benchmarking scripts included with vLLM and TensorRT-LLM make this straightforward.

Watch for two common failure modes. First, batch starvation: during low-traffic periods, requests wait for a batch that never fills. Fix this with aggressive timeout settings or a minimum batch size of 1 during off-peak hours. Second, memory pressure cascading: under high load, large batches consume all KV cache memory, forcing preemptions that slow everything down. Set conservative max batch sizes and let the queue absorb bursts rather than cramming everything onto the GPU.

Batching configuration is not set-and-forget. As your model mix changes, as new applications come online, and as traffic patterns shift, revisit your settings quarterly. The difference between a well-tuned and poorly-tuned batching configuration on the same hardware can easily be 3-5x in effective throughput.

Featured image by CoolIT Systems on Unsplash.

Inference Batching Strategies for On-Premises LLM Serving