The cold-start problem in on-premises LLM serving

Loading a large language model into GPU memory is not instantaneous. A 7-billion-parameter model in FP16 requires approximately 14 GB of GPU memory for the weights alone (2 bytes per parameter). Loading those weights from disk, transferring them to the GPU, and initializing the inference runtime can take 30 seconds to several minutes depending on storage speed, PCIe bandwidth, and model architecture. For larger models in the 30B to 70B range, cold-start times of 5 to 10 minutes are common on standard enterprise hardware.
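The arithmetic behind these figures can be sketched as follows. The disk and PCIe bandwidth values are illustrative assumptions, not measurements, and the result is only a lower bound: it ignores deserialization and runtime initialization, which is why real cold starts take considerably longer.

```python
def weight_bytes(num_params: float, bytes_per_param: int = 2) -> float:
    """Size of the raw weights in bytes (FP16 uses 2 bytes per parameter)."""
    return num_params * bytes_per_param


def min_load_seconds(size_bytes: float, disk_gbps: float, pcie_gbps: float) -> float:
    """Lower bound when the disk read and the host-to-GPU copy run back to back.

    Bandwidths are in GB/s and are assumed values for illustration.
    """
    gb = size_bytes / 1e9
    return gb / disk_gbps + gb / pcie_gbps


size = weight_bytes(7e9)  # 7B parameters in FP16
print(f"{size / 1e9:.0f} GB of weights")
print(f"{min_load_seconds(size, disk_gbps=3.0, pcie_gbps=25.0):.1f} s lower bound")
```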

In cloud environments, this problem is mitigated by keeping instances perpetually warm and relying on the provider's elastic scaling to absorb demand. On-premises environments face tighter constraints. GPU resources are finite and shared across teams and workloads. Keeping every model loaded at all times is often not feasible when you have dozens of fine-tuned variants, multiple model families, or GPU memory pressure from training workloads running on the same infrastructure.

The result is a direct tension between resource efficiency and response latency. Aggressive model eviction keeps GPU utilization high but creates latency spikes when evicted models are requested. Conservative eviction wastes expensive GPU capacity on idle models. Effective cold-start optimization resolves this tension by making model loading faster, smarter, or unnecessary.

Memory-mapped model loading

The most impactful single optimization for cold-start latency is memory-mapped file I/O for model weights. Instead of reading the entire model file into CPU memory and then transferring it to the GPU, memory mapping maps the model file on disk into the process's virtual address space. Pages are loaded on demand as they are accessed, and the operating system's page cache provides transparent caching of frequently accessed weights.

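In practice, this means the model file becomes usable before it has been read in full. The inference runtime can begin initializing and copying the first layers of the model while later layers are still being paged in from disk, so I/O overlaps with setup instead of preceding it. For a 14 GB model on NVMe storage, memory-mapped loading can reduce the time to first token from 45 seconds to under 10 seconds. The first token still requires a forward pass through every layer, so the saving comes from overlapping I/O with initialization, skipping the intermediate full copy in CPU memory, and reusing pages that are already in the page cache, not from running on a partially loaded model.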

To maximize the benefit, store model files in a format that supports efficient memory mapping. SafeTensors is designed specifically for this: it uses a flat memory layout with a header that describes tensor locations, allowing individual tensors to be loaded independently without parsing the entire file. The GGUF format used by llama.cpp also supports memory mapping natively.
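A minimal sketch of why this layout suits memory mapping, using only the Python standard library. It writes a tiny file in the documented SafeTensors layout (an 8-byte little-endian header length, a JSON header with per-tensor byte offsets, then raw data) and then reads a single tensor through `mmap`. The helper names and tensor names are hypothetical, and production code would normally use the `safetensors` library instead of parsing the header by hand.

```python
import json
import mmap
import os
import struct
import tempfile


def write_demo_file(path, tensors):
    """Write float32 tensors as: u64 header length, JSON header, raw data."""
    header, blobs, offset = {}, [], 0
    for name, values in tensors.items():
        blob = struct.pack(f"<{len(values)}f", *values)
        header[name] = {"dtype": "F32", "shape": [len(values)],
                        "data_offsets": [offset, offset + len(blob)]}
        blobs.append(blob)
        offset += len(blob)
    header_bytes = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        f.writelines(blobs)


def read_tensor(path, name):
    """Map the file and decode only the bytes that belong to one tensor."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        (header_len,) = struct.unpack_from("<Q", mm, 0)
        header = json.loads(mm[8:8 + header_len].decode("utf-8"))
        begin, end = header[name]["data_offsets"]  # relative to the data section
        data_start = 8 + header_len
        count = (end - begin) // 4                 # 4 bytes per float32
        return list(struct.unpack_from(f"<{count}f", mm, data_start + begin))


path = os.path.join(tempfile.mkdtemp(), "demo.safetensors")
write_demo_file(path, {"layer0.weight": [1.0, 2.0], "layer1.weight": [0.5, 0.25, 4.0]})
print(read_tensor(path, "layer1.weight"))
```

Only the header and the pages backing `layer1.weight` are touched; the bytes of `layer0.weight` are never read from disk unless something accesses them.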

Pair memory-mapped loading with high-speed local storage. The cost of a page fault on a page that is not yet resident in memory is determined by your storage latency and throughput. NVMe SSDs with sequential read speeds of 3 to 7 GB/s make memory-mapped loading practical. Spinning disks or network-attached storage will negate most of the benefit. If your on-premises infrastructure uses shared storage, consider maintaining a local NVMe cache on each GPU server specifically for model files.

Predictive model preloading

If you can predict which models will be needed before they are requested, you can begin loading them proactively and eliminate cold-start latency entirely for most requests. This requires analyzing your workload patterns to build a preloading strategy.

Start by profiling your model access patterns over several weeks. Most enterprise deployments show strong temporal patterns: certain models are used primarily during business hours, others spike during batch processing windows, and some are tied to specific applications that have predictable usage schedules. Use this historical data to build a time-based preloading schedule that loads models before their expected usage window.
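One way to turn such a profile into a schedule is sketched below. The input format (a list of hour-of-day and model-name pairs), the `min_share` threshold, and the one-hour lead time are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def build_schedule(access_log, min_share=0.2, lead_hours=1):
    """Map hour-of-day -> models to preload, from (hour, model) access records.

    A model is scheduled lead_hours before every hour that holds at least
    min_share of that model's total requests.
    """
    per_model = defaultdict(Counter)
    for hour, model in access_log:
        per_model[model][hour] += 1

    schedule = defaultdict(set)
    for model, hours in per_model.items():
        total = sum(hours.values())
        for hour, count in hours.items():
            if count / total >= min_share:
                schedule[(hour - lead_hours) % 24].add(model)
    return {hour: sorted(models) for hour, models in sorted(schedule.items())}


# Hypothetical history: a support model busy at 09:00 and 10:00,
# and a batch model that only runs at 02:00.
log = ([(9, "support-bot")] * 40 + [(10, "support-bot")] * 35
       + [(14, "support-bot")] * 5 + [(2, "batch-summarizer")] * 50)
print(build_schedule(log))
```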

For less predictable workloads, implement request-pattern-based preloading. If your system serves multiple models through a gateway, analyze the sequence of model requests from individual users or applications. If users who request Model A frequently request Model B within the next few minutes, preload Model B when you see a Model A request. This is analogous to CPU cache prefetching but at the model serving level.
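A rough sketch of this idea counts how often one model is followed by another for the same client within a time window. The class name, the five-minute window, and the 0.5 probability threshold are hypothetical choices, not part of any particular gateway.

```python
from collections import Counter, defaultdict


class FollowUpPredictor:
    """Learns 'model A is often followed by model B' from gateway traffic."""

    def __init__(self, window_seconds=300, min_probability=0.5):
        self.window = window_seconds
        self.min_probability = min_probability
        self.last_request = {}                 # client -> (model, timestamp)
        self.followups = defaultdict(Counter)  # model -> Counter of next models
        self.seen = Counter()                  # model -> requests observed

    def observe(self, client, model, timestamp):
        prev = self.last_request.get(client)
        if prev and prev[0] != model and timestamp - prev[1] <= self.window:
            self.followups[prev[0]][model] += 1
        self.seen[model] += 1
        self.last_request[client] = (model, timestamp)

    def to_preload(self, model):
        """Models worth preloading when a request for `model` arrives."""
        total = self.seen[model]
        nexts = self.followups.get(model, {})
        return [m for m, c in nexts.items() if total and c / total >= self.min_probability]


predictor = FollowUpPredictor()
for client, model, ts in [("u1", "A", 0), ("u1", "B", 60),
                          ("u2", "A", 10), ("u2", "B", 100),
                          ("u3", "A", 20), ("u3", "C", 1000)]:  # C arrives too late to count
    predictor.observe(client, model, ts)
print(predictor.to_preload("A"))
```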

A practical implementation uses a lightweight preloading daemon that maintains a priority queue of models ranked by predicted demand. It continuously loads the highest-priority model that is not already in GPU memory, evicting the lowest-priority loaded model if memory is full. The priority function combines time-based predictions, recent request patterns, and model size (smaller models are cheaper to load, so they get a priority boost for marginal cases).
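One iteration of such a daemon might look like the sketch below. The equal weighting of the two prediction scores, the form of the size boost, and the rule that a model is never evicted in favor of a lower-priority one are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModelInfo:
    name: str
    size_gb: float
    scheduled_score: float  # 0..1, from the time-based schedule
    recent_score: float     # 0..1, from recent request patterns


def priority(m, size_weight=0.1):
    """Combine predictions with a small boost for cheap-to-load models."""
    return 0.5 * m.scheduled_score + 0.5 * m.recent_score + size_weight / (1.0 + m.size_gb)


def preload_step(models, loaded, capacity_gb):
    """Return (model_to_load, models_to_evict), or (None, []) if nothing should change."""
    by_name = {m.name: m for m in models}
    candidates = sorted((m for m in models if m.name not in loaded), key=priority, reverse=True)
    if not candidates:
        return None, []
    target = candidates[0]
    used = sum(by_name[n].size_gb for n in loaded)
    evict = []
    for victim in sorted((by_name[n] for n in loaded), key=priority):
        if used + target.size_gb <= capacity_gb:
            break
        if priority(victim) >= priority(target):
            return None, []  # assumption: never evict something more valuable
        evict.append(victim.name)
        used -= victim.size_gb
    if used + target.size_gb > capacity_gb:
        return None, []
    return target.name, evict


models = [ModelInfo("a", 14, 0.9, 0.8), ModelInfo("b", 14, 0.1, 0.0), ModelInfo("c", 14, 0.7, 0.6)]
print(preload_step(models, loaded={"a", "b"}, capacity_gb=30))
```

A real daemon would call `preload_step` in a loop, perform the load and evictions it returns, and refresh the scores between iterations.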

Warm pool management

A warm pool is a set of model instances that are kept loaded in GPU memory and ready to serve requests immediately. Managing the warm pool effectively is the core challenge of cold-start optimization: you want the right models loaded at the right time.

The simplest warm pool strategy is LRU (Least Recently Used) eviction: when GPU memory is full and a new model needs to be loaded, evict the model that has gone the longest without receiving a request. LRU works reasonably well for workloads with temporal locality but fails for periodic workloads. A model used every morning at 9 AM will be evicted by afternoon and face a cold start again the next morning.
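A minimal LRU warm pool, tracking capacity in gigabytes, can be sketched with an `OrderedDict`. The class and model names are hypothetical; the example replays the periodic-workload failure described above.

```python
from collections import OrderedDict


class LRUWarmPool:
    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.loaded = OrderedDict()  # name -> size_gb, least recently used first

    def request(self, name, size_gb):
        """Serve a request; return the models evicted to make room."""
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as most recently used
            return []
        evicted = []
        while self.loaded and sum(self.loaded.values()) + size_gb > self.capacity_gb:
            victim, _ = self.loaded.popitem(last=False)  # drop the oldest entry
            evicted.append(victim)
        self.loaded[name] = size_gb
        return evicted


pool = LRUWarmPool(capacity_gb=30)
pool.request("morning-report", 14)  # used once at 9 AM
pool.request("chat", 14)
evicted = pool.request("code-assistant", 14)
print(evicted)  # the morning model is the first to go
```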

A more sophisticated approach is frequency-weighted LRU that factors in both recency and frequency of use. Models that are used regularly but infrequently (like a daily batch job) maintain a higher eviction resistance than models that received a burst of requests but are unlikely to be needed again soon. The LFU-aging algorithm provides a good balance: it tracks request counts but decays them over time, preventing historically popular but currently unused models from permanently occupying GPU memory.
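The aging idea can be sketched with an exponentially decayed request count. The half-life is a tuning parameter assumed here for illustration; it needs to be long relative to the usage period you want to protect (for a daily batch job, on the order of a day or more).

```python
import math


class AgingLFU:
    """Tracks a request count per model that halves every half_life_s seconds."""

    def __init__(self, half_life_s=3600.0):
        self.decay = math.log(2) / half_life_s
        self.score = {}    # model -> decayed count at time of last update
        self.updated = {}  # model -> timestamp of last update

    def current(self, name, now):
        age = now - self.updated.get(name, now)
        return self.score.get(name, 0.0) * math.exp(-self.decay * age)

    def record(self, name, now):
        self.score[name] = self.current(name, now) + 1.0
        self.updated[name] = now

    def victim(self, loaded, now):
        """The loaded model with the lowest decayed count is evicted first."""
        return min(loaded, key=lambda n: self.current(n, now))


lfu = AgingLFU(half_life_s=3600.0)
for _ in range(10):
    lfu.record("burst", now=0)             # ten requests at once, then silence
for hour in range(11):
    lfu.record("steady", now=hour * 3600)  # one request per hour
print(lfu.victim(["burst", "steady"], now=10 * 3600))
```

After ten hours the burst model's count has decayed to roughly 0.01 while the steady model sits near 2, so the burst model is evicted first even though it has more raw requests.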

Consider also the cost of eviction and reloading when making eviction decisions. A 1B parameter model that takes 3 seconds to reload is a much cheaper eviction than a 70B model that takes 8 minutes. Weight your eviction decisions by reload cost, not just by access patterns. This means preferring to evict small models even if they were used more recently than a large model, because the penalty for being wrong is much smaller.
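One simple way to express this is an expected-penalty score: the probability that a model is needed again soon multiplied by its reload time. The probabilities below are made-up inputs; in practice they would come from whatever access-pattern model you use. The sketch also ignores how much memory each eviction frees, which a fuller policy would factor in.

```python
def eviction_cost(reuse_probability, reload_seconds):
    """Expected seconds of cold-start delay caused by evicting this model."""
    return reuse_probability * reload_seconds


def pick_victim(loaded):
    """loaded: name -> (reuse_probability, reload_seconds). Evict the cheapest mistake."""
    return min(loaded, key=lambda name: eviction_cost(*loaded[name]))


loaded = {
    "small-1b": (0.6, 3),     # used recently, but reloads in 3 seconds
    "large-70b": (0.1, 480),  # rarely needed soon, but takes 8 minutes to reload
}
print(pick_victim(loaded))
```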

Checkpoint-based fast recovery

Even with optimized loading, transferring model weights from disk to GPU memory involves deserialization, memory allocation, and runtime initialization overhead. GPU memory checkpointing bypasses much of this by saving and restoring the complete GPU memory state directly.

The concept is similar to process hibernation in operating systems. When a model is evicted from GPU memory, instead of discarding the GPU state entirely, serialize the GPU memory contents to a fast local store. When the model is needed again, restore the serialized state directly to GPU memory without re-parsing the model file or re-initializing the runtime. This can reduce reload times by 60 to 80 percent compared to loading from the original model file.

NVIDIA's CUDA checkpoint/restore capabilities and tools like CRIU (Checkpoint/Restore In Userspace) with GPU extensions support this approach. The checkpoint file is larger than the original model file because it includes runtime state, KV cache allocations, and CUDA context, so you need fast storage with sufficient capacity. A dedicated NVMe partition for model checkpoints works well.

The tradeoff is storage space versus reload speed. Keeping checkpoints for all models that might be needed consumes significant NVMe capacity. A practical approach is to maintain checkpoints only for models above a size threshold, say 10B parameters, where the reload time savings justify the storage cost. Smaller models load fast enough from their original files that checkpointing provides marginal benefit.
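A sketch of that selection policy under a fixed NVMe budget follows. The 1.3x checkpoint size overhead, the budget, and the model list are assumptions for illustration; measure actual checkpoint sizes in your own environment.

```python
def choose_checkpoints(models, min_params_b=10.0, budget_gb=2000.0, overhead=1.3):
    """Pick models to checkpoint, largest first, within a storage budget.

    models: list of (name, params_in_billions, weights_gb).
    overhead: assumed ratio of checkpoint size to weight file size.
    """
    chosen, used = [], 0.0
    for name, params_b, weights_gb in sorted(models, key=lambda m: m[1], reverse=True):
        checkpoint_gb = weights_gb * overhead
        if params_b >= min_params_b and used + checkpoint_gb <= budget_gb:
            chosen.append(name)
            used += checkpoint_gb
    return chosen, used


catalog = [("model-70b", 70, 140), ("model-13b", 13, 26), ("model-7b", 7, 14)]
print(choose_checkpoints(catalog, budget_gb=200))
```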

Architectural patterns for minimizing cold starts

Beyond per-model optimizations, architectural decisions at the system level can reduce or eliminate cold-start exposure for end users.

Request queuing with estimated wait times converts cold starts from failures into managed delays. When a request arrives for an unloaded model, queue it, begin loading the model, and return an estimated time to the client. The client can decide whether to wait, retry later, or fall back to an alternative. This is far better than a timeout or error response, and it gives the system time to load without pressure to return a degraded result.
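The bookkeeping for this pattern can be sketched as below. It assumes a typical load duration is known per model and leaves the actual loading to the caller; the class, field names, and response shape are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LoadTracker:
    load_seconds: dict                           # model -> typical load duration (assumed known)
    loaded: set = field(default_factory=set)
    started: dict = field(default_factory=dict)  # model -> time the load began
    queues: dict = field(default_factory=dict)   # model -> pending request ids

    def submit(self, request_id, model, now=None):
        """Return 'ready', or queue the request and report an estimated wait."""
        now = time.monotonic() if now is None else now
        if model in self.loaded:
            return {"status": "ready", "eta_seconds": 0.0}
        if model not in self.started:
            self.started[model] = now            # a real system would start loading here
        self.queues.setdefault(model, []).append(request_id)
        elapsed = now - self.started[model]
        eta = max(0.0, self.load_seconds[model] - elapsed)
        return {"status": "queued", "eta_seconds": eta,
                "position": len(self.queues[model])}

    def mark_loaded(self, model):
        """Call when loading completes; returns the requests to dispatch."""
        self.loaded.add(model)
        self.started.pop(model, None)
        return self.queues.pop(model, [])


tracker = LoadTracker(load_seconds={"legal-70b": 120.0})
print(tracker.submit("r1", "legal-70b", now=0.0))
print(tracker.submit("r2", "legal-70b", now=30.0))
```

The second request reports a shorter wait than the first because the load is already 30 seconds in; the client can use that figure to decide whether to wait or fall back.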

Model sharding across GPU pools distributes the warm pool across multiple servers. Here, sharding means partitioning the set of models across servers, not splitting a single model across GPUs. If you have four GPU servers, each maintains a warm pool of different models. A routing layer directs requests to the server that has the requested model loaded. This multiplies your effective warm pool size by the number of servers without requiring each server to have enough GPU memory for all models. KServe and Seldon Core support this routing pattern in Kubernetes-based on-premises deployments.
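The routing decision itself is small. The sketch below is a generic illustration, not the KServe or Seldon Core API: it prefers a server that already holds the model and falls back to the least busy server, which will then incur a cold start. Server and model names are hypothetical.

```python
class ModelRouter:
    def __init__(self, placements):
        """placements: server -> set of models currently loaded on that server."""
        self.placements = placements
        self.inflight = {server: 0 for server in placements}

    def route(self, model):
        """Return (server, is_warm) for a request to `model`."""
        warm = [s for s, models in self.placements.items() if model in models]
        candidates = warm or list(self.placements)  # no warm copy: any server may load it
        server = min(candidates, key=lambda s: self.inflight[s])
        self.inflight[server] += 1
        return server, bool(warm)


router = ModelRouter({"gpu-1": {"chat", "code"}, "gpu-2": {"summarizer"}})
print(router.route("summarizer"))  # warm hit on gpu-2
print(router.route("translator"))  # cold: goes to the least busy server
```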

Speculative execution with a small default model provides immediate responses while the requested model loads. When a request arrives for a cold model, a small, always-loaded base model begins processing the request immediately. If the full model loads before the base model finishes, the response is generated by the full model. If the base model finishes first, its response is returned with a quality indicator, and the full model's response is provided asynchronously if needed. This pattern works particularly well for interactive applications where perceived latency matters more than optimal quality on every response.
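The race between the base model and the loading of the full model can be sketched with `asyncio`. The four coroutine arguments are hypothetical stand-ins for your serving stack, and the sleeps in the demo merely simulate load and inference time.

```python
import asyncio


async def respond(prompt, load_full_model, run_full, run_base):
    """Answer with the full model if it loads first, else with the base model."""
    load = asyncio.create_task(load_full_model())
    base = asyncio.create_task(run_base(prompt))
    done, _ = await asyncio.wait({load, base}, return_when=asyncio.FIRST_COMPLETED)
    if load in done:
        base.cancel()  # the full model is ready, so drop the fallback
        return {"text": await run_full(prompt), "source": "full"}
    # Base model finished first: return it with a quality indicator and hand
    # back the pending load so the caller can follow up asynchronously.
    return {"text": base.result(), "source": "base", "full_pending": load}


async def demo(load_s, base_s):
    async def load_full_model():
        await asyncio.sleep(load_s)

    async def run_full(prompt):
        return f"full:{prompt}"

    async def run_base(prompt):
        await asyncio.sleep(base_s)
        return f"base:{prompt}"

    result = await respond("hi", load_full_model, run_full, run_base)
    pending = result.pop("full_pending", None)
    if pending is not None:
        pending.cancel()  # the demo does not wait for the full model
    return result


print(asyncio.run(demo(load_s=0.01, base_s=0.2)))
print(asyncio.run(demo(load_s=0.2, base_s=0.01)))
```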


Cold-Start Optimization Strategies for On-Premises LLM Serving