Checkpoint and Model Storage Architecture for On-Premises AI
Design patterns for storing, versioning, and recovering large model checkpoints on premises, addressing the unique storage challenges of AI workloads that traditional backup systems were not built for.
Why Traditional Storage Falls Short for AI Workloads
A single 7B-parameter model in full precision (32-bit floats, 4 bytes per parameter) occupies roughly 28 GB. Training that model produces checkpoints every few hundred steps, each the same size. A fine-tuning run with 20 saved checkpoints consumes 560 GB before you account for optimizer states, which can triple the per-checkpoint size. Scale this to multiple teams running experiments on different models, and an organization can easily accumulate tens of terabytes of model artifacts in weeks.
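The arithmetic above can be reproduced with a few lines of Python. This is a rough sizing sketch, not a capacity-planning tool: the function name and the `state_multiplier` parameter are illustrative, and 1 GB is taken as 10**9 bytes.

```python
def checkpoint_size_gb(params_billions, bytes_per_param=4, state_multiplier=1.0):
    """Approximate checkpoint size in GB, taking 1 GB as 10**9 bytes.

    state_multiplier is an illustrative knob: 1.0 means weights only,
    3.0 approximates weights plus optimizer state that triples the size.
    """
    return params_billions * bytes_per_param * state_multiplier


weights_only_gb = checkpoint_size_gb(7)              # 7B parameters in fp32
run_total_gb = 20 * weights_only_gb                  # 20 saved checkpoints
run_total_with_optimizer_gb = 20 * checkpoint_size_gb(7, state_multiplier=3.0)

print(weights_only_gb, run_total_gb, run_total_with_optimizer_gb)
# prints: 28.0 560.0 1680.0
```

Changing `bytes_per_param` to 2 gives the corresponding 16-bit figures.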
Traditional enterprise storage and backup systems were designed for databases, documents, and application data. They optimize for small random reads, incremental backups of changed blocks, and deduplication of text-heavy data. Model checkpoints are the opposite: massive single files, written sequentially at high throughput, with binary content that resists deduplication. Treating model storage as an afterthought leads to training jobs that stall waiting for checkpoint writes, storage systems that fill up unexpectedly, and recovery processes that take hours when they need to take minutes.
A deliberate storage architecture for AI workloads addresses three concerns: write throughput during training, efficient versioning and retrieval, and reliable recovery after failures.
Storage Tiers for Model Lifecycle Stages
Not all model artifacts have the same access patterns. A tiered storage strategy matches performance to need and avoids paying (in hardware cost and rack space) for fast storage that holds cold data.
Hot tier: active training and serving. Checkpoints being written during training and models loaded for inference need high-throughput, low-latency storage. NVMe SSDs directly attached to GPU nodes or a parallel filesystem like Lustre, BeeGFS, or GPFS provide the throughput — typically 5-20 GB/s — needed to write checkpoints without stalling training. This tier holds only the current training run's checkpoints and the models actively loaded for inference.
Warm tier: recent experiments and staging. Completed training runs, candidate models being evaluated, and adapters waiting for deployment live on network-attached storage with moderate throughput. Object storage solutions like MinIO provide S3-compatible APIs with reasonable performance and the ability to scale capacity independently of compute. Access frequency is weekly or daily, and retrieval latency of seconds is acceptable.
Cold tier: archival and compliance. Older model versions retained for reproducibility, audit, or rollback live on the cheapest available storage. Tape libraries, high-density spinning disk arrays, or deeply compressed object storage work here. Access is rare — monthly or less — and retrieval latency of minutes to hours is tolerable. Regulatory requirements in industries like finance and healthcare may mandate retention of model artifacts for years, making cold storage economics important.
Automate tier migration based on policies. A checkpoint should move from hot to warm when the training run completes, and from warm to cold after a configurable retention period (30-90 days is typical). Tools like MinIO's lifecycle policies or custom scripts triggered by training pipeline completion events handle this reliably.
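A migration policy like the one described can be reduced to a small decision function. The sketch below is one possible shape, assuming each artifact records whether its training run is still active and when it completed; the `Artifact` class, field names, and the 60-day default are hypothetical, and actually moving the data is left to whatever storage tooling you use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Artifact:
    key: str                          # storage key of the checkpoint
    run_active: bool                  # True while the training run is in progress
    completed_at: Optional[datetime]  # None until the run finishes


def target_tier(artifact: Artifact, now: datetime,
                warm_retention: timedelta = timedelta(days=60)) -> str:
    """Return the tier an artifact should live in under the policy above."""
    if artifact.run_active or artifact.completed_at is None:
        return "hot"
    if now - artifact.completed_at < warm_retention:
        return "warm"
    return "cold"


now = datetime(2024, 6, 1)
artifacts = [
    Artifact("run-a/step-500", run_active=True, completed_at=None),
    Artifact("run-b/step-9000", run_active=False, completed_at=now - timedelta(days=10)),
    Artifact("run-c/step-4000", run_active=False, completed_at=now - timedelta(days=120)),
]
for artifact in artifacts:
    print(artifact.key, "->", target_tier(artifact, now))
```

A scheduled job or a pipeline completion hook could call `target_tier` for each artifact and move anything whose current tier differs from the result.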
Versioning Strategies That Scale
Model versioning is not the same as code versioning. Git and similar tools are designed for text diffs measured in kilobytes. Model files are binary blobs measured in gigabytes. Storing model checkpoints in Git — even with Git LFS — creates repositories that are painful to clone, slow to query, and expensive in storage.
Purpose-built model registries handle this better. MLflow's model registry, DVC (Data Version Control), and lakeFS provide version tracking with metadata, lineage, and tagging without requiring full copies of each version. DVC in particular stores version metadata in Git while keeping the actual binary data in a configurable backend (local filesystem, S3-compatible storage, or NFS), giving you Git-like version semantics without Git's scalability limitations.
Design your versioning scheme around what you will actually need to retrieve. A practical approach assigns each model artifact a composite key: model-name/base-version/adapter-name/training-run-id/checkpoint-step. This hierarchy supports common queries — "give me the latest adapter for model X" or "retrieve the checkpoint from training run Y at step 10000" — without requiring a search across flat namespaces.
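As an illustration, the composite key can be modeled as a small typed record with helpers to build, parse, and query it. The class and function names here are hypothetical, and zero-padding the step is an added assumption so that lexicographic listing order in an object store matches numeric order.

```python
from typing import Iterable, NamedTuple, Optional


class CheckpointKey(NamedTuple):
    model_name: str
    base_version: str
    adapter_name: str
    training_run_id: str
    checkpoint_step: int

    def to_path(self) -> str:
        # Zero-pad the step so that string sorting agrees with numeric sorting.
        return "/".join([self.model_name, self.base_version, self.adapter_name,
                         self.training_run_id, f"{self.checkpoint_step:08d}"])


def parse_key(path: str) -> CheckpointKey:
    model, base, adapter, run_id, step = path.split("/")
    return CheckpointKey(model, base, adapter, run_id, int(step))


def latest_checkpoint(paths: Iterable[str], model_name: str,
                      training_run_id: str) -> Optional[CheckpointKey]:
    """Return the highest-step checkpoint for one model and training run."""
    matches = [key for key in map(parse_key, paths)
               if key.model_name == model_name
               and key.training_run_id == training_run_id]
    return max(matches, key=lambda key: key.checkpoint_step, default=None)


paths = [
    CheckpointKey("model-x", "v1", "support-adapter", "run-y", 5000).to_path(),
    CheckpointKey("model-x", "v1", "support-adapter", "run-y", 10000).to_path(),
    CheckpointKey("model-x", "v1", "support-adapter", "run-z", 20000).to_path(),
]
print(latest_checkpoint(paths, "model-x", "run-y").to_path())
# prints: model-x/v1/support-adapter/run-y/00010000
```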
Store metadata alongside each version: training configuration, dataset identifiers, evaluation metrics, hardware used, and the Git commit of the training code. This metadata is small and belongs in a database or registry, not embedded in the model file. When you need to reproduce a result or debug a regression, this metadata lets you reconstruct the exact conditions that produced a specific model version.
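One lightweight way to keep this metadata outside the model file is a table keyed by the artifact path. The sketch below uses SQLite from the Python standard library purely as a stand-in for whatever database or registry you run; the table name, field names, and example values are all hypothetical.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass


@dataclass
class CheckpointMetadata:
    artifact_key: str
    training_config: dict
    dataset_ids: list
    eval_metrics: dict
    hardware: str
    code_commit: str   # Git commit of the training code


def save_metadata(conn: sqlite3.Connection, meta: CheckpointMetadata) -> None:
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint_metadata "
                 "(artifact_key TEXT PRIMARY KEY, body TEXT NOT NULL)")
    conn.execute("INSERT OR REPLACE INTO checkpoint_metadata VALUES (?, ?)",
                 (meta.artifact_key, json.dumps(asdict(meta), sort_keys=True)))
    conn.commit()


def load_metadata(conn: sqlite3.Connection, artifact_key: str) -> CheckpointMetadata:
    row = conn.execute("SELECT body FROM checkpoint_metadata WHERE artifact_key = ?",
                       (artifact_key,)).fetchone()
    if row is None:
        raise KeyError(artifact_key)
    return CheckpointMetadata(**json.loads(row[0]))


conn = sqlite3.connect(":memory:")   # in-memory database for the example only
original = CheckpointMetadata(
    artifact_key="model-x/v1/support-adapter/run-y/00010000",
    training_config={"learning_rate": 0.0002, "batch_size": 32},
    dataset_ids=["tickets-2024-05"],
    eval_metrics={"eval_loss": 1.25},
    hardware="8x GPU node",
    code_commit="0123abc",
)
save_metadata(conn, original)
restored = load_metadata(conn, original.artifact_key)
```

Storing the body as JSON keeps the schema flexible while the primary key ties each record to exactly one model version.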
Checkpoint Write Optimization
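Writing a 30 GB checkpoint to disk while GPUs sit idle is a direct hit to training throughput. At a write speed of 2 GB/s, a single checkpoint write takes 15 seconds. Every GPU in the job waits out that stall, so over a long training run with checkpoints every 500 steps the cost is multiplied by both the number of checkpoints and the number of GPUs, and it adds up to hours of wasted GPU time.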
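To put numbers on the stall, the following sketch computes the per-checkpoint wait and the total idle GPU time for a run. The 30 GB size, 2 GB/s write speed, and 500-step interval come from the text; the 100,000-step run length and 64-GPU job size are assumed purely for illustration.

```python
def stall_seconds(checkpoint_gb: float, write_gb_per_s: float) -> float:
    """Time a synchronous checkpoint write blocks training."""
    return checkpoint_gb / write_gb_per_s


def idle_gpu_hours(total_steps: int, checkpoint_every: int, checkpoint_gb: float,
                   write_gb_per_s: float, num_gpus: int) -> float:
    """GPU-hours spent waiting on synchronous checkpoint writes over a run."""
    num_checkpoints = total_steps // checkpoint_every
    wall_clock_seconds = num_checkpoints * stall_seconds(checkpoint_gb, write_gb_per_s)
    return wall_clock_seconds * num_gpus / 3600


per_checkpoint = stall_seconds(30, 2)   # 15.0 seconds
wasted = idle_gpu_hours(total_steps=100_000, checkpoint_every=500,
                        checkpoint_gb=30, write_gb_per_s=2, num_gpus=64)
print(f"{per_checkpoint:.0f} s per checkpoint, {wasted:.1f} idle GPU-hours")
```

Under these assumptions the run writes 200 checkpoints, which is 50 minutes of wall-clock stall and roughly 53 GPU-hours of idle time.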
Asynchronous checkpointing solves this by overlapping checkpoint writes with continued training. The training process copies the model state to host memory (CPU RAM), then resumes training immediately while a background thread writes from host memory to storage. PyTorch's distributed checkpointing module and frameworks like DeepSpeed support asynchronous writes natively. The trade-off is increased host memory usage — you need enough CPU RAM to hold at least one full checkpoint while the write completes.
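The pattern can be illustrated without any training framework. The sketch below is a standard-library stand-in, not the PyTorch or DeepSpeed API: `copy.deepcopy` plays the role of the GPU-to-host copy, and the `AsyncCheckpointer` class is hypothetical. It allows only one write in flight, which bounds host memory to a single extra snapshot.

```python
import copy
import os
import pickle
import tempfile
import threading


class AsyncCheckpointer:
    """Snapshot state synchronously, then write it in a background thread."""

    def __init__(self):
        self._thread = None

    def save(self, state: dict, path: str) -> None:
        self.wait()                       # at most one snapshot held in memory
        snapshot = copy.deepcopy(state)   # stands in for the copy to host RAM
        self._thread = threading.Thread(target=self._write, args=(snapshot, path))
        self._thread.start()              # caller resumes training immediately

    @staticmethod
    def _write(snapshot: dict, path: str) -> None:
        tmp_path = path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(snapshot, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)        # atomic rename: no half-written file

    def wait(self) -> None:
        if self._thread is not None:
            self._thread.join()
            self._thread = None


state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoint.pkl")
    checkpointer = AsyncCheckpointer()
    checkpointer.save(state, path)
    state["step"] = 101                   # training continues and mutates state
    state["weights"][0] = 9.9
    checkpointer.wait()
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)   # the snapshot taken at step 100, unaffected by later updates
```

Writing to a temporary file and renaming it means a crash during the write leaves the previous checkpoint intact. A production version would also need to surface exceptions raised in the background thread.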
For distributed training across multiple GPUs or nodes, use sharded checkpointing. Instead of gathering the full model state to a single node and writing one large file, each node writes its own shard in parallel. This distributes the I/O load across multiple storage paths and can reduce wall-clock time roughly in proportion to the number of shards, as long as the storage backend's aggregate bandwidth keeps up. PyTorch's FSDP (Fully Sharded Data Parallel) supports sharded checkpoints, though depending on the PyTorch version and API you may need to request the sharded state-dict type explicitly rather than rely on the default.
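A toy version of the idea, with worker threads standing in for training ranks, looks like this. In real distributed training each rank already holds only its own shard, so the `shard_state` split below exists only to make the example self-contained; all function names and the file naming scheme are hypothetical.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def shard_state(state: Dict[str, list], num_shards: int) -> List[Dict[str, list]]:
    """Assign parameters to shards round-robin (illustration only)."""
    shards: List[Dict[str, list]] = [{} for _ in range(num_shards)]
    for index, name in enumerate(sorted(state)):
        shards[index % num_shards][name] = state[name]
    return shards


def write_shard(directory: str, rank: int, shard: Dict[str, list]) -> str:
    path = os.path.join(directory, f"shard-{rank:05d}.pkl")
    with open(path, "wb") as f:
        pickle.dump(shard, f)
    return path


def save_sharded(state: Dict[str, list], directory: str, num_shards: int) -> List[str]:
    shards = shard_state(state, num_shards)
    # Each worker writes its own file, so no single writer handles the full state.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        return list(pool.map(lambda item: write_shard(directory, *item),
                             enumerate(shards)))


def load_sharded(paths: List[str]) -> Dict[str, list]:
    merged: Dict[str, list] = {}
    for path in paths:
        with open(path, "rb") as f:
            merged.update(pickle.load(f))
    return merged


state = {f"layer{i}.weight": [float(i)] * 4 for i in range(8)}
with tempfile.TemporaryDirectory() as tmp_dir:
    shard_paths = save_sharded(state, tmp_dir, num_shards=4)
    shard_names = [os.path.basename(path) for path in shard_paths]
    restored = load_sharded(shard_paths)
```

The speedup only materializes when the shards land on storage that can absorb parallel writes, such as a parallel filesystem or per-node NVMe.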
Incremental checkpointing writes only the parameters that changed since the last checkpoint. For fine-tuning runs where most parameters are frozen (LoRA, for example), the changed parameters may be less than 1% of the total model size, making checkpoint writes nearly instantaneous. Even for full fine-tuning, delta-based approaches can reduce write volume significantly if parameter updates are sparse.
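One simple way to detect which parameters changed is to fingerprint each tensor and compare against the previous checkpoint's fingerprints. The sketch below uses plain Python lists in place of tensors and SHA-256 over their pickled bytes; this is an illustrative approach rather than how any particular framework implements it, and the parameter names are made up.

```python
import hashlib
import pickle
from typing import Dict, List, Tuple


def fingerprint(value) -> str:
    return hashlib.sha256(pickle.dumps(value)).hexdigest()


def compute_delta(state: Dict[str, list],
                  previous: Dict[str, str]) -> Tuple[Dict[str, list], Dict[str, str]]:
    """Return the parameters that changed, plus fingerprints of the full state."""
    delta, current = {}, {}
    for name, value in state.items():
        current[name] = fingerprint(value)
        if previous.get(name) != current[name]:
            delta[name] = value
    return delta, current


def apply_deltas(deltas: List[Dict[str, list]]) -> Dict[str, list]:
    """Rebuild the latest state by replaying deltas from oldest to newest."""
    restored: Dict[str, list] = {}
    for delta in deltas:
        restored.update(delta)
    return restored


state = {
    "base.layer0": [0.1] * 4,   # frozen base weights
    "base.layer1": [0.2] * 4,
    "lora.A": [0.0, 0.0],       # trainable adapter weights
    "lora.B": [0.0, 0.0],
}
first_delta, fingerprints = compute_delta(state, {})              # full checkpoint
state["lora.A"] = [0.01, -0.02]                                   # one training update
second_delta, fingerprints = compute_delta(state, fingerprints)   # adapter only
restored = apply_deltas([first_delta, second_delta])
```

The cost of this design is that recovery must replay the chain of deltas, so it is common to write a full checkpoint periodically to keep the chain short.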
Recovery and Disaster Scenarios
Model storage architecture must handle three failure scenarios: training interruption, inference node failure, and catastrophic storage loss.
For training interruptions — hardware failures, OOM crashes, power events — checkpoint frequency determines recovery cost. Checkpointing every 500 steps means losing at most 500 steps of training upon restart. Calculate the time cost of lost steps and balance it against checkpoint write overhead. For long fine-tuning runs on expensive hardware, checkpointing every 200-300 steps is often justified.
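The balance can be estimated with a first-order model. The sketch below makes assumptions that are not from the text: failures arrive at a constant average rate described by a mean time between failures, and a failure loses half a checkpoint interval on average. The function names and example numbers are hypothetical.

```python
def overhead_fraction(interval_steps: int, step_seconds: float,
                      write_seconds: float, mtbf_hours: float) -> float:
    """Approximate fraction of training time lost to checkpoint writes and rework."""
    interval_seconds = interval_steps * step_seconds
    write_cost = write_seconds / interval_seconds                    # stall per interval
    expected_rework = (interval_seconds / 2) / (mtbf_hours * 3600)   # steps redone after failures
    return write_cost + expected_rework


def best_interval(candidates, step_seconds, write_seconds, mtbf_hours) -> int:
    return min(candidates,
               key=lambda steps: overhead_fraction(steps, step_seconds,
                                                   write_seconds, mtbf_hours))


candidates = [200, 500, 1000, 2000]
# A 15-second blocking write, 2-second steps, one failure per day on average.
sync_choice = best_interval(candidates, step_seconds=2, write_seconds=15, mtbf_hours=24)
# The same run when the write stalls training for about 1 second.
async_choice = best_interval(candidates, step_seconds=2, write_seconds=1, mtbf_hours=24)
print(sync_choice, async_choice)
# prints: 1000 200
```

Under these assumptions, cheaper writes shift the optimum toward more frequent checkpoints, which is why shorter intervals become easier to justify once writes no longer block training.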
For inference node failures, the recovery question is how quickly you can load a model onto a replacement node. A 7B model in 16-bit precision takes 10-30 seconds to load from NVMe storage, but several minutes to pull from network storage. Pre-staging models on inference nodes — keeping a local copy of the active model on each node's local storage — reduces failover time to a cold start of the inference server plus model load from local disk.
For catastrophic storage loss, replication is the answer. Replicate the warm tier to a second storage system, ideally in a separate failure domain (different rack, different power circuit, or a secondary site). Object storage systems like MinIO support built-in replication. For the cold tier, standard enterprise backup practices apply — but verify that your backup system can handle the file sizes involved. Many backup solutions struggle with individual files exceeding 50 GB.
Test recovery regularly. Run a drill where you restore a model from cold storage and serve inference from it. Measure the end-to-end time and verify that the restored model produces identical outputs to the original. Recovery processes that have never been tested are recovery processes that do not work.
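Part of such a drill can be automated. The sketch below restores a file, times the restore, and checks its SHA-256 digest against the value recorded at archive time. A plain file copy stands in for the real retrieval from cold storage, and the function names are hypothetical. A matching digest shows the file is byte-identical; comparing inference outputs on a fixed set of prompts is a separate step.

```python
import hashlib
import os
import shutil
import tempfile
import time
from typing import Tuple


def sha256_of(path: str, chunk_bytes: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(archive_path: str, restore_path: str,
                       expected_sha256: str) -> Tuple[bool, float]:
    """Restore one artifact; return (digest matches, seconds taken)."""
    start = time.perf_counter()
    shutil.copyfile(archive_path, restore_path)   # stand-in for cold-tier retrieval
    elapsed = time.perf_counter() - start
    return sha256_of(restore_path) == expected_sha256, elapsed


with tempfile.TemporaryDirectory() as tmp_dir:
    archive_path = os.path.join(tmp_dir, "archived-model.bin")
    restore_path = os.path.join(tmp_dir, "restored-model.bin")
    with open(archive_path, "wb") as f:
        f.write(os.urandom(4 * 1024 * 1024))      # 4 MiB of stand-in model bytes
    recorded_digest = sha256_of(archive_path)     # normally stored with the metadata
    intact, seconds = restore_and_verify(archive_path, restore_path, recorded_digest)
    corrupted, _ = restore_and_verify(archive_path, restore_path, "0" * 64)

print(f"restore verified: {intact}")
```

Recording `seconds` from each drill gives a baseline to compare against your recovery time target.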
Putting It Together
Model storage is infrastructure that either enables or constrains your AI operations. A well-designed storage architecture lets teams train without interruption, deploy with confidence, and recover from failures quickly. A neglected one creates bottlenecks that slow training, complicate deployment, and risk data loss.
Start with the basics: measure your current checkpoint write throughput, inventory your model artifacts and their sizes, and implement tiered storage with automated lifecycle policies. Add versioning through a model registry with proper metadata. Optimize checkpoint writes using asynchronous and sharded techniques. Test your recovery procedures before you need them.
The investment scales with your AI ambitions. An organization running occasional fine-tuning experiments can manage with a simple NFS share and manual versioning. An organization operating multiple models in production across several teams needs the full architecture described here. Match the sophistication of your storage to the maturity of your AI operations, and plan for growth — model sizes and experiment volume only increase over time.
Featured image by Steve A Johnson on Unsplash.