The reproducibility crisis in enterprise AI

A data scientist trains a model on Tuesday. It performs well. On Thursday, a colleague tries to retrain with the same script, same dataset, same hyperparameters — and gets materially different results. The team spends two days debugging before discovering that a CUDA driver update rolled out Wednesday night, that the random seed was not propagated to every component, and that the training data had been silently modified by an upstream ETL job.

This scenario plays out constantly in on-premises AI teams. Cloud managed services absorb some of this complexity by locking down the environment. On-premises teams must build that determinism themselves. Reproducibility is not an academic concern — it is a prerequisite for debugging model regressions, satisfying audit requirements, meeting regulatory expectations, and maintaining trust in your model development process.

The five layers of training reproducibility

Reproducibility requires control across five distinct layers. Missing any one of them can introduce enough variance to make results non-replicable:

1. Data reproducibility: the training data must be versioned so that any training run can reference the exact dataset snapshot it used. Tools like DVC (Data Version Control), LakeFS, or Delta Lake provide dataset versioning with Git-like semantics. The critical practice is immutability — once a dataset version is used in a training run, it must never be modified. Store a cryptographic hash of the dataset alongside every training run record.

2. Code reproducibility: the training script, data preprocessing code, configuration files, and evaluation code must all be version-controlled. This is table stakes for most teams, but the details matter. Pin every dependency, including transitive ones, to an exact version in a lockfile. Generate that lockfile with a locking tool (pip-compile from pip-tools, conda-lock, or poetry lock) so that the same environment specification produces the same installed packages six months later. A bare pip freeze records whatever happens to be installed, which is a useful snapshot but not a substitute for a resolved lockfile with hashes.

3. Environment reproducibility: the operating system, CUDA toolkit, cuDNN, GPU driver version, Python version, and every library in the stack must be identical across runs. Container images (Docker or Apptainer/Singularity for HPC environments) are the practical solution. Build training images from a Dockerfile that pins every version explicitly, and tag images immutably in your container registry. Never use :latest tags for training images.

4. Hardware reproducibility: different GPU architectures, different numbers of GPUs, and even different GPU interconnect topologies can produce different results due to floating-point non-determinism in parallel reductions. While perfect hardware reproducibility is impractical, you should record the hardware configuration — GPU model, count, interconnect, and NUMA topology — as metadata for every run, and understand which of your workloads are sensitive to hardware changes.

5. Randomness control: random number generators drive data shuffling, weight initialization, dropout, and data augmentation. Every source of randomness must be seeded, and the seed must be recorded. In PyTorch, this means setting seeds for Python's random module, NumPy, torch, and torch.cuda. For full determinism, enable torch.use_deterministic_algorithms(True) and set CUBLAS_WORKSPACE_CONFIG=:4096:8 — but be aware that deterministic mode can reduce performance by 10 to 20 percent on some operations.
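The seeding steps described in layer 5 can be collected into a single helper. What follows is a minimal sketch under stated assumptions, not a complete recipe: the name seed_everything is hypothetical, NumPy and PyTorch are seeded only if they are installed, and the environment variable only takes effect if it is set before the first cuBLAS call.

```python
import os
import random


def seed_everything(seed: int, deterministic: bool = True) -> dict:
    """Seed the RNGs that are available and return a record for the run log.

    Hypothetical helper: the name and the returned dict layout are
    illustrative, not part of any library.
    """
    record = {"seed": seed, "deterministic": deterministic,
              "seeded": ["python.random"]}
    random.seed(seed)

    if deterministic:
        # Required by cuBLAS for deterministic behaviour; must be set
        # before the first cuBLAS call. An existing value is left untouched.
        os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    try:
        import numpy as np
        np.random.seed(seed)
        record["seeded"].append("numpy")
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
        if deterministic:
            torch.use_deterministic_algorithms(True)
            torch.backends.cudnn.benchmark = False
        record["seeded"].append("torch")
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

    return record
```

Call the helper once at the very start of the training entry point and store the returned record with the run, so the seed is captured rather than remembered.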

Containerization as the reproducibility foundation

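Container images are the single most impactful investment for training reproducibility. A well-built training container captures layer 2 (code) and the user-space part of layer 3 (OS packages, CUDA toolkit, cuDNN, Python, and libraries), and provides a natural integration point for layers 1 and 5 (data versioning and seed management). The GPU kernel driver lives on the host, not in the image, so it still has to be pinned and recorded separately.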

Build your training containers in layers to balance reproducibility with build speed:

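Base layer: a pinned NVIDIA CUDA base image that includes the CUDA toolkit and cuDNN at specific versions. Note that a plain devel tag such as nvidia/cuda:12.4.1-devel-ubuntu22.04 ships the toolkit only; choose the cudnn variant of the tag, or install a pinned cuDNN yourself, if the base layer should provide cuDNN. Rebuild this layer only when you deliberately upgrade CUDA.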
Framework layer: PyTorch, TensorFlow, or JAX at a pinned version, installed with pinned dependencies. This layer changes when you upgrade your ML framework.
Application layer: your training code, configuration files, and remaining dependencies. This layer changes with every code update.

Tag each image with a combination of Git commit SHA and build timestamp, and record the image's content digest (sha256) at build time. Store images in a private container registry with immutable tags enabled; registries like Harbor support tag immutability natively. The rule is simple: a training run record references an image by digest or by immutable tag, and that reference must resolve to the same image forever.
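A training launcher can refuse image references that are not pinned. The sketch below classifies a reference as digest-pinned, tagged, or floating; the function name and the three labels are hypothetical, and the parsing is deliberately simple rather than a full implementation of the image reference grammar.

```python
import re

# A reference ending in @sha256:<64 hex chars> is pinned by content digest.
_DIGEST = re.compile(r"@sha256:[0-9a-f]{64}$")


def classify_image_ref(ref: str) -> str:
    """Return 'digest', 'tag', or 'floating' for a container image reference.

    Hypothetical helper for illustration only.
    """
    if _DIGEST.search(ref):
        return "digest"
    # Look only at the last path component so that a registry port
    # (host:5000/...) is not mistaken for a tag.
    last = ref.rsplit("/", 1)[-1]
    tag = last.rsplit(":", 1)[1] if ":" in last else "latest"
    return "floating" if tag == "latest" else "tag"
```

A launcher could accept 'digest' unconditionally, accept 'tag' only for registries with tag immutability enabled, and reject 'floating' outright.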

The training run manifest

Every training run should produce a manifest — a machine-readable record of everything needed to reproduce it. The manifest includes:

Container image reference (registry, repository, digest)
Dataset version identifiers and content hashes
Git commit SHA for the training code
Full hyperparameter set (not just the ones you changed — all of them)
Random seeds for every RNG source
Hardware configuration (GPU model, count, driver version)
Environment variables that affect training behavior
Start time, end time, and training duration
Output model artifact locations and hashes
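
The fields listed above map naturally onto a small data structure. The following is a minimal sketch assuming a custom launcher written in Python; the class name RunManifest, its field names, and the JSON layout are all hypothetical, and the caller is expected to supply values such as the image digest and Git commit.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets and models fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class RunManifest:
    """Illustrative manifest; field names are assumptions, not a standard."""
    image_digest: str
    git_commit: str
    dataset_hashes: dict
    hyperparameters: dict
    seeds: dict
    hardware: dict
    env: dict = field(default_factory=dict)
    artifacts: dict = field(default_factory=dict)
    started_at: str = field(default_factory=_utc_now)
    finished_at: Optional[str] = None

    def finish(self, artifact_paths: list) -> None:
        """Record output artifact hashes and the end time."""
        self.artifacts = {str(p): sha256_file(Path(p)) for p in artifact_paths}
        self.finished_at = _utc_now()

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

The same sha256_file helper can produce the dataset content hashes, so that the manifest and the dataset versioning tool agree on what was trained on.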

Store manifests alongside model artifacts in your model registry. Tools like MLflow, Weights & Biases (self-hosted), or ClearML (community edition, on-premises) can automate manifest generation. If you use a custom training launcher, generating a manifest is a straightforward engineering task. The key is making it mandatory, not optional.
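
Making the manifest mandatory comes down to a completeness check that runs before registration. This sketch assumes the manifest is available as a dictionary; the required field names and the IncompleteManifestError exception are illustrative and should mirror whatever schema the team actually uses.

```python
# Illustrative field names; adapt to the team's own manifest schema.
REQUIRED_FIELDS = (
    "image_digest",
    "dataset_hashes",
    "git_commit",
    "hyperparameters",
    "seeds",
    "hardware",
    "started_at",
    "finished_at",
    "artifacts",
)


class IncompleteManifestError(ValueError):
    """Raised when a manifest lacks a required field (hypothetical name)."""


def missing_fields(manifest: dict) -> list:
    """Return required fields that are absent or empty, in schema order."""
    empty_values = (None, "", {}, [])
    return [name for name in REQUIRED_FIELDS
            if manifest.get(name) in empty_values]


def admit(manifest: dict) -> None:
    """Reject registration unless every required field is populated."""
    missing = missing_fields(manifest)
    if missing:
        raise IncompleteManifestError(
            "manifest is missing: " + ", ".join(missing))
```

Wiring admit into the registration path, rather than into a linter or a review checklist, is what turns the manifest from a convention into a guarantee.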

The manifest serves double duty. For engineers, it is a debugging tool: when a model behaves unexpectedly, the manifest tells you exactly what produced it. For auditors and regulators, it is evidence that your model development process is controlled and traceable.

Handling the hard cases: non-determinism you cannot eliminate

Some sources of non-determinism are inherent to GPU-accelerated training and cannot be fully eliminated without unacceptable performance costs:

Multi-GPU reductions: when gradients are summed across GPUs, the order of floating-point additions can vary between runs due to timing differences. NCCL (NVIDIA Collective Communications Library) does not guarantee deterministic reductions by default. You can force determinism by using specific reduction algorithms, but this often reduces multi-GPU scaling efficiency.

cuDNN autotuning: cuDNN selects convolution algorithms at runtime based on hardware characteristics and input sizes. The selected algorithm can vary between runs and produce slightly different results. Setting torch.backends.cudnn.benchmark = False and torch.backends.cudnn.deterministic = True forces consistent algorithm selection at a modest performance cost.

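Data loading order: shuffling and per-worker random state (for example, augmentation running inside worker processes) differ between runs unless they are seeded, and some multi-worker loaders deliver batches in an order that depends on OS scheduling. Use a seeded sampler and set worker_init_fn to seed each data loader worker deterministically.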
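
For PyTorch, the seeded sampler and worker_init_fn mentioned above can be sketched as follows. This follows the pattern in PyTorch's reproducibility notes and assumes PyTorch and NumPy are installed; build_loader is a hypothetical wrapper, and the imports sit inside the functions only so the module can be loaded on machines without those packages.

```python
import random


def seed_worker(worker_id: int) -> None:
    """Seed Python and NumPy RNGs inside one DataLoader worker process."""
    import numpy as np
    import torch

    # Inside a worker, torch.initial_seed() is already derived from the
    # loader's base seed and the worker id; reuse it for the other RNGs.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def build_loader(dataset, batch_size: int, seed: int, num_workers: int = 4):
    """Build a shuffling DataLoader whose order depends only on `seed`."""
    import torch
    from torch.utils.data import DataLoader

    generator = torch.Generator()
    generator.manual_seed(seed)  # drives the shuffle order
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        worker_init_fn=seed_worker,
        generator=generator,
    )
```

Two loaders built with the same dataset and seed should yield batches in the same order, which is easy to assert in a unit test before a long training run.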

The pragmatic approach is to distinguish between exact reproducibility (bit-identical results) and statistical reproducibility (results within an expected variance band). Exact reproducibility is achievable for single-GPU training with deterministic mode enabled, provided the hardware and software stack are unchanged. For multi-GPU distributed training, target statistical reproducibility: define acceptable variance bounds for your key metrics, and flag runs that fall outside those bounds for investigation.
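
One simple way to define such a variance band is the mean of a metric over several baseline runs, plus or minus a multiple of its standard deviation. The sketch below is one possible convention, not the only one: the function names and the default multiplier k=3.0 are assumptions, and the right band width depends on the metric and the workload.

```python
from statistics import mean, stdev


def variance_band(baseline: list, k: float = 3.0) -> tuple:
    """Return (low, high) as mean +/- k sample standard deviations."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline runs to estimate spread")
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - k * spread, centre + k * spread


def within_band(value: float, baseline: list, k: float = 3.0) -> bool:
    """True if a new run's metric falls inside the baseline band."""
    low, high = variance_band(baseline, k)
    return low <= value <= high
```

A reproduction run whose metric falls outside the band is not necessarily wrong, but it is the signal to compare manifests and look for an uncontrolled change.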

Making reproducibility a team habit

Technical infrastructure alone does not create reproducible training. The practices must be embedded in the team's workflow:

Mandatory manifests: no model artifact is registered without a complete manifest. Enforce this in your model registry admission policy — if the manifest is incomplete, the registration fails.

Reproduction tests: periodically select a random historical training run and attempt to reproduce it. This is the equivalent of a disaster recovery drill for your training infrastructure. If reproduction fails, investigate why and close the gap.

Environment upgrade discipline: CUDA, cuDNN, and driver upgrades should be deliberate, tested events — not silent background updates. Pin driver versions on training nodes using your configuration management tool (Ansible, Puppet, or similar) and upgrade only through a controlled change process that includes reproduction testing.

Immutability by default: datasets, container images, and configuration snapshots are append-only. Deleting or overwriting a versioned artifact should require explicit approval and leave an audit trail.
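
The environment upgrade discipline above can be backed by a check that compares each training node's driver version with the pinned one before a job starts. This is a sketch assuming nvidia-smi is on the PATH of the node; the function names are hypothetical, and the parsing assumes the two-column CSV output produced by the query shown.

```python
import subprocess

# Ask nvidia-smi for one "name, driver_version" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"]


def parse_gpu_query(output: str) -> list:
    """Parse the CSV lines into dicts, skipping blank lines."""
    gpus = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, driver = [part.strip() for part in line.rsplit(",", 1)]
        gpus.append({"name": name, "driver_version": driver})
    return gpus


def read_gpus() -> list:
    """Run nvidia-smi on this node; raises if the command fails."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_query(result.stdout)


def driver_drift(expected_driver: str, gpus: list) -> list:
    """Return the GPUs whose driver differs from the pinned version."""
    return [gpu for gpu in gpus if gpu["driver_version"] != expected_driver]
```

The parsed list doubles as the hardware section of a run record, so the same call serves both the pre-flight check and the metadata capture.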

Reproducibility is an investment with a delayed payoff. The first time a regulatory auditor asks you to demonstrate how a production model was produced and you can answer completely in five minutes instead of five weeks, the investment pays for itself many times over.

Featured image by Thorium on Unsplash.

Reproducible Training Environments for On-Premises AI Pipelines