Network Fabric Design for Distributed On-Premises AI Clusters

On-Premises AI · AI Architecture · Best Practices · Advanced · Foundations

Architecture patterns for the network layer connecting GPU nodes in on-premises AI clusters, from InfiniBand topologies to Ethernet-based alternatives and practical bandwidth planning.

The Network Is the Bottleneck You Forgot About

When organizations plan on-premises AI infrastructure, the GPU gets all the attention. Teams spend weeks evaluating H100 vs. H200, calculating VRAM requirements, and sizing storage arrays. The network fabric connecting those GPUs often gets a single line in the architecture diagram: "high-speed interconnect." This is a costly oversight.

For single-node inference, the network barely matters: data flows in, predictions flow out, and the bottleneck is GPU compute. But the moment you scale to multi-node training, multi-GPU inference with tensor parallelism, or distributed retrieval-augmented generation, the network becomes the critical path. A training job that takes 4 hours on a well-designed fabric can take 12 hours on the same GPUs connected by a congested Ethernet switch. Worse, when gradient synchronization is too slow or too inconsistent, synchronous jobs can stall until a collective operation times out and the run fails, and asynchronous schemes that tolerate stale gradients can converge poorly or not at all.

InfiniBand: The Gold Standard for GPU-to-GPU Communication

InfiniBand (IB) remains the dominant interconnect for serious AI workloads. NVIDIA's DGX and HGX systems ship with ConnectX-7 adapters supporting NDR (400 Gbps) InfiniBand, and the latest generation pushes to XDR (800 Gbps). The advantage is not just raw bandwidth. It is RDMA (Remote Direct Memory Access), which, combined with GPUDirect RDMA, lets the network adapter read and write GPU memory on a remote node directly, without staging data through host memory or involving the CPU or operating system kernel on the data path.

For distributed training using frameworks like DeepSpeed, Megatron-LM, or PyTorch FSDP, RDMA eliminates the overhead of packing gradients into TCP packets, copying them through the kernel network stack, and unpacking them on the other side. The collective operations (AllReduce, AllGather) that dominate training communication can run at near-wire-speed with IB, while the same operations over TCP/IP pay a substantial software overhead tax.

InfiniBand topologies for AI clusters typically use a fat-tree design. Leaf switches connect directly to GPU nodes (usually 32-40 ports per switch), and spine switches interconnect the leaf switches. A full bisection bandwidth fat-tree ensures that any pair of nodes can communicate at full link speed simultaneously. For clusters under 128 nodes, a two-tier leaf-spine topology suffices. Beyond that, consider a three-tier design or use NVIDIA's rail-optimized topology, which aligns network rails with the NVLink domains inside each server.
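To make the two-tier sizing concrete, here is a minimal Python sketch that counts leaf and spine switches for a full-bisection leaf-spine fabric. It assumes every switch has the same port count (64 is used as an example radix, not a figure from this article), that each leaf splits its ports evenly between endpoints and uplinks, and that "endpoints" means network adapters rather than servers (a multi-GPU server often has several). The function name and return keys are hypothetical.

```python
import math


def two_tier_fat_tree(endpoints: int, switch_radix: int = 64) -> dict:
    """Estimate switch counts for a non-blocking two-tier leaf-spine fabric.

    Assumption: each leaf uses half its ports for endpoints and half for
    uplinks, which gives full bisection bandwidth.
    """
    if switch_radix <= 0 or switch_radix % 2:
        raise ValueError("switch_radix must be a positive even number")
    down_ports = switch_radix // 2
    # A spine can reach at most `switch_radix` leaves, each with `down_ports` endpoints.
    max_endpoints = switch_radix * down_ports
    if endpoints > max_endpoints:
        raise ValueError(
            f"{endpoints} endpoints exceed the two-tier limit of {max_endpoints}; "
            "a third tier is needed"
        )
    leaves = math.ceil(endpoints / down_ports)
    # Total uplinks equal leaves * down_ports; each spine terminates switch_radix of them.
    spines = math.ceil(leaves * down_ports / switch_radix)
    return {
        "leaf_switches": leaves,
        "spine_switches": spines,
        "max_endpoints": max_endpoints,
    }


# Example: 32 servers with 8 adapters each = 256 endpoints.
print(two_tier_fat_tree(256))
```

The `max_endpoints` value is the point at which a third tier becomes unavoidable for the chosen radix; real designs usually also reserve ports for storage, management, and growth.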

Ethernet Alternatives: When InfiniBand Is Not an Option

InfiniBand requires specialized switches, cables, and operational expertise that not every organization can justify. If your workloads are primarily inference, fine-tuning with small batch sizes, or RAG pipelines where the network carries embeddings and document chunks rather than gradient tensors, high-speed Ethernet may be sufficient and significantly cheaper to operate.

Modern 100GbE and 400GbE Ethernet with RoCE v2 (RDMA over Converged Ethernet) brings RDMA capabilities to standard Ethernet hardware. RoCE requires lossless Ethernet, which means configuring Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) on every switch in the path. This is more operationally complex than standard Ethernet, and the PFC and ECN settings are easy to get wrong, but it lets teams reuse Ethernet switches, tooling, and skills they already have instead of operating a separate InfiniBand fabric.

The practical performance gap between NDR InfiniBand and 400GbE RoCE v2 for inference workloads is smaller than many assume. For serving a 70B-parameter model across 4 nodes using tensor parallelism, both interconnects deliver acceptable inter-token latency. The gap widens significantly for large-scale training where AllReduce operations saturate the fabric for sustained periods.

A hybrid approach works well for many on-premises deployments: InfiniBand for the GPU training cluster, and 100GbE Ethernet for the inference serving fleet and supporting infrastructure (storage, monitoring, orchestration). This concentrates the InfiniBand investment where it delivers the most value.

Bandwidth Planning: How Much Is Enough?

Sizing network bandwidth requires understanding your communication patterns. Three workload profiles drive different requirements:

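Distributed training with data parallelism. Each GPU computes gradients independently, then all GPUs synchronize via AllReduce. With ring AllReduce across N participants, each participant sends and receives about 2(N-1)/N times the gradient size per step, which approaches 2x the model size as N grows. A 70B-parameter model in FP16 has 140 GB of gradients, so each participant moves roughly 280 GB per synchronization step. At NDR InfiniBand speeds (400 Gbps = 50 GB/s per link), that is about 5.6 seconds of communication per step over a single link. Because every link in the ring carries traffic at the same time, this figure stays roughly constant as nodes are added instead of growing with cluster size, provided the topology has enough bisection bandwidth to keep every link at full speed. Multiple network adapters per node and overlapping communication with the backward pass reduce the time that training actually spends waiting.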
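As a rough check on the AllReduce arithmetic, the bandwidth term of ring AllReduce can be estimated in a few lines of Python. This is a sketch under simplifying assumptions: one link per participant, no per-message latency, no overlap with compute, and a fabric that keeps every link at full speed. The function name and parameters are illustrative.

```python
def ring_allreduce_seconds(
    params: float, bytes_per_param: int, num_ranks: int, link_gbps: float
) -> float:
    """Bandwidth-only estimate of one ring AllReduce over the full gradient set."""
    if num_ranks < 1:
        raise ValueError("num_ranks must be at least 1")
    payload_bytes = params * bytes_per_param
    # Each rank sends (and receives) 2 * (N - 1) / N of the payload in a ring.
    per_rank_bytes = 2 * (num_ranks - 1) / num_ranks * payload_bytes
    link_bytes_per_second = link_gbps * 1e9 / 8
    return per_rank_bytes / link_bytes_per_second


# 70B parameters in FP16 over 400 Gbps links, for a few ring sizes.
for ranks in (8, 64, 512):
    seconds = ring_allreduce_seconds(70e9, 2, ranks, 400)
    print(f"{ranks:>4} ranks: {seconds:.2f} s")
```

The estimate climbs toward the 5.6-second ceiling as the ring grows and never exceeds it, which is why per-link bandwidth, not node count, dominates this term.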

Tensor parallelism for inference. When a model is split across GPUs on different nodes, activations must be transferred between nodes at every transformer layer. This is latency-sensitive: each layer waits for the previous layer's output. For a 70B model split across 4 nodes, expect roughly 10-20 MB of activations per layer. At 400 Gbps (50 GB/s), this transfers in under 1 millisecond. At 25 Gbps Ethernet (about 3.1 GB/s), the same payload needs roughly 3-6.5 milliseconds per layer of wire time alone, and across 80 layers that adds about 260-510 ms to every token generation before any protocol overhead.
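The per-layer transfer cost can be sketched as a pure serialization-time calculation. The Python below is a lower bound under stated assumptions: it counts only the time to push the bytes onto the wire at line rate and ignores switch latency, protocol overhead, and congestion, all of which push real numbers higher. The function name is hypothetical, and the 10-20 MB payload and 80-layer count are the figures used in this section.

```python
def activation_transfer_ms(activation_mb: float, link_gbps: float, layers: int):
    """Return (per-layer ms, per-token ms) of wire time for activation transfers."""
    link_bytes_per_second = link_gbps * 1e9 / 8
    per_layer_ms = activation_mb * 1e6 / link_bytes_per_second * 1e3
    return per_layer_ms, per_layer_ms * layers


for gbps in (400, 100, 25):
    for mb in (10, 20):
        per_layer, per_token = activation_transfer_ms(mb, gbps, layers=80)
        print(f"{gbps:>3} Gbps, {mb} MB: {per_layer:.2f} ms/layer, {per_token:.0f} ms/token")
```

Running the same function with your own measured activation sizes is a quick way to decide whether a slower link is tolerable for a given latency budget.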

RAG and retrieval workloads. The network carries embedding vectors and document chunks between the inference servers and the vector database. This is moderate bandwidth (typically 1-10 Gbps aggregate) but latency-sensitive for real-time applications. Standard 25GbE connections with proper QoS configuration are adequate for most RAG deployments.

Storage Network Considerations

AI workloads impose unique demands on the storage network. Training data must stream to GPUs without starving the compute pipeline. Model checkpoints — often 100-500 GB each — must be written periodically without blocking training. And the model registry needs low-latency random reads for loading models into serving infrastructure.

Separate the storage network from the GPU interconnect fabric. Use dedicated network interfaces for storage traffic, connected to a separate switch tier or at minimum isolated in a separate VLAN with strict QoS guarantees. NVMe-oF (NVMe over Fabrics) is increasingly the protocol of choice for high-performance AI storage, offering near-local-disk latency over the network.

For training data specifically, calculate the minimum required storage bandwidth as: batch_size x sample_size x steps_per_second. If your training pipeline processes 1,024 samples per step, each sample is 2 MB (a high-resolution image or a long document), and you run 2 steps per second, you need at least 4 GB/s of sustained storage read bandwidth. Provision 2-3x this number to account for I/O bursts during data augmentation and shuffling.
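The storage bandwidth formula translates directly into a small helper. This Python sketch uses decimal units (1 GB = 1000 MB), so the worked example comes out at about 4.1 GB/s, not an even 4. The function name and the default headroom factor are illustrative choices within the 2-3x range suggested here.

```python
def storage_read_bandwidth_gbs(
    batch_size: int, sample_mb: float, steps_per_second: float, headroom: float = 2.5
):
    """Return (sustained GB/s, provisioned GB/s) for the training data path."""
    sustained = batch_size * sample_mb * steps_per_second / 1000  # MB/s -> GB/s
    return sustained, sustained * headroom


sustained, provisioned = storage_read_bandwidth_gbs(1024, 2.0, 2.0)
print(f"sustained: {sustained:.2f} GB/s, provision for: {provisioned:.2f} GB/s")
```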

Monitoring and Troubleshooting the AI Network Fabric

Network problems in AI clusters manifest as training slowdowns, not as outages. A single degraded link in an AllReduce ring forces all other GPUs to wait, turning a 4-hour training job into an 8-hour one. Without network-level monitoring, this looks like a GPU performance issue.

Instrument your fabric with three layers of monitoring:

Link-level health. Track port error counters (CRC errors, symbol errors, link retrains) on every switch port. InfiniBand's perfquery tool and Ethernet's standard SNMP counters both expose these. A single port accumulating errors can degrade the entire cluster.

Traffic-level visibility. Monitor per-port utilization and identify hot spots. Tools like UFM for InfiniBand or sFlow/IPFIX for Ethernet provide traffic analytics.

Application-level correlation. Correlate network metrics with training metrics (step time, communication time from PyTorch profiler) to identify when the network is the bottleneck.
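For the link-level layer, the useful signal is usually the change in an error counter between two polls, not its absolute value. The Python sketch below compares two counter snapshots and flags ports whose counters grew. How the snapshots are collected (perfquery, SNMP, or a vendor API) is left out, and the snapshot structure, port names, and counter names are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

# port name -> counter name -> cumulative value (hypothetical structure)
Snapshot = Dict[str, Dict[str, int]]


def ports_with_new_errors(
    before: Snapshot, after: Snapshot, threshold: int = 0
) -> List[Tuple[str, str, int]]:
    """Return (port, counter, delta) for counters that grew by more than threshold."""
    flagged = []
    for port, counters in after.items():
        previous = before.get(port, {})
        for name, value in counters.items():
            delta = value - previous.get(name, 0)
            if delta > threshold:
                flagged.append((port, name, delta))
    # Largest increase first, so the worst link is at the top of the report.
    return sorted(flagged, key=lambda item: item[2], reverse=True)


before = {"leaf1/p7": {"symbol_errors": 12, "link_retrains": 0},
          "leaf1/p8": {"symbol_errors": 0, "link_retrains": 0}}
after = {"leaf1/p7": {"symbol_errors": 950, "link_retrains": 3},
         "leaf1/p8": {"symbol_errors": 0, "link_retrains": 0}}

for port, counter, delta in ports_with_new_errors(before, after):
    print(f"{port}: {counter} +{delta}")
```

Counters that reset on reboot or wrap around would produce negative deltas, which this sketch silently ignores; a production check should handle those cases explicitly.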

Establish a baseline during a known-good training run. Record per-step communication time, per-link utilization, and error counts. When future runs deviate, you have a reference point for investigation. Network fabric issues are easier to diagnose when you know what normal looks like.
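One simple way to put the baseline to work is a threshold on per-step communication time. The Python sketch below flags a run whose mean communication time exceeds the baseline mean by more than a chosen number of standard deviations. The three-sigma default, the function name, and the sample values are illustrative assumptions, not measurements.

```python
from statistics import mean, stdev
from typing import Sequence


def exceeds_baseline(
    baseline_ms: Sequence[float], current_ms: Sequence[float], sigmas: float = 3.0
) -> bool:
    """True if current mean communication time is above baseline mean + sigmas * stdev."""
    if len(baseline_ms) < 2:
        raise ValueError("need at least two baseline samples to estimate spread")
    if not current_ms:
        raise ValueError("need at least one current sample")
    limit = mean(baseline_ms) + sigmas * stdev(baseline_ms)
    return mean(current_ms) > limit


baseline = [100.0, 102.0, 98.0, 101.0, 99.0]   # per-step communication time, known-good run
suspect = [120.0, 118.0, 121.0]                # same metric from a later run

if exceeds_baseline(baseline, suspect):
    print("communication time is above baseline; check link counters and utilization")
```

A check like this only says that something changed; the link-level and traffic-level data recorded alongside the baseline is what narrows it down to a specific port or path.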

Featured image by Marek Piwnicki on Unsplash.