Build Distributed GPU Clusters with Networking and Orchestration Tips
Distributed GPU clusters are now the default way to train large AI models, run high-throughput inference, and share expensive accelerators across multiple teams. The hard part is not buying GPUs; it is wiring them together with the right networking and orchestration so they actually run fast and stay manageable over time.
In this guide, DevOps, MLOps, and AI engineers will learn how to design, deploy, and operate distributed GPU clusters with:
- The right network fabric (InfiniBand vs RoCE/Ethernet).
- A Kubernetes‑based orchestration layer tuned for GPUs.
- Practical tips for performance, reliability, and multi‑tenant use.
For teams that prefer GPU servers over building everything from scratch, Perlod Hosting can be a good option.
What Is a Distributed GPU Cluster?
A distributed GPU cluster for AI is a group of GPU dedicated servers connected over a high‑speed network and managed by a scheduler or orchestrator like Kubernetes. Jobs use multiple GPUs, often across many nodes, to train or serve models faster than a single machine can handle.
Typical use cases of a distributed GPU cluster for AI include:
- Training LLMs and Vision models.
- Multi-GPU fine-tuning and LoRA workloads.
- High‑throughput online inference.
- Research clusters shared by multiple teams or business units.
Key layers in any distributed GPU cluster for AI include:
- Compute layer: GPU servers with NVLink/PCIe, CPU, RAM, and local NVMe.
- Network layer: High‑speed, low‑latency network between nodes.
- Storage layer: Shared datasets, checkpoints, object storage.
- Orchestration layer: Kubernetes, Slurm, Ray, or similar systems to place and manage jobs.
Core Layers of a Distributed GPU Cluster for AI
1. Compute Layer (GPU Nodes): The first core layer of a distributed cluster is the compute layer, the GPU nodes themselves.
A typical GPU node has:
- 2–8 datacenter GPUs like L40S, A100, and H100, with NVLink or high‑lane PCIe.
- Dual or quad CPU sockets with plenty of RAM.
- Local NVMe for fast scratch space and checkpoints.
- One or more 100–400 Gb/s network interfaces for cluster traffic.
It is recommended to keep GPUs on the same NUMA domain where possible and use NVLink when available to speed up intra‑node tensor and gradient exchange.
Also, ensure enough CPU cores and RAM; underpowered CPUs can starve GPUs with slow data loading.
2. Network Fabric (Data Plane): For multi‑node training and inference, the network becomes just as important as the GPUs.
- InfiniBand: It is a very high‑speed network with extremely low delay and very steady performance. It has special hardware that helps GPUs work together efficiently, and it is widely used in HPC and large AI training.
- RoCEv2 over Ethernet: It runs the same fast GPU‑to‑GPU communication (RDMA) over regular Ethernet networks. With the right settings and modern switches, it can perform almost as well as InfiniBand while using standard Ethernet stacks that most teams already know how to manage.
Typical speeds in a modern GPU cluster for AI include:
- 100 Gb/s per node for mid‑range GPU nodes.
- 200–400 Gb/s per node for H100 training nodes.
If the network is slow or lossy, the library that handles communication between GPUs, such as NCCL, becomes the bottleneck: GPUs wait on gradient exchange and sit idle instead of training.
3. Storage and Data: Storage design should support high-bandwidth reads for datasets, plus fast local NVMe for temporary data and checkpoints.
Reserve part of RAM as an in-memory filesystem such as /dev/shm (for example, 20 GB or more). This lets training processes on the same machine exchange data quickly without going through slower regular storage.
For teams, here is a simple pattern:
- Object storage for long‑term datasets and checkpoints.
- Local NVMe on each node for active training runs.
- Small, fast shared filesystem or distributed cache for frequently used artifacts.
Design the Network for Distributed GPU Clusters | InfiniBand and RoCEv2
Both InfiniBand and RoCEv2 can power a strong GPU cluster for AI, but they behave differently.
Choose InfiniBand if you care about ultra-low latency and consistent performance (best for large-scale training and HPC) and want a lossless fabric with hardware congestion control and adaptive routing.
It also fits if you are building a dedicated AI cluster and are okay with a specialized stack.
Benefits of InfiniBand:
- Sub‑5µs latency and very low jitter for collectives.
- Built‑in credit‑based flow control and self‑healing paths.
- Used widely in top‑end AI and HPC supercomputers.
Choose RoCEv2 over Ethernet if you want to reuse your existing Ethernet skill set and tooling and care about scale and flexibility across racks, pods, and even sites.
It is also a good option if you want to mix AI traffic with conventional services on one common fabric.
Benefits of RoCEv2/Ethernet:
- Runs over 100–400 GbE switches and NICs that most teams already understand.
- Easier multi‑rack and multi‑site routing.
- Often cheaper and more flexible at a large scale.
Note: RoCEv2 needs careful tuning to avoid packet loss and latency issues. When building a RoCE‑based distributed GPU cluster, pay attention to:
- PFC (Priority Flow Control): Enable for RoCE queues only, not for all traffic.
- ECN (Explicit Congestion Notification): Turn on ECN in switches for early congestion signals.
- DSCP and QoS classes: Mark RoCE traffic to get the correct queue and priority.
- MTU: Use a jumbo MTU like 9000 on all hops consistently.
- Topology‑aware routing: Tune hashing and ECMP to avoid hot spots.
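Beyond the switch-level settings above, NCCL itself usually needs a few hints about which interfaces to use. Here is a minimal sketch of a pod that pins NCCL to an RDMA-capable interface; the interface and device names (eth1, mlx5_0) are placeholders for your environment, and the GID index of 3 is only the commonly used RoCEv2 value, so verify it on your own NICs (for example with ibv_devinfo and your vendor's tools).

apiVersion: v1
kind: Pod
metadata:
  name: nccl-tuned-worker                       # hypothetical name for illustration
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04   # illustrative image only
    env:
    - name: NCCL_SOCKET_IFNAME                  # interface NCCL uses for bootstrap/socket traffic
      value: "eth1"                             # placeholder: your high-speed data-plane interface
    - name: NCCL_IB_HCA                         # RDMA device(s) NCCL should use
      value: "mlx5_0"                           # placeholder: check with ibv_devinfo
    - name: NCCL_IB_GID_INDEX                   # GID index that selects RoCEv2 on many NICs
      value: "3"
    - name: NCCL_DEBUG                          # print NCCL topology/transport decisions at startup
      value: "INFO"
    resources:
      limits:
        nvidia.com/gpu: "1"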
Topology and Oversubscription: For AI clusters, it is common to use a fat-tree or Clos network design, which gives many equal paths between switches so traffic can spread out evenly.
- Keep oversubscription low (ideally 1:1, at most 2:1) between the leaf and spine switches for training traffic, so GPUs are not starved by the network.
- Keep GPUs that work together for latency‑sensitive training jobs inside the same group so their traffic does not have to cross slow or congested links.
- For inference clusters, where latency is usually less strict and traffic is more bursty, you can accept a bit more oversubscription to save cost.
Separate Networks for Control and Data: A good practice is to split control traffic and data traffic onto different networks:
- Control plane network: Used for SSH, Kubernetes API, monitoring, and logs. It does not need ultra‑low latency.
- Data plane network: Used for high‑speed GPU traffic such as NCCL, RDMA, and model data. This needs maximum bandwidth and low latency.
In Kubernetes, use the normal CNI plugin for pod IPs, service traffic, and control‑plane tasks. Add a separate SR‑IOV or RDMA interface to pods that run heavy GPU training, so their data traffic bypasses the overlay network and talks directly to the high‑speed NIC for best performance.
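As a rough illustration of that pattern, here is a sketch of a Multus NetworkAttachmentDefinition plus a pod that attaches to it. It assumes the SR-IOV CNI and device plugins are installed; the resource name example.com/sriov_rdma, the subnet, and the object names are placeholders you would replace with what your own device plugin exposes.

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net                                       # hypothetical name
  annotations:
    k8s.v1.cni.cncf.io/resourceName: example.com/sriov_rdma   # placeholder VF resource
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": { "type": "host-local", "subnet": "192.168.100.0/24" }
    }
---
apiVersion: v1
kind: Pod
metadata:
  name: training-worker                                # hypothetical name
  annotations:
    k8s.v1.cni.cncf.io/networks: rdma-net              # attach the secondary high-speed interface
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: "4"
        example.com/sriov_rdma: "1"                    # placeholder VF resource from the device plugin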
Orchestrating Distributed GPU Clusters with Kubernetes
Kubernetes is the default orchestrator for many AI clusters thanks to its ecosystem, multi-tenant support, and GPU vendor integrations. Vendors provide device plugins so Kubernetes can schedule GPUs like any other resource.
- The NVIDIA plugin adds a resource called nvidia.com/gpu.
- AMD and other providers have similar plugins.
Here is a simple example pod that requests one GPU:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  containers:
  - name: app
    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: "1"
Important notes:
- Always set GPU limits so the scheduler knows which nodes can run the pod.
- Keep GPU images as small as possible, with only the drivers and runtimes you really need.
In real clusters, not all GPU nodes are the same: you might have different GPU types, memory sizes, and network speeds. Label your nodes and use node affinity to place pods on the right hardware.
Label nodes:
kubectl label node gpu-node-1 gpu-type=h100 network=400g
kubectl label node gpu-node-2 gpu-type=l40s network=100g
Use node affinity in a pod:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-type
          operator: In
          values: ["h100"]
This ensures that large, latency-sensitive training jobs land on the best-connected GPUs, while cheaper or test jobs use lower-tier GPUs. Kubernetes Node Feature Discovery can automate many of these labels.
Multi‑Tenant Isolation: In clusters shared by many teams, use namespaces per team or project, apply ResourceQuotas and LimitRanges to avoid one team grabbing all GPUs, and use RBAC so only authorized users can run GPU workloads.
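For example, a per-team namespace quota on GPUs might look like the sketch below; the namespace name team-a and the limit of 8 GPUs are placeholders.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a                 # placeholder team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"    # this team can request at most 8 GPUs in total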
Also, combine MIG (Multi‑Instance GPU) and time‑slicing to safely share a single GPU between many small jobs.
MIG can split one big GPU into several smaller virtual GPUs, each with dedicated memory and compute, which is ideal for many small inference or dev workloads.
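As an illustration, the NVIDIA device plugin can read a config file that enables time-slicing. The sketch below assumes a recent plugin version that supports the sharing.timeSlicing section, and the replica count of 4 is just an example; check the plugin documentation for the exact schema your version expects.

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4        # each physical GPU is advertised as 4 shareable slices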
Running Distributed Jobs: On top of Kubernetes, teams often add higher‑level tools to run multi‑GPU jobs, including:
- Kubeflow Training Operators like PyTorchJob, TFJob, and MPIJob.
- Ray or KubeRay for distributed Python workloads.
- Volcano and Kueue for job queuing and gang scheduling.
These higher-level operators also enable fault-tolerant training: if one worker fails, the system restarts it and continues from the last checkpoint instead of losing the whole training run.
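As a starting point, here is a minimal sketch of a PyTorchJob for the Kubeflow Training Operator. It assumes the operator is installed; the image, command, and worker count are placeholders for your own training code.

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: llm-finetune                   # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch              # the operator expects this container name
            image: nvidia/cuda:12.2.0-runtime-ubuntu22.04   # placeholder training image
            command: ["python", "train.py"]                 # placeholder entrypoint
            resources:
              limits:
                nvidia.com/gpu: "1"
    Worker:
      replicas: 3                      # placeholder: scale to your job size
      restartPolicy: OnFailure
      template:
        spec:
          containers:
          - name: pytorch
            image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: "1"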
Performance Tuning for Distributed Training
Here are the main performance tips for a distributed GPU cluster for AI:
Standard vNICs are fine for small models. For large models, enable SR-IOV to give pods near-native NIC access and use RDMA (InfiniBand or RoCEv2) so GPU collectives bypass the CPU and the kernel networking stack.
In practice, on modern clusters:
- Without SR‑IOV/RDMA, the virtual network layer becomes the bottleneck.
- With SR‑IOV and RDMA, GPU utilization improves and training time drops for large models.
For multi-process training, increase shared memory to at least 20 GiB for heavy workloads. Use fast NVMe for dataset caching and checkpoints to avoid slow I/O.
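In Kubernetes, the usual way to enlarge shared memory for a training pod is a memory-backed emptyDir mounted at /dev/shm. A minimal sketch, with 20Gi as an example size:

apiVersion: v1
kind: Pod
metadata:
  name: shm-heavy-trainer           # hypothetical name
spec:
  containers:
  - name: trainer
    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm           # data loader workers use this for inter-process data exchange
    resources:
      limits:
        nvidia.com/gpu: "1"
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory                # back the volume with RAM (tmpfs)
      sizeLimit: 20Gi               # example size; tune to your workload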
You must also monitor everything, including GPUs, the network, and jobs.
Practical Step‑by‑Step Plan to Build Your First Distributed GPU Cluster for AI
1. The first step is to define your workload. Answer the following questions; they determine GPU type and count, network speed, and how strict your orchestration and quotas must be.
- Are you focused on training, inference, or both?
- Typical job size: single‑GPU, 8‑GPU, 64‑GPU?
- Are workloads interactive (research notebooks) or batch (overnight training)?
- How many teams will share the cluster?
2. The next step is to choose the network and hardware.
For high‑end, latency‑sensitive training, consider InfiniBand with NCCL‑optimized topologies. For flexible, mixed workloads and easy scaling, use RoCEv2 on modern Ethernet switches with correct tuning.
Pick GPU servers from a provider like Perlod Hosting that support:
- Enough PCIe lanes for high‑bandwidth NICs.
- NVLink between GPUs if needed.
- Enough CPU and RAM for data pipelines.
3. Deploy Kubernetes with GPU support: Install Kubernetes on your GPU nodes and deploy the GPU device plugin or the GPU Operator, which also manages the drivers and container runtime. Label nodes by GPU model and network capability, and add your training frameworks.
4. Set Up Observability and Policies:
- Install Prometheus with Grafana and GPU metrics exporters such as NVIDIA's DCGM exporter (a minimal scrape-config sketch follows this list).
- Add log collection for pods and system logs.
- Define namespaces, quotas, and RBAC before many teams start using the cluster.
- Create standard job templates (YAML) for training and inference so teams follow best practices by default.
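For the metrics piece, if you run the Prometheus Operator, a ServiceMonitor pointed at the DCGM exporter Service is a common pattern. In the sketch below, the namespaces, label selector, and port name are placeholders that must match however your exporter is actually deployed.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter
  namespace: monitoring                 # placeholder: where Prometheus looks for ServiceMonitors
spec:
  selector:
    matchLabels:
      app: dcgm-exporter                # placeholder: must match the exporter Service labels
  namespaceSelector:
    matchNames:
    - gpu-monitoring                    # placeholder: namespace where the exporter runs
  endpoints:
  - port: metrics                       # placeholder: the Service port exposing GPU metrics
    interval: 15s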
5. Performance tuning:
- Benchmark simple distributed training jobs across 2, 4, 8, 16 GPUs.
- Watch GPU utilization and network bandwidth.
- Turn on SR‑IOV and RDMA where helpful.
- Adjust batch sizes, gradient accumulation, number of workers, and data loader settings.
FAQs
Is InfiniBand always better than RoCE over Ethernet?
Not always. InfiniBand is usually better for the fastest possible training, while RoCE is often easier to scale and manage. For many teams, RoCE or even high-speed TCP-only Ethernet is enough, especially for mixed training and inference clusters.
How many GPUs are needed for a distributed cluster?
2–4 GPU servers are often enough to build a distributed cluster.
Why use Kubernetes instead of Slurm for GPU clusters?
Slurm is popular in HPC and classic batch clusters. Kubernetes is the better fit if you run microservices and APIs alongside training jobs, want strong multi-tenant isolation and self-service for teams, and rely on cloud-native tooling and operators.
Conclusion
Distributed GPU clusters are no longer only for hyperscalers. With the right mix of GPU hardware, high‑speed networking, and Kubernetes‑based orchestration, DevOps, MLOps, and AI teams can build clusters that are fast, reliable, and cost‑effective.
Providers like Perlod Hosting give you the GPU hardware foundation so your teams can spend more time shipping models and less time fighting infrastructure.
We hope you enjoyed this guide. Subscribe to our X and Facebook channels to get the latest articles and updates on GPU hosting.