How to Build a Distributed Training Setup with Horovod
Training large AI models takes a significant amount of time and computing power. When you train on a single GPU, you might wait days or weeks to see results. Horovod solves this by letting you spread your training job across many GPUs and servers. This guide walks you through a complete Horovod distributed training setup.
You will learn how to install Horovod with MPI and NCCL, configure your PyTorch or TensorFlow training code, and run training across multiple GPUs and servers.
By the end of this guide, your team will have a scalable AI training platform that is faster, cheaper, and easier to manage than cloud alternatives.
Many teams struggle to find affordable, reliable GPU hosting for distributed training. PerLod Hosting solves this by providing dedicated GPU servers with transparent pricing and no oversubscription.
What is Horovod and How It Works?
Horovod is an open-source distributed training framework created by Uber for TensorFlow, Keras, PyTorch, and Apache MXNet.
It takes your single-GPU training script and scales it to many GPUs with minimal code changes.
Instead of using a central parameter server to coordinate training, Horovod uses the ring-allreduce algorithm, which means each GPU communicates only with its two neighbors in a ring, passing gradients around the circle.
To understand it better, imagine 4 GPUs in a circle. Each GPU:
- Sends its gradients to the GPU on the right.
- Receives gradients from the GPU on the left.
- Adds them together and passes them on.
After two passes around the ring (one to sum the gradients, one to share the result), every GPU holds the same averaged gradients. This is much faster than having every GPU talk to a central server.
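To make this concrete, below is a small Python sketch that simulates ring-allreduce on plain lists standing in for each GPU's gradients. It illustrates the communication pattern only; it is not Horovod's implementation, which runs on GPU tensors through NCCL or MPI.

# Toy simulation of ring-allreduce (illustration only -- Horovod does this
# with NCCL/MPI on GPU tensors, not Python lists).
def ring_allreduce(worker_grads):
    n = len(worker_grads)                  # number of "GPUs" in the ring
    size = len(worker_grads[0]) // n       # each gradient is split into n chunks
                                           # (assumes the length divides evenly)
    chunks = [[g[i * size:(i + 1) * size] for i in range(n)] for g in worker_grads]

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n to its
    # right-hand neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        for r in range(n):
            idx, dst = (r - s) % n, (r + 1) % n
            chunks[dst][idx] = [a + b for a, b in zip(chunks[dst][idx], chunks[r][idx])]

    # Phase 2: allgather. In step s, worker r forwards the fully summed chunk
    # (r + 1 - s) % n to its right-hand neighbor, which overwrites its copy.
    for s in range(n - 1):
        for r in range(n):
            idx, dst = (r + 1 - s) % n, (r + 1) % n
            chunks[dst][idx] = chunks[r][idx]

    # Average and reassemble: every worker ends up with the same result.
    return [[x / n for chunk in c for x in chunk] for c in chunks]

# 4 "GPUs", each holding a different 4-element gradient.
grads = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
print(ring_allreduce(grads))   # four identical copies of [7.0, 8.0, 9.0, 10.0]

Note that each worker sends and receives only one chunk per step, so the traffic per GPU stays roughly constant as you add more GPUs; that property is what makes ring-allreduce scale.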
Key features of Horovod include:
- Ring-AllReduce: Efficient gradient averaging without parameter servers.
- Multi-Framework: Works with PyTorch, TensorFlow, Keras, and MXNet.
- Scalable: From 2 GPUs to hundreds of GPUs.
- Fault Tolerant: Recovers from GPU or node failures with Elastic mode.
- Easy Integration: Only 5-6 lines of code changes needed.
In Uber's published benchmarks, Horovod reached about 90% scaling efficiency across 128 servers with 4 GPUs each. In practice, this means doubling your GPUs comes close to doubling your training throughput, with only a small fraction lost to communication.
Prerequisites for Horovod Distributed Training Setup
Before you begin the Horovod distributed training setup, you need the following:
- Operating System: Ubuntu 22.04 or Ubuntu 24.04 LTS.
- Multiple GPUs: 2+ NVIDIA GPUs (can be on the same server or multiple servers).
- NVIDIA CUDA: CUDA 11.0 or newer (12.0+ recommended).
- MPI (Open MPI): Version 4.0+ (avoid 3.1.3, which has bugs).
- NCCL 2.2+: NVIDIA Collective Communications Library.
- Python: Python 3.8+.
- CMake: Version 3.13+.
PerLod Hosting offers dedicated GPU servers with multiple GPUs, full root access, and pre-installed Ubuntu OS. This makes them perfect for Horovod distributed training setups.
Step 1. Install Dependencies For Horovod Setup
The first step is to update your system and install the dependencies and required packages for Horovod setup.
Run the system update and upgrade with the command below:
sudo apt update && sudo apt upgrade -y
Install essential build tools and Open MPI with the following command:
sudo apt install build-essential cmake git libopenmpi-dev openmpi-bin -y
Note: Horovod with TensorFlow 2.10+ requires g++-8 or newer for C++17 support. The default g++ on Ubuntu 22.04 and 24.04 (version 11 or newer) already meets this requirement.
Check if MPI is installed correctly:
mpirun --version
You should see Open MPI version 4.0 or higher. Avoid Open MPI 3.1.3, which has a known hang issue.
Step 2. Verify GPU Setup for Horovod
You must verify that your GPUs are detected with the command below:
nvidia-smi
You should see a table listing all your GPUs, their memory, temperature, and driver version.
Also, check for the NVIDIA driver version with the following command:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
The driver should be version 450 or newer for CUDA 11, and 525 or newer if you are using CUDA 12.
Check CUDA version with:
nvcc --version
CUDA 11.0+ is required. CUDA 12.0+ is recommended for the latest GPUs.
Step 3. Install NCCL for Horovod Distributed Training
NCCL (NVIDIA Collective Communications Library) is what makes GPU-to-GPU communication fast. NCCL 2 introduced the ability to run ring-allreduce across multiple machines.
You can install it in one of two ways: via pip or via the NVIDIA repository.
Option A: Install via pip
Use the command below to install NCCL 2 from pip:
pip install nvidia-nccl-cu12
This installs NCCL 2.x for CUDA 12. For CUDA 11, you can use nvidia-nccl-cu11.
Option B: Install via NVIDIA Repository (Recommended for System-Wide)
Add the NVIDIA repository with the commands below (on Ubuntu 24.04, replace ubuntu2204 in the URL with ubuntu2404):
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
Then, install NCCL 2 with the following command:
sudo apt install libnccl2 libnccl-dev
Verify your installation by using the commands below:
find /usr -name "libnccl.so*" 2>/dev/null
find /usr -name "nccl.h" 2>/dev/null
Both files should be found. If nccl.h is missing, you only have the runtime, not the development headers needed for building Horovod.
Step 4. Create Horovod Python Environment and Install PyTorch
In this step, you must install PyTorch inside a Python virtual environment.
First, create a Python virtual environment and activate it with the commands below:
python3 -m venv horovod-env
source horovod-env/bin/activate
Upgrade pip with the command below:
pip install --upgrade pip setuptools wheel
Install PyTorch with CUDA support by using the following commands:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Finally, verify PyTorch sees your GPUs:
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"
Step 5. Install Horovod for Distributed Training
At this point, you need to install Horovod with PyTorch support and NCCL GPU operations. To do this, run the command below:
HOROVOD_WITH_PYTORCH=1 HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]
For TensorFlow, you can use this command:
HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow]
To build Horovod with both PyTorch and TensorFlow support, use this command:
HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[all-frameworks]
This build takes several minutes because Horovod compiles native code.
To verify Horovod installation, you can run the command below:
horovodrun --check-build
Your output should look similar to this (the version number and available frameworks may differ):
Horovod v0.28.1:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
Make sure [X] NCCL and [X] PyTorch or TensorFlow are enabled.
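Before converting a real training script, you can run a quick end-to-end sanity check. The script below is a minimal sketch written for this guide (the hvd_check.py filename is our own choice, not something Horovod ships); it averages each worker's rank across all workers.

# hvd_check.py -- minimal Horovod sanity check (example script for this guide)
import horovod.torch as hvd
import torch

hvd.init()                               # start Horovod
torch.cuda.set_device(hvd.local_rank())  # pin this process to one GPU

# Each worker contributes its rank; hvd.allreduce averages across workers by default.
x = torch.tensor([float(hvd.rank())], device='cuda')
avg = hvd.allreduce(x, name='sanity_check')
print(f'worker {hvd.rank()} of {hvd.size()}: averaged value = {avg.item():.2f}')

Run it on two GPUs with:

horovodrun -np 2 -H localhost:2 python hvd_check.py

If both workers print the same averaged value (0.50 with two workers), Horovod, NCCL, and your GPUs are communicating correctly.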
Step 6. Set Up SSH Between Servers for Horovod (Multi-Server Setup Only)
If you are using multiple servers, they need to communicate without passwords. You can skip this step if you are using a single server with multiple GPUs.
On your main server, create an SSH key with the following command:
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
Copy your public key to all other servers:
ssh-copy-id -i ~/.ssh/id_rsa.pub root@SERVER2_IP
ssh-copy-id -i ~/.ssh/id_rsa.pub root@SERVER3_IP
Add servers to known_hosts to avoid prompts with the command below:
ssh-keyscan -t rsa SERVER2_IP SERVER3_IP >> ~/.ssh/known_hosts
Test SSH access with:
ssh root@SERVER2_IP "nvidia-smi"
If this shows the GPUs on SERVER2, the SSH setup is complete.
You can create a hostfile listing all servers and their GPU counts:
nano hostfile
Add this based on your setup:
SERVER1_IP slots=4
SERVER2_IP slots=4
SERVER3_IP slots=4
This tells Horovod you have 4 GPUs on each of 3 servers (12 total).
Step 7. Modify PyTorch Training Code for Horovod
Horovod requires just six key changes to your training script. Here is an example adapted from the official Horovod documentation (MyModel stands in for your own model class):
Before setup (single GPU):
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# Build model
model = MyModel()
model.cuda()
optimizer = SGD(model.parameters(), lr=0.01)
# Data loader
train_dataset = datasets.MNIST('./data', train=True, download=True,
                               transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Training loop
for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        if batch_idx % 100 == 0:
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}]\tLoss: {loss.item():.6f}')
After setup (Multi-GPU with Horovod):
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
import horovod.torch as hvd
# ===== CHANGE 1: Initialize Horovod =====
hvd.init()
# ===== CHANGE 2: Pin GPU to local rank =====
torch.cuda.set_device(hvd.local_rank())
# Build model
model = MyModel()
model.cuda()
# ===== CHANGE 3: Scale learning rate by number of workers =====
optimizer = SGD(model.parameters(), lr=0.01 * hvd.size())
# ===== CHANGE 4: Wrap optimizer with DistributedOptimizer =====
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters()
)
# ===== CHANGE 5: Broadcast initial parameters and optimizer state =====
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
# Data loader
train_dataset = datasets.MNIST('./data', train=True, download=True,
                               transform=transforms.ToTensor())

# ===== CHANGE 6: Use DistributedSampler =====
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=hvd.size(),
    rank=hvd.rank()
)
train_loader = DataLoader(train_dataset, batch_size=64, sampler=train_sampler)

# Training loop
for epoch in range(100):
    # Important: set epoch for proper shuffling
    train_sampler.set_epoch(epoch)

    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        # Only rank 0 prints to avoid duplicate logs
        if batch_idx % 100 == 0 and hvd.rank() == 0:
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_sampler)}]\tLoss: {loss.item():.6f}')
The six code changes are:
- hvd.init(): Initialize Horovod and set up communication.
- torch.cuda.set_device(hvd.local_rank()): Pin each process to a different GPU.
- lr * hvd.size(): Scale learning rate for larger effective batch size.
- hvd.DistributedOptimizer(): Wrap optimizer to average gradients across all GPUs.
- hvd.broadcast_parameters() and hvd.broadcast_optimizer_state(): Ensure all workers start with identical weights.
- DistributedSampler: Split data so each GPU trains on different batches.
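One practical detail the example above leaves out: anything that writes to disk, such as checkpoints, should run on rank 0 only, so that many workers do not write the same file at once. Here is a minimal sketch you could add at the end of each epoch (the checkpoint filename pattern is just an example):

# Save checkpoints from rank 0 only; the other workers have identical weights
# after every allreduce step, so nothing is lost.
if hvd.rank() == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, f'checkpoint-epoch{epoch}.pt')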
Step 8. Run Horovod Distributed Training
At this point, you can start to run your distributed training.
For a single server with multiple GPUs, you can run:
horovodrun -np 4 -H localhost:4 python train.py
This launches training on 4 GPUs on your local server.
For multiple servers, you can use the “-H” flag and separate hosts with a comma:
horovodrun -np 12 -H SERVER1_IP:4,SERVER2_IP:4,SERVER3_IP:4 python train.py
Or you can use a hostfile:
horovodrun -np 12 --hostfile hostfile python train.py
During training, check GPU utilization:
watch -n 1 nvidia-smi
All GPUs should show high utilization; 70-95% is normal.
Advanced Communication and Performance Tuning for Horovod
As you add more GPUs, they spend a lot of time communicating with each other to sync up data. If this communication is inefficient, it can slow down your entire training process.
In this section, you will learn several performance optimizations for your Horovod setup: batching small updates together, reducing traffic between servers, and compressing data to save bandwidth. We will also show you how to use Horovod's timeline profiler to see exactly where the delays happen so you can remove them.
Enable Tensor Fusion: Horovod can batch small tensors together before communication. The command below sets the fusion threshold to 64 MB:
HOROVOD_FUSION_THRESHOLD=67108864 horovodrun -np 4 python train.py
Enable Hierarchical AllReduce: For setups with 8+ GPUs across multiple servers, you can use:
HOROVOD_HIERARCHICAL_ALLREDUCE=1 horovodrun -np 12 --hostfile hostfile python train.py
This reduces cross-server communication by doing local allreduce first.
Use Gradient Compression: Compress gradients to reduce network bandwidth with:
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    compression=hvd.Compression.fp16
)
Use Timeline Profiler: Profile where time is spent.
HOROVOD_TIMELINE=/tmp/timeline.json horovodrun -np 4 python train.py
Open the JSON file in Chrome at chrome://tracing to visualize where the time goes.
Fix Common Horovod Issues
Here are the most common issues in Horovod distributed training and their solutions:
1. “Horovod has not been built with PyTorch support”: You must reinstall with the correct flags:
pip uninstall horovod
HOROVOD_WITH_PYTORCH=1 HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod[pytorch]
2. “NCCL not found” or “GPU operations disabled”: Make sure NCCL development headers are installed:
sudo apt install libnccl-dev
Or point the build at your NCCL installation explicitly, using the paths reported by the find commands in Step 3 (with the NVIDIA repository packages, the header lives in /usr/include and the library in /usr/lib/x86_64-linux-gnu):
HOROVOD_NCCL_INCLUDE=/usr/include HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
3. SSH Permission Denied: Re-copy SSH keys and verify:
ssh-copy-id root@SERVER2_IP
ssh root@SERVER2_IP "hostname"
4. Training Hangs:
Exclude non-routed network interfaces, such as the loopback and Docker bridge, so MPI and NCCL do not try to communicate over them. The simplest way is to launch through mpirun directly:
mpirun -np 4 -H localhost:4 \
    -mca btl_tcp_if_exclude lo,docker0 \
    -x NCCL_SOCKET_IFNAME=^lo,docker0 \
    python train.py
Check Open MPI version (avoid 3.1.3):
mpirun --version
FAQs
How many GPUs do I need to use Horovod?
Start with 2 GPUs on one server to learn Horovod. For production, use 4+ GPUs across 2+ servers to see real speedup.
How much faster is training with Horovod?
With 4 GPUs and good networking, expect 3-3.5x speedup (90% efficiency). With 8 GPUs, expect 7-7.5x speedup.
Is Horovod Distributed Training free?
Yes, Horovod is completely open-source under the Apache 2.0 license.
Conclusion
Horovod makes distributed training accessible to everyone. You can scale from one GPU to hundreds with just a few code line changes. The ring-allreduce algorithm in Horovod ensures efficient communication without bottlenecks, and NCCL makes GPU-to-GPU transfers super fast.
With Horovod on PerLod’s bare-metal GPU servers, your team gains full control over infrastructure, achieves better scaling efficiency, and pays predictable monthly costs instead of surprise cloud bills.
We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest updates on GPU and AI hosting.
For further reading: