How to Build a Distributed Training Setup with Horovod
Training large AI models takes a significant amount of time and computing power. When you train on a single GPU, you might wait days or weeks to see results. Horovod solves this by letting you spread your training job across many GPUs and servers. This guide walks you through a complete Horovod distributed training setup.
You will learn how to install Horovod with MPI and NCCL, configure your PyTorch or TensorFlow training code, and run training across multiple GPUs and servers.
By the end of this guide, your team will have a scalable AI training platform that is faster, cheaper, and easier to manage than cloud alternatives.
Many teams struggle to find affordable, reliable GPU hosting for distributed training. PerLod Hosting solves this by providing dedicated GPU servers with transparent pricing and no oversubscription.
What is Horovod and How It Works?
Horovod is an open-source distributed training framework created by Uber for TensorFlow, Keras, PyTorch, and Apache MXNet.
It takes your single-GPU training script and scales it to many GPUs with minimal code changes.
Instead of using a central parameter server to coordinate training, Horovod uses the ring-allreduce algorithm, which means each GPU communicates only with its two neighbors in a ring, passing gradients around the circle.
To understand it better, imagine 4 GPUs in a circle. Each GPU:
- Sends its gradients to the GPU on the right.
- Receives gradients from the GPU on the left.
- Adds them together and passes them on.
After two passes around the ring (one to sum the gradients, one to share the result), every GPU holds the same averaged gradients. This is much faster than having every GPU talk to a central server.
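To make this concrete, below is a small Python sketch that simulates ring-allreduce on plain lists standing in for each GPU's gradients. It illustrates the communication pattern only; it is not Horovod's implementation, which runs on GPU tensors through NCCL or MPI.

# Toy simulation of ring-allreduce (illustration only -- Horovod does this
# with NCCL/MPI on GPU tensors, not Python lists).
def ring_allreduce(worker_grads):
    n = len(worker_grads)                  # number of "GPUs" in the ring
    size = len(worker_grads[0]) // n       # each gradient is split into n chunks
                                           # (assumes the length divides evenly)
    chunks = [[g[i * size:(i + 1) * size] for i in range(n)] for g in worker_grads]

    # Phase 1: reduce-scatter. In step s, worker r sends chunk (r - s) % n to its
    # right-hand neighbor, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        for r in range(n):
            idx, dst = (r - s) % n, (r + 1) % n
            chunks[dst][idx] = [a + b for a, b in zip(chunks[dst][idx], chunks[r][idx])]

    # Phase 2: allgather. In step s, worker r forwards the fully summed chunk
    # (r + 1 - s) % n to its right-hand neighbor, which overwrites its copy.
    for s in range(n - 1):
        for r in range(n):
            idx, dst = (r + 1 - s) % n, (r + 1) % n
            chunks[dst][idx] = chunks[r][idx]

    # Average and reassemble: every worker ends up with the same result.
    return [[x / n for chunk in c for x in chunk] for c in chunks]

# 4 "GPUs", each holding a different 4-element gradient.
grads = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
print(ring_allreduce(grads))   # four identical copies of [7.0, 8.0, 9.0, 10.0]

Note that each worker sends and receives only one chunk per step, so the traffic per GPU stays roughly constant as you add more GPUs; that property is what makes ring-allreduce scale.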
Key features of Horovod include:
- Ring-AllReduce: Efficient gradient averaging without parameter servers.
- Multi-Framework: Works with PyTorch, TensorFlow, Keras, and MXNet.
- Scalable: From 2 GPUs to hundreds of GPUs.
- Fault Tolerant: Recovers from GPU or node failures with Elastic mode.
- Easy Integration: Only 5-6 lines of code changes needed.
In Uber's published benchmarks, Horovod reached about 90% scaling efficiency across 128 servers with 4 GPUs each. In practice, this means doubling your GPUs comes close to doubling your training throughput, with only a small fraction lost to communication.
Prerequisites for Horovod Distributed Training Setup
Before you begin the Horovod distributed training setup, you need the following:
- Operating System: Ubuntu 22.04 or Ubuntu 24.04 LTS.
- Multiple GPUs: 2+ NVIDIA GPUs (can be on the same server or multiple servers).
- NVIDIA CUDA: CUDA 11.0 or newer (12.0+ recommended).
- MPI (Open MPI): Version 4.0+ (avoid 3.1.3, which has bugs).
- NCCL 2.2+: NVIDIA Collective Communications Library.
- Python: Python 3.8+.
- CMake: Version 3.13+.
PerLod Hosting offers dedicated GPU servers with multiple GPUs, full root access, and pre-installed Ubuntu OS. This makes them perfect for Horovod distributed training setups.
Step 1. Install Dependencies For Horovod Setup
The first step is to update your system and install the dependencies and required packages for Horovod setup.
Run the system update and upgrade with the command below:
sudo apt update && sudo apt upgrade -y
Install essential build tools and Open MPI with the following command:
sudo apt install build-essential cmake git libopenmpi-dev openmpi-bin -y
Note: Horovod with TensorFlow 2.10+ requires g++-8 or newer for C++17 support. The default g++ on Ubuntu 22.04 and 24.04 (version 11 or newer) already meets this requirement.
Check if MPI is installed correctly:
mpirun --version
You should see Open MPI version 4.0 or higher. Avoid Open MPI 3.1.3, which has a known hang issue.
Step 2. Verify GPU Setup for Horovod
You must verify that your GPUs are detected with the command below:
nvidia-smi
You should see a table listing all your GPUs, their memory, temperature, and driver version.
Also, check for the NVIDIA driver version with the following command:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
The driver should be version 450 or newer for CUDA 11, and 525 or newer if you are using CUDA 12.
Check CUDA version with:
nvcc --version
CUDA 11.0+ is required. CUDA 12.0+ is recommended for the latest GPUs.
Step 3. Install NCCL for Horovod Distributed Training
NCCL (NVIDIA Collective Communications Library) is what makes GPU-to-GPU communication fast. NCCL 2 introduced the ability to run ring-allreduce across multiple machines.
You can install it in one of two ways: via pip or via the NVIDIA repository.
Option A: Install via pip
Use the command below to install NCCL 2 from pip:
pip install nvidia-nccl-cu12
This installs NCCL 2.x for CUDA 12. For CUDA 11, you can use nvidia-nccl-cu11.
Option B: Install via NVIDIA Repository (Recommended for System-Wide)
Add the NVIDIA repository with the commands below (on Ubuntu 24.04, replace ubuntu2204 in the URL with ubuntu2404):
sudo wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
Then, install NCCL 2 with the following command:
sudo apt install libnccl2 libnccl-dev
Verify your installation by using the commands below:
find /usr -name "libnccl.so*" 2>/dev/null
find /usr -name "nccl.h" 2>/dev/null
Both files should be found. If nccl.h is missing, you only have the runtime, not the development headers needed for building Horovod.
Step 4. Create Horovod Python Environment and Install PyTorch
In this step, you must install PyTorch inside a Python virtual environment.
First, create a Python virtual environment and activate it with the commands below:
python3 -m venv horovod-env
source horovod-env/bin/activate
Upgrade pip with the command below:
pip install --upgrade pip setuptools wheel
Install PyTorch with CUDA support by using the following commands:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Finally, verify PyTorch sees your GPUs:
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'GPU count: {torch.cuda.device_count()}')"
Step 5. Install Horovod for Distributed Training
At this point, you need to install Horovod with PyTorch support and NCCL GPU operations. To do this, run the command below:
HOROVOD_WITH_PYTORCH=1 HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[pytorch]
For TensorFlow, you can use this command:
HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[tensorflow]
To build Horovod with both PyTorch and TensorFlow support, use this command:
HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_TENSORFLOW=1 HOROVOD_GPU_OPERATIONS=NCCL pip install horovod[all-frameworks]
This build takes several minutes because Horovod compiles native code.
To verify Horovod installation, you can run the command below:
horovodrun --check-build
Your output should look similar to this (the version number and available frameworks may differ):
Horovod v0.28.1:
Available Frameworks:
[X] TensorFlow
[X] PyTorch
[ ] MXNet
Available Controllers:
[X] MPI
[X] Gloo
Available Tensor Operations:
[X] NCCL
[ ] DDL
[ ] CCL
[X] MPI
[X] Gloo
Make sure [X] NCCL and [X] PyTorch or TensorFlow are enabled.
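Before converting a real training script, you can run a quick end-to-end sanity check. The script below is a minimal sketch written for this guide (the hvd_check.py filename is our own choice, not something Horovod ships); it averages each worker's rank across all workers.

# hvd_check.py -- minimal Horovod sanity check (example script for this guide)
import horovod.torch as hvd
import torch

hvd.init()                               # start Horovod
torch.cuda.set_device(hvd.local_rank())  # pin this process to one GPU

# Each worker contributes its rank; hvd.allreduce averages across workers by default.
x = torch.tensor([float(hvd.rank())], device='cuda')
avg = hvd.allreduce(x, name='sanity_check')
print(f'worker {hvd.rank()} of {hvd.size()}: averaged value = {avg.item():.2f}')

Run it on two GPUs with:

horovodrun -np 2 -H localhost:2 python hvd_check.py

If both workers print the same averaged value (0.50 with two workers), Horovod, NCCL, and your GPUs are communicating correctly.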
Step 6. Set Up SSH Between Servers for Horovod (Multi-Server Setup Only)
If you are using multiple servers, they need to communicate without passwords. You can skip this step if you are using a single server with multiple GPUs.
On your main server, create an SSH key with the following command:
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
Copy your public key to all other servers:
ssh-copy-id -i ~/.ssh/id_rsa.pub root@SERVER2_IP
ssh-copy-id -i ~/.ssh/id_rsa.pub root@SERVER3_IP
Add servers to known_hosts to avoid prompts with the command below:
ssh-keyscan -t rsa SERVER2_IP SERVER3_IP >> ~/.ssh/known_hosts
Test SSH access with:
ssh root@SERVER2_IP "nvidia-smi"
If this shows the GPUs on SERVER2, the SSH setup is complete.
You can create a hostfile listing all servers and their GPU counts:
nano hostfile
Add this based on your setup:
SERVER1_IP slots=4
SERVER2_IP slots=4
SERVER3_IP slots=4
This tells Horovod you have 4 GPUs on each of 3 servers (12 total).
Step 7. Modify PyTorch Training Code for Horovod
Horovod requires just six key changes to your training script. Here is an example adapted from the official Horovod documentation (MyModel stands in for your own model class):
Before setup (single GPU):
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# Build model
model = MyModel()
model.cuda()
optimizer = SGD(model.parameters(), lr=0.01)
# Data loader
train_dataset = datasets.MNIST('./data', train=True, download=True,
                               transform=transforms.ToTensor())
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Training loop
for epoch in range(100):
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        if batch_idx % 100 == 0:
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_loader.dataset)}]\tLoss: {loss.item():.6f}')
After setup (Multi-GPU with Horovod):
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
import horovod.torch as hvd
# ===== CHANGE 1: Initialize Horovod =====
hvd.init()
# ===== CHANGE 2: Pin GPU to local rank =====
torch.cuda.set_device(hvd.local_rank())
# Build model
model = MyModel()
model.cuda()
# ===== CHANGE 3: Scale learning rate by number of workers =====
optimizer = SGD(model.parameters(), lr=0.01 * hvd.size())
# ===== CHANGE 4: Wrap optimizer with DistributedOptimizer =====
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters()
)
# ===== CHANGE 5: Broadcast initial parameters and optimizer state =====
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
# Data loader
train_dataset = datasets.MNIST('./data', train=True, download=True,
                               transform=transforms.ToTensor())

# ===== CHANGE 6: Use DistributedSampler =====
train_sampler = DistributedSampler(
    train_dataset,
    num_replicas=hvd.size(),
    rank=hvd.rank()
)
train_loader = DataLoader(train_dataset, batch_size=64, sampler=train_sampler)

# Training loop
for epoch in range(100):
    # Important: set epoch for proper shuffling
    train_sampler.set_epoch(epoch)

    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

        # Only rank 0 prints to avoid duplicate logs
        if batch_idx % 100 == 0 and hvd.rank() == 0:
            print(f'Train Epoch: {epoch} [{batch_idx * len(data)}/{len(train_sampler)}]\tLoss: {loss.item():.6f}')
The six code changes are:
- hvd.init(): Initialize Horovod and set up communication.
- torch.cuda.set_device(hvd.local_rank()): Pin each process to a different GPU.
- lr * hvd.size(): Scale learning rate for larger effective batch size.
- hvd.DistributedOptimizer(): Wrap optimizer to average gradients across all GPUs.
- hvd.broadcast_parameters() and hvd.broadcast_optimizer_state(): Ensure all workers start with identical weights.
- DistributedSampler: Split data so each GPU trains on different batches.
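One practical detail the example above leaves out: anything that writes to disk, such as checkpoints, should run on rank 0 only, so that many workers do not write the same file at once. Here is a minimal sketch you could add at the end of each epoch (the checkpoint filename pattern is just an example):

# Save checkpoints from rank 0 only; the other workers have identical weights
# after every allreduce step, so nothing is lost.
if hvd.rank() == 0:
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, f'checkpoint-epoch{epoch}.pt')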
Step 8. Run Horovod Distributed Training
At this point, you can start to run your distributed training.
For a single server with multiple GPUs, you can run:
horovodrun -np 4 -H localhost:4 python train.py
This launches training on 4 GPUs on your local server.
For multiple servers, you can use the “-H” flag and separate hosts with a comma:
horovodrun -np 12 -H SERVER1_IP:4,SERVER2_IP:4,SERVER3_IP:4 python train.py
Or you can use a hostfile:
horovodrun -np 12 --hostfile hostfile python train.py
During training, check GPU utilization:
watch -n 1 nvidia-smi
All GPUs should show high utilization; 70-95% is normal.
Advanced Communication and Performance Tuning for Horovod
As you add more GPUs, they spend a lot of time communicating with each other to sync up data. If this communication is inefficient, it can slow down your entire training process.
In this section, you will learn several performance optimizations for your Horovod setup: batching small updates together, reducing traffic between servers, and compressing data to save bandwidth. We will also show you how to use Horovod's timeline profiler to see exactly where the delays happen so you can remove them.
Enable Tensor Fusion: Horovod can batch small tensors together before communication. The command below sets the fusion threshold to 64 MB:
HOROVOD_FUSION_THRESHOLD=67108864 horovodrun -np 4 python train.py
Enable Hierarchical AllReduce: For setups with 8+ GPUs across multiple servers, you can use:
HOROVOD_HIERARCHICAL_ALLREDUCE=1 horovodrun -np 12 --hostfile hostfile python train.py
This reduces cross-server communication by doing local allreduce first.
Use Gradient Compression: Compress gradients to reduce network bandwidth with:
optimizer = hvd.DistributedOptimizer(
    optimizer,
    named_parameters=model.named_parameters(),
    compression=hvd.Compression.fp16
)
Use Timeline Profiler: Profile where time is spent.
HOROVOD_TIMELINE=/tmp/timeline.json horovodrun -np 4 python train.py
Open the JSON file in Chrome at chrome://tracing to visualize where the time goes.
Fix Common Horovod Issues
Here are the most common issues in Horovod distributed training and their solutions:
1. “Horovod has not been built with PyTorch support”: You must reinstall with the correct flags:
pip uninstall horovod
HOROVOD_WITH_PYTORCH=1 HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod[pytorch]
2. “NCCL not found” or “GPU operations disabled”: Make sure NCCL development headers are installed:
sudo apt install libnccl-dev
Or point the build at your NCCL installation explicitly, using the paths reported by the find commands in Step 3 (with the NVIDIA repository packages, the header lives in /usr/include and the library in /usr/lib/x86_64-linux-gnu):
HOROVOD_NCCL_INCLUDE=/usr/include HOROVOD_NCCL_LIB=/usr/lib/x86_64-linux-gnu HOROVOD_GPU_OPERATIONS=NCCL pip install --no-cache-dir horovod
3. SSH Permission Denied: Re-copy SSH keys and verify:
ssh-copy-id root@SERVER2_IP
ssh root@SERVER2_IP "hostname"
4. Training Hangs:
Exclude non-routed network interfaces, such as the loopback and Docker bridge, so MPI and NCCL do not try to communicate over them. The simplest way is to launch through mpirun directly:
mpirun -np 4 -H localhost:4 \
    -mca btl_tcp_if_exclude lo,docker0 \
    -x NCCL_SOCKET_IFNAME=^lo,docker0 \
    python train.py
Check Open MPI version (avoid 3.1.3):
mpirun --version
FAQs
How many GPUs do I need to use Horovod?
Start with 2 GPUs on one server to learn Horovod. For production, use 4+ GPUs across 2+ servers to see real speedup.
How much faster is training with Horovod?
With 4 GPUs and good networking, expect 3-3.5x speedup (90% efficiency). With 8 GPUs, expect 7-7.5x speedup.
Is Horovod Distributed Training free?
Yes, Horovod is completely open-source under the Apache 2.0 license.
Conclusion
Horovod makes distributed training accessible to everyone. You can scale from one GPU to hundreds with just a few code line changes. The ring-allreduce algorithm in Horovod ensures efficient communication without bottlenecks, and NCCL makes GPU-to-GPU transfers super fast.
With Horovod on PerLod’s bare-metal GPU servers, your team gains full control over infrastructure, achieves better scaling efficiency, and pays predictable monthly costs instead of surprise cloud bills.
We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest updates on GPU and AI hosting.
For further reading: