Implement Robust Redundancy Models for GPU Hosting With Cross-Region Replication
GPU hosting for AI and machine learning workloads demands more than raw computational power; it requires high reliability. When training large models or serving real-time predictions, a single-region outage can cost hours of progress and thousands of dollars. Cross-region replication solves this by creating GPU redundancy clusters that keep your workloads running even when entire data centers fail.
In this guide, you will learn the best redundancy models specifically designed for GPU infrastructure on PerLod Hosting.
Understanding Redundancy Models for GPU Workloads
Before you deploy multi-region GPU hosting, it helps to understand the redundancy models available for GPU workloads.
GPU redundancy models include:
1. Active-Passive: The most cost-effective model for GPU hosting. Your primary region handles 100% of traffic while a secondary region remains on standby with minimal GPU instances. When a failure occurs, the secondary region scales up automatically.
This Active-Passive model is best for:
- Training workloads that can tolerate brief pause-and-resume cycles.
- Batch processing jobs.
- Development and staging environments.
2. Active-Active: Traffic is distributed across multiple regions simultaneously. Both regions run at full capacity, which provides near-instant failover and lower latency for global users.
This Active-Active model is best for:
- Real-time inference services.
- Customer-facing AI applications.
- Large-scale distributed training.
Essential Recovery Objectives in GPU Hosting
In GPU hosting, two key metrics measure your disaster recovery setup: Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
These metrics ensure your AI and ML workloads come back quickly with minimal loss. They guide how often you back up data and how fast you can switch to backup systems.
Recovery Time Objective (RTO): This tracks how long it takes to restart GPU workloads after a failure. For GPU training jobs, target under 15 minutes to avoid wasting compute hours. For real-time inference, aim for less than 1 minute to keep services running smoothly without user disruption.
Recovery Point Objective (RPO): This measures the amount of data or training progress you might lose in a failure. With regular checkpointing and cross-region replication, you can reach near-zero RPO for essential tasks, which means almost no lost work. Set backups every 5-15 minutes to hit this goal without huge storage costs.
GPU Redundancy Setup Architecture Components
To create a powerful GPU redundancy setup, you need the following building blocks:
- Kubernetes Clusters: Separate clusters in each region, each with GPU node pools.
- NVIDIA GPU Operator: Manages GPU drivers, device plugins, and container toolkit automatically.
- Storage Replication: Synchronizes model data, checkpoints, and datasets across regions.
- Backup System: Captures Kubernetes resources and persistent volumes.
- DNS Failover: Routes traffic to healthy regions automatically.
- Monitoring: Tracks replication lag, GPU health, and failover readiness.
Prerequisites To Create a Redundant GPU Cluster
Before starting, you must ensure you have these requirements:
- Two GPU Dedicated Servers in different regions.
- kubectl installed and configured on your local machine.
- Helm package manager.
- Access to S3-compatible storage or cloud object storage.
- Root access to your GPU servers.
- Basic knowledge of GPU node labels and Kubernetes scheduling.
Once you meet these requirements, proceed to the following steps to implement the setup.
Step 1. Deploy Primary and Secondary GPU Servers
In this guide, we assume you have ordered GPU Dedicated Servers from PerLod. For example, here are the primary and secondary GPU server configurations:
1. Primary GPU Server Configuration (US Region):
Location: United States (Los Angeles)
GPU: NVIDIA RTX 4090 (24GB) or A5000
CPU: AMD Ryzen 9 7950X or Intel Xeon
RAM: 64GB DDR5 ECC
Storage: 2TB NVMe SSD
OS: Ubuntu 22.04 LTS
Network: 10Gbps unmetered
Label: gpu-primary-us
2. Secondary GPU Server Configuration (Europe Region):
Location: Europe (Frankfurt)
GPU: NVIDIA RTX 4090 (24GB) or A5000
CPU: AMD Ryzen 9 7950X or Intel Xeon
RAM: 64GB DDR5 ECC
Storage: 2TB NVMe SSD
OS: Ubuntu 22.04 LTS
Network: 10Gbps unmetered
Label: gpu-secondary-eu
Configuration notes:
- NVMe storage: Offers roughly 10x the I/O throughput of SATA SSDs, which speeds up model loading and checkpointing.
- 10Gbps network: Enables fast cross-region replication.
- Unmetered bandwidth: No surprises in data transfer costs during replication.
After deployment, note the IP addresses from PerLod's control panel dashboard.
Step 2. Install Kubernetes on Both GPU Servers
At this point, you must deploy Kubernetes clusters on both primary and secondary GPU servers.
From your primary GPU server, install the containerd runtime with the commands below:
ssh root@primary-server-ip
apt update && apt install containerd -y
mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
systemctl restart containerd
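The kubeadm command in the next step is not included with containerd, so the Kubernetes packages need to be installed as well. Here is a minimal sketch for Ubuntu 22.04, assuming the upstream pkgs.k8s.io repository and Kubernetes v1.29 (adjust the minor version to your needs); run the containerd steps above and this block on both the primary and secondary servers:
# Add the upstream Kubernetes apt repository and install kubeadm, kubelet, and kubectl
apt install -y apt-transport-https ca-certificates curl gpg
mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /' > /etc/apt/sources.list.d/kubernetes.list
apt update && apt install -y kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl
# kubeadm requires swap to be disabled
swapoff -a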
Initialize the Kubernetes cluster on the primary GPU server with the command below:
kubeadm init --pod-network-cidr=10.244.0.0/16 --apiserver-advertise-address=primary-server-ip
Save the join command output for the secondary node.
Install the Calico network plugin by using the following command:
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.0/manifests/calico.yaml
Now, from the secondary GPU server, join the server to the cluster using the saved command:
ssh root@secondary-server-ip
kubeadm join primary-server-ip:6443 --token <token> --discovery-token-ca-cert-hash <hash>
Once you are done, from the primary server, label the secondary node as GPU-enabled with the commands below:
kubectl label node <secondary-node-name> gpu=true region=eu-frankfurt
kubectl label node <secondary-node-name> topology.kubernetes.io/zone=eu-frankfurt
Command Explanations:
- kubeadm: Standard Kubernetes tool for cluster bootstrapping.
- Calico: Provides high-performance networking and network policies.
- Node labels: Essential for scheduling GPU workloads to correct regions.
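Before moving on, you can quickly confirm that both nodes joined the cluster and carry the expected labels:
kubectl get nodes -o wide
kubectl get nodes -L gpu -L region -L topology.kubernetes.io/zone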
Step 3. Install NVIDIA GPU Operator on Both Servers
The GPU Operator automates GPU driver installation and management on your infrastructure.
On both GPU servers, you must install the GPU operator.
First, add the NVIDIA Helm repository with the commands below:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
Install the GPU Operator on the primary server with the commands below:
kubectl config use-context perlod-primary
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
Install the GPU Operator on the secondary server with the commands below:
kubectl config use-context perlod-secondary
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator
Flag explanations:
- --wait: Waits until all pods are ready before completing.
- -n gpu-operator --create-namespace: Creates a dedicated namespace for GPU components.
The GPU operator automatically detects your GPU hardware and installs the required drivers.
You can verify your GPU operator installation with the following command:
kubectl get pods -n gpu-operator
In your output, you should see pods for nvidia-driver-daemonset, nvidia-device-plugin-daemonset, and nvidia-container-toolkit-daemonset all in Running state.
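Optionally, run a short smoke test to confirm that pods can actually reach a GPU. This is a minimal sketch; the pod name is an example and the CUDA image tag is an assumption, so use whichever tag matches your driver version:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed tag; match your driver/CUDA version
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Wait for the pod to finish, then read its output; it should list the GPU model and driver version
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=180s
kubectl logs gpu-smoke-test
kubectl delete pod gpu-smoke-test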
Step 4. Configure Node Selectors for GPU Workloads
At this point, you must label your GPU nodes to ensure workloads land on the correct hardware.
You can label the GPU nodes in the PerLod primary region (US) with the commands below:
kubectl config use-context perlod-primary
kubectl label nodes -l node.kubernetes.io/instance-type=gpu-server gpu=true region=us-losangeles
kubectl label nodes -l node.kubernetes.io/instance-type=gpu-server topology.kubernetes.io/zone=us-losangeles
Also, label the GPU nodes in PerLod secondary region (EU) with the commands below:
kubectl config use-context perlod-secondary
kubectl label nodes -l node.kubernetes.io/instance-type=gpu-server gpu=true region=eu-frankfurt
kubectl label nodes -l node.kubernetes.io/instance-type=gpu-server topology.kubernetes.io/zone=eu-frankfurt
Command Explanation:
- kubectl label nodes: Applies key-value labels to nodes.
- -l node.kubernetes.io/instance-type=gpu-server: Selects PerLod GPU servers.
- gpu=true: Custom label for GPU scheduling.
- region and topology.kubernetes.io/zone: Standard Kubernetes topology labels for multi-region awareness.
Now you can use node selectors in your workload manifests. The nodeSelector ensures this pod only schedules on PerLod GPU nodes in the specified region:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-job
spec:
  nodeSelector:
    gpu: "true"
    region: us-losangeles
  containers:
  - name: training
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
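If you want a workload to prefer the primary region but still be able to schedule in the secondary region during a failover, preferred node affinity is an alternative to a hard nodeSelector. A sketch using the labels applied above (the pod name and image are placeholders reusing the earlier example):
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-flexible
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: only run on GPU nodes
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values: ["true"]
      # Soft preference: favor the primary region when it is healthy
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
          - key: region
            operator: In
            values: ["us-losangeles"]
  containers:
  - name: inference
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1
EOF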
Step 5. Set Up Cross-Region Storage Replication with Longhorn
Longhorn provides distributed block storage with backup capabilities across regions. First, add the Longhorn Helm repository on GPU servers with the commands below:
helm repo add longhorn https://charts.longhorn.io
helm repo update
Install Longhorn on the primary server with the following commands:
kubectl config use-context perlod-primary
helm install longhorn longhorn/longhorn \
--namespace longhorn-system --create-namespace
Install Longhorn on the secondary server with the commands below:
kubectl config use-context perlod-secondary
helm install longhorn longhorn/longhorn \
--namespace longhorn-system --create-namespace
Now, configure S3-compatible backup storage.
You must create a backup secret for the primary server with the commands below:
kubectl config use-context perlod-primary
kubectl create secret generic s3-backup-secret \
--from-literal=AWS_ACCESS_KEY_ID="your-perlod-access-key" \
--from-literal=AWS_SECRET_ACCESS_KEY="your-perlod-secret-key" \
--from-literal=AWS_ENDPOINTS="s3.perlod.com" \
-n longhorn-system
Configure backup target in Longhorn using PerLod’s object storage:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: longhorn-default-setting
  namespace: longhorn-system
data:
  backup-target: "s3://perlod-gpu-backups-us@us-losangeles/"
  backup-target-credential-secret: "s3-backup-secret"
EOF
- Longhorn creates replicated volumes within a PerLod cluster and sends asynchronous backups to PerLod's S3 storage.
- The backup target stores volume snapshots that can be restored in any PerLod region.
- AWS_ENDPOINTS points to PerLod’s S3-compatible storage infrastructure.
On the secondary server, configure the same backup target:
kubectl config use-context perlod-secondary
kubectl create secret generic s3-backup-secret \
--from-literal=AWS_ACCESS_KEY_ID="your-perlod-access-key" \
--from-literal=AWS_SECRET_ACCESS_KEY="your-perlod-secret-key" \
--from-literal=AWS_ENDPOINTS="s3.perlod.com" \
-n longhorn-system
The secondary server can now read backups from the primary region’s S3 bucket.
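The training Job in Step 8 mounts a PersistentVolumeClaim named longhorn-gpu-pvc, which is not created anywhere else in this guide. Here is a minimal sketch using the default "longhorn" StorageClass that the Helm chart installs; the 200Gi size is an assumption, so adjust it to your checkpoint and model volume:
kubectl config use-context perlod-primary
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-gpu-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 200Gi
EOF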
Step 6. Deploy Velero for Kubernetes Resource Backup
Velero backs up Kubernetes resources such as Deployments, ConfigMaps, Secrets, and persistent volumes to object storage.
Download and install the Velero CLI on both GPU servers with the commands below:
wget https://github.com/vmware-tanzu/velero/releases/download/v1.12.0/velero-v1.12.0-linux-amd64.tar.gz
tar -xvf velero-v1.12.0-linux-amd64.tar.gz
mv velero-v1.12.0-linux-amd64/velero /usr/local/bin/
Create an S3 bucket in the object storage via the control panel or API. PerLod provides S3-compatible storage accessible from all regions.
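Velero also needs credentials for that bucket at install time. A minimal sketch, assuming you save the same PerLod S3 keys used earlier into a local file named credentials-velero (the file name is arbitrary; it is passed to velero install through --secret-file below) in the standard AWS credentials format:
cat > credentials-velero <<EOF
[default]
aws_access_key_id = your-perlod-access-key
aws_secret_access_key = your-perlod-secret-key
EOF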
Install Velero on the primary server with the commands below:
kubectl config use-context perlod-primary
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.7.0 \
--bucket perlod-gpu-backups \
--secret-file ./credentials-velero \
--backup-location-config region=us-losangeles,s3ForcePathStyle="true",s3Url="https://s3.perlod.com" \
--snapshot-location-config region=us-losangeles \
--use-node-agent
Install Velero on the secondary server with the following command:
kubectl config use-context perlod-secondary
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.7.0 \
--bucket perlod-gpu-backups \
--secret-file ./credentials-velero \
--backup-location-config region=eu-frankfurt,s3ForcePathStyle="true",s3Url="https://s3.perlod.com" \
--snapshot-location-config region=eu-frankfurt \
--use-node-agent
Flag Explanation:
- velero install: Deploys Velero server components into Kubernetes on GPU servers.
- --plugins: Uses the AWS plugin for S3-compatible storage.
- --bucket: Shared bucket accessible from both regions.
- --secret-file: Supplies the S3 credentials file created above.
- s3Url (inside --backup-location-config): Points to the S3-compatible storage endpoint.
- --use-node-agent: Enables the node agent for file-system backups of persistent volumes.
Then, create a scheduled backup for GPU workloads:
kubectl config use-context perlod-primary
velero create schedule gpu-workload-backup \
--schedule="0 */6 * * *" \
--include-namespaces ml-workloads \
--include-resources deployments,statefulsets,pods,pvc,configmaps,secrets \
--ttl 72h0m0s
This backs up your ML namespace every 6 hours and retains each backup in object storage for 72 hours.
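You can confirm the schedule exists and trigger a first backup manually instead of waiting six hours (the backup name below is just an example):
velero schedule get
velero backup create gpu-workload-backup-initial --from-schedule gpu-workload-backup
velero backup get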
Step 7. Configure DNS Failover
Because this guide assumes you use PerLod's GPU servers, you can set up health checks and routing policies in PerLod's control panel to redirect traffic automatically.
Log in to the PerLod control panel, open DNS Management, and add your domain.
Configure primary A record with a health check like this:
Domain: ai-service.yourdomain.com
Type: A
Value: primary-server-ip
TTL: 60
Failover: Enabled
Health Check Type: HTTP
Health Check URL: http://primary-ip:8080/health
Failure Threshold: 3 checks
Check Interval: 30 seconds
Configure secondary A record (automatically used on failure):
Domain: ai-service.yourdomain.com
Type: A
Value: secondary-perlod-server-ip
TTL: 60
Failover: Enabled (Secondary)
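The health check assumes your service already answers on http://<server-ip>:8080/health. Before relying on failover, it is worth confirming from your workstation that both regions return HTTP 200; a quick check (reusing the placeholder IPs above):
for host in primary-server-ip secondary-perlod-server-ip; do
  curl -s -o /dev/null -w "$host -> %{http_code}\n" "http://$host:8080/health"
done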
Step 8. Implement Automated Failover for GPU Training Jobs
For training jobs, you can use a checkpointing sidecar that continuously saves progress to replicated storage:
apiVersion: v1
kind: ConfigMap
metadata:
  name: checkpoint-script
data:
  checkpoint.sh: |
    #!/bin/bash
    while true; do
      if [ -f /training/active ]; then
        # Sync to PerLod's S3 storage
        s3cmd sync /models/checkpoints/ s3://perlod-gpu-backups/checkpoints/
      fi
      sleep 300  # Sync every 5 minutes
    done
---
apiVersion: batch/v1
kind: Job
metadata:
  name: distributed-training
spec:
  template:
    spec:
      nodeSelector:
        gpu: "true"
        region: us-losangeles
      restartPolicy: OnFailure
      containers:
      - name: training
        image: pytorch/pytorch:latest
        command: ["python", "/workspace/train.py"]
        env:
        - name: CHECKPOINT_DIR
          value: "/models/checkpoints"
        resources:
          limits:
            nvidia.com/gpu: 4
        volumeMounts:
        - name: model-storage
          mountPath: /models
      - name: checkpoint-sync
        image: s3cmd/s3cmd:latest
        command: ["/bin/sh", "/scripts/checkpoint.sh"]
        volumeMounts:
        - name: scripts
          mountPath: /scripts
        - name: model-storage
          mountPath: /models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: longhorn-gpu-pvc
      - name: scripts
        configMap:
          name: checkpoint-script
The sidecar container continuously syncs checkpoints to S3 storage, which ensures minimal data loss during failover.
To validate your setup, you can simulate a region failure. Create a test deployment with the commands below:
kubectl config use-context perlod-primary
kubectl create deployment gpu-test --image=nginx --port=80
kubectl expose deployment gpu-test --type=LoadBalancer --name=gpu-test-service
Get the service endpoint with:
kubectl get svc gpu-test-service
Then, initiate failover by stopping the primary server via the control panel.
Watch DNS failover with the command below:
watch dig ai-service.yourdomain.com
Next, start the primary server again and restore from Velero backup in the secondary region:
kubectl config use-context perlod-secondary
velero restore create --from-backup gpu-workload-backup-20231201-120000
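Backup names include a timestamp, so list what actually exists before restoring, and check the restore status afterwards (replace the placeholder with the restore name Velero prints):
velero backup get
velero restore get
velero restore describe <restore-name> --details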
Best Practices for Production GPU Redundancy
Best practices turn your cross-region design into something you can trust in production. In this section, you fine-tune how PerLod handles scaling, backups, and monitoring so your GPUs stay online even during failures.
1. Optimize GPU Node Scaling: Configure Kubernetes Cluster Autoscaler to handle failover load.
PerLod supports automatic scaling via the control panel. You can enable auto-scaling for GPU servers with 1-10 nodes per region.
Configure the priority expander, which prioritizes scaling the secondary region during failover:
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*primary.*
    50:
      - .*secondary.*
EOF
2. Use PerLod’s One-Click Apps: Deploy GPU-optimized applications instantly. PerLod’s one-click automation saves hours of manual configuration.
3. Use PerLod’s Managed Backup Service: For critical workloads, enable PerLod’s managed backups.
4. Manage GPU Driver Compatibility: Pin the GPU Operator version to match the recommended CUDA and driver versions.
helm install gpu-operator nvidia/gpu-operator \
--version=v24.9.0 \
--set driver.version="550.90.07" \
--set toolkit.version="1.14.3"
5. Implement Checkpointing Strategy: For deep learning workloads, save checkpoints every 15 minutes.
# Add to your training script
import os
import torch

def save_checkpoint(model, optimizer, epoch, loss):
    checkpoint = {
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'loss': loss,
    }
    path = f'/models/checkpoints/ckpt_epoch_{epoch}.pth'
    torch.save(checkpoint, path)
    # Sync immediately to S3 storage
    os.system(f's3cmd put {path} s3://perlod-gpu-backups/checkpoints/')
6. Monitor Replication Lag: Set up alerts for critical metrics using PerLod’s monitoring.
Check Longhorn backup lag:
kubectl -n longhorn-system get backups.longhorn.io
Check Velero backup status:
velero backup describe gpu-workload-backup --details
Monitor GPU health:
kubectl get node -l gpu=true -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu'
Then, enable PerLod’s GPU monitoring alerts.
FAQs
What is cross-region redundancy for GPU hosting?
Cross-region redundancy means running your GPU workloads in more than one geographic region so they keep working even if one region goes down.
Should I use active-passive or active-active redundancy models for GPU workloads?
Use active-passive when you want to save costs and can tolerate short downtime. Use active-active when you need the lowest latency and almost no downtime.
How do I handle redundancy for inference services?
Deploy your inference service in multiple regions, put a global load balancer or smart DNS in front, and use health checks to route traffic away from unhealthy regions.
Conclusion
Using redundancy models for GPU hosting with cross-region replication turns your setup into a robust infrastructure. By combining PerLod’s global GPU servers with the NVIDIA GPU Operator, Longhorn for storage, and Velero backups to S3-compatible storage, you can build a system that remains operational even if a whole region goes down, with minimal data loss.
PerLod’s 24/7 support, fast server deployment, and high-speed NVMe storage provide a strong foundation for GPU redundancy.
We hope you found this guide helpful. Subscribe to our X and Facebook channels to get the latest updates and articles on GPU hosting.
