
AI Performance on Dedicated Servers vs Cloud
AI workloads require powerful computing resources to train models, process data, and deliver fast results. When choosing where to run AI applications, two common options are dedicated servers and cloud platforms. Each offers different advantages in terms of performance, scalability, and cost. This guide shows you how to compare AI Performance on Dedicated Servers vs Cloud so you can pick the right option.
In this guide from PerLod Hosting, the system is Linux (Ubuntu 22.04 LTS) with NVIDIA graphics cards, and the AI setup runs in Docker using PyTorch.
Defining AI Workloads and Metrics
Before you start testing, you need to know what kind of AI task you’re running. Choose the type of workload and identify the key metrics that truly matter for your use case.
A clear workload helps ensure fair comparisons between dedicated servers and the cloud.
Workload types: Each workload type represents a different real-world use case.
- Training: Supervised training, fine-tuning, LoRA, multi-GPU, or multi-node jobs.
- Batch inference: Throughput-oriented.
- Online inference: Latency and jitter-sensitive.
- Retrieval-augmented generation (RAG): Mixes GPU inference with fast CPU, storage, and network demands.
Determine which workload type accurately represents your actual deployment.
Primary metrics: Metrics are the numbers that help you quantify performance.
- Throughput
- Latency
- Time-to-train
- Utilization
- Scaling efficiency
- Stability
- Cost
- Energy
Tip: Pick your workload type, decide which metrics matter most for your use case, and write them down. This becomes your benchmark plan.
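For example, your benchmark plan can live in a small, version-controlled file. Below is a minimal Python sketch; the fields, model name, and target values are illustrative assumptions, not a required format:
# benchmark_plan.py - illustrative example of a written-down benchmark plan
benchmark_plan = {
    "workload": "online inference",          # training | batch inference | online inference | RAG
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "primary_metrics": ["p95 latency (ms)", "throughput (tokens/s)", "cost per 1M tokens"],
    "secondary_metrics": ["GPU utilization", "stability over 1 hour"],
    "targets": {"p95_latency_ms": 500, "min_tokens_per_s": 1000},  # example targets
}

if __name__ == "__main__":
    for key, value in benchmark_plan.items():
        print(f"{key}: {value}")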
Now, proceed to the following steps to prepare both environments and then start to benchmark them.

Set up an AI Environment in Dedicated Server and Cloud
At this point, we will prepare the software and drivers on both the dedicated server and the cloud. You must install the GPU driver, Docker, and the NVIDIA Container Toolkit so that your workloads run smoothly inside containers.
This will ensure all benchmarks run in clean and portable environments, unaffected by system differences.
First, install the GPU driver on the host. Ubuntu's ubuntu-drivers tool detects your GPU and installs the recommended driver. To install it, run the following commands:
sudo apt update
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, verify your GPU driver:
nvidia-smi
Next, install Docker and NVIDIA Container Toolkit. To install Docker, run the following commands:
curl -fsSL https://get.docker.com | sudo sh
sudo usermod -aG docker $USER
Open a new shell or re-login so your user is in the Docker group.
To set up the NVIDIA container toolkit, run the following commands:
distribution=$(. /etc/os-release; echo ${ID}${VERSION_ID})
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/${distribution}/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list > /dev/null
Run the system update and install the NVIDIA toolkit:
sudo apt update
sudo apt install nvidia-container-toolkit -y
sudo nvidia-ctk runtime configure --runtime=docker
Restart Docker:
sudo systemctl restart docker
For a quick check, you can run the CUDA sample in a container:
docker run --rm --gpus all nvidia/cuda:12.4.1-runtime-ubuntu22.04 nvidia-smi
If nvidia-smi inside the container matches the host and lists your GPUs, everything is OK.
Record Hardware Details for AI Benchmark Accuracy
Before you start testing, you must record your hardware details, including GPU model, CPU type, NUMA layout, and storage setup. This information helps explain performance differences later.
Knowing your system’s layout ensures you can interpret results correctly and reproduce your work.
To check GPU model, driver, ECC, power caps, and clocks, you can run:
nvidia-smi -q | less
Verify GPU topology like NVLink and PCIe distances:
nvidia-smi topo -m
Check the CPU type and NUMA topology (install numactl with sudo apt install numactl -y if it is missing):
lscpu
numactl --hardware
List storage and filesystems with:
lsblk -o NAME,MODEL,SIZE,ROTA,TYPE,MOUNTPOINT
df -hT
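To keep this information next to your benchmark results, you can collect the output of these commands into a single report. Here is a minimal Python sketch that does this; the report file name and the command list are assumptions you can adjust:
# collect_hw_report.py - save hardware details next to your benchmark results
import subprocess

COMMANDS = [
    "nvidia-smi -q",
    "nvidia-smi topo -m",
    "lscpu",
    "numactl --hardware",
    "lsblk -o NAME,MODEL,SIZE,ROTA,TYPE,MOUNTPOINT",
    "df -hT",
]

with open("hardware_report.txt", "w") as report:
    for cmd in COMMANDS:
        report.write(f"===== {cmd} =====\n")
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        report.write(result.stdout or result.stderr)
        report.write("\n")
print("Wrote hardware_report.txt")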
Running Microbenchmarks for AI Environment Performance | Quick PyTorch Test
At this point, you can confirm that your GPUs, drivers, and libraries are working correctly. You can run a simple PyTorch test and low-level math operations to make sure performance looks normal and no hardware bottlenecks exist.
It is a quick way to catch setup issues before you start heavier benchmarks.
Quick PyTorch test in Docker container:
docker run --rm -it --gpus all pytorch/pytorch:2.4.0-cuda12.1-cudnn8-runtime bash -lc "
python - << 'PY'
import torch, time
print('Torch:', torch.__version__, 'CUDA:', torch.version.cuda)
print('GPU available:', torch.cuda.is_available())
print('Device:', torch.cuda.get_device_name(0))
# Simple matmul benchmark
torch.cuda.synchronize()
a, b = torch.randn(8192,8192, device='cuda'), torch.randn(8192,8192, device='cuda')
for _ in range(3): torch.mm(a,b); torch.cuda.synchronize() # warmup
t0=time.time(); torch.mm(a,b); torch.cuda.synchronize(); t1=time.time()
print('Single 8192x8192 GEMM time (s):', round(t1-t0,4))
PY"
During the test, you can monitor utilization and power:
nvidia-smi dmon -s pucvmt
Compare LLM Inference Performance on Cloud vs Dedicated Server
At this point, you can benchmark a real large language model (LLM) server to measure how well each environment (Dedicated server and Cloud) handles real workloads.
Here we deploy an inference server using vLLM, send multiple requests, and record latency and throughput. This test simulates a chatbot or text-generation service under load.
Start vLLM server with:
docker run --rm --gpus all -p 8000:8000 \
-e HF_TOKEN=<your_token> \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--max-model-len 8192 \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9
For the HTTP load generator, you can install hey or use curl in a loop. hey is a command-line tool that sends many HTTP requests to a server for load testing. Here is an example with hey.
On a client, install hey:
sudo apt update && sudo apt install golang-go -y
go install github.com/rakyll/hey@latest
export PATH=$PATH:$(go env GOPATH)/bin
To benchmark an AI model such as Llama 3, run the following command. It sends 256 requests in total, with up to 16 requests in flight at the same time, and max_tokens limits the compute per request:
hey -n 256 -c 16 -m POST -H "Content-Type: application/json" \
-d '{
"model":"meta-llama/Meta-Llama-3-8B-Instruct",
"messages":[{"role":"user","content":"Summarize the key differences between SSD and HDD in 5 bullet points."}],
"max_tokens":128,
"temperature":0.2,
"stream":false
}' http://127.0.0.1:8000/v1/chat/completions
You must record RPS, latency, and server-side tokens/s from vLLM logs if available. Repeat with different concurrency (c) and max_tokens to see scaling and saturation.
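To make the concurrency sweep repeatable, you can drive hey from a small script and save one report per concurrency level. The sketch below assumes hey is on your PATH and that the vLLM server from the previous step is listening on port 8000; the sweep values and file names are examples:
# sweep_concurrency.py - rerun the hey benchmark at several concurrency levels
import json, subprocess

URL = "http://127.0.0.1:8000/v1/chat/completions"   # vLLM endpoint from the step above
BODY = json.dumps({
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Summarize SSD vs HDD in 5 bullet points."}],
    "max_tokens": 128,
    "temperature": 0.2,
    "stream": False,
})

for concurrency in (1, 4, 16, 32, 64):               # example sweep values
    cmd = [
        "hey", "-n", "256", "-c", str(concurrency), "-m", "POST",
        "-H", "Content-Type: application/json", "-d", BODY, URL,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    with open(f"hey_c{concurrency}.txt", "w") as f:
        f.write(result.stdout)
    print(f"c={concurrency}: report saved to hey_c{concurrency}.txt")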
Testing Image Generation for AI Environment Performance
In this step, you can check how each system handles image generation models, such as Stable Diffusion. You can run a diffusion model and measure how long it takes to generate images.
It helps compare how both environments perform with vision-based AI workloads that are GPU-intensive but different from text tasks.
Run inside a PyTorch container:
docker run --rm -it --gpus all -v $PWD:/work -w /work \
pytorch/pytorch:2.4.0-cuda12.1-cudnn8-devel bash -lc "
pip install --upgrade pip && pip install diffusers transformers accelerate xformers safetensors
python - << 'PY'
import torch, time
from diffusers import StableDiffusionPipeline
model = 'stabilityai/stable-diffusion-2-base' # choose a model you’re allowed to use
pipe = StableDiffusionPipeline.from_pretrained(model, torch_dtype=torch.float16).to('cuda')
prompt = 'a photorealistic cat astronaut, award-winning, 4k'
# Warmup
_ = pipe(prompt, num_inference_steps=30)
torch.cuda.synchronize()
t0=time.time()
_ = pipe(prompt, num_inference_steps=30)
torch.cuda.synchronize()
print('One SD image, 30 steps, seconds:', round(time.time()-t0,3))
PY"
You can change the number of inference steps and the batch size to compare throughput on the dedicated server vs. the cloud.
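If you prefer to automate that sweep, the following sketch runs inside the same container after the pip install step and times a few step-count and batch-size combinations; the values chosen are examples:
# sd_sweep.py - run inside the PyTorch container after installing diffusers
import time, torch
from diffusers import StableDiffusionPipeline

model = 'stabilityai/stable-diffusion-2-base'         # same example model as above
pipe = StableDiffusionPipeline.from_pretrained(model, torch_dtype=torch.float16).to('cuda')
prompt = 'a photorealistic cat astronaut, award-winning, 4k'

_ = pipe(prompt, num_inference_steps=10)              # warmup
for steps in (20, 30, 50):                            # example step counts
    for batch in (1, 2, 4):                           # images generated per call
        torch.cuda.synchronize()
        t0 = time.time()
        _ = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch)
        torch.cuda.synchronize()
        dt = time.time() - t0
        print(f'steps={steps} batch={batch} -> {batch/dt:.2f} images/s')
Larger batch sizes usually improve images per second until GPU memory or compute saturates, which is exactly the kind of difference you want to observe between the two environments.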
Compare AI Model Training Efficiency on Dedicated Servers vs Cloud GPUs
Here we will simulate AI model training using synthetic data. This removes dataset I/O bottlenecks and focuses on compute performance. By measuring samples per second and GPU utilization, you’ll see how well each setup scales across multiple GPUs for training jobs.
For single-node multi-GPU DDP with ResNet-50 on synthetic data:
docker run --rm -it --gpus all pytorch/pytorch:2.4.0-cuda12.1-cudnn8-devel bash -lc "
pip install --upgrade pip && pip install torchvision
python - << 'PY'
import os, time, torch, torch.distributed as dist, torch.multiprocessing as mp
import torch.nn as nn, torch.optim as optim
import torchvision.models as models

def train(rank, world_size, iters=200, batch=256):
    # Single-node rendezvous; for multi-node runs, point MASTER_ADDR at the rank-0 host.
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = models.resnet50().cuda(rank)
    model = nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    opt = optim.SGD(model.parameters(), lr=0.2, momentum=0.9)
    # Synthetic data keeps the benchmark compute-bound (no dataset I/O).
    x = torch.randn(batch, 3, 224, 224, device='cuda')
    y = torch.randint(0, 1000, (batch,), device='cuda')
    loss_fn = nn.CrossEntropyLoss().cuda(rank)
    # Warmup
    for _ in range(10):
        opt.zero_grad(set_to_none=True)
        loss_fn(model(x), y).backward(); opt.step()
    torch.cuda.synchronize()
    t0 = time.time()
    samples = 0
    for _ in range(iters):
        opt.zero_grad(set_to_none=True)
        loss_fn(model(x), y).backward(); opt.step()
        samples += batch
    torch.cuda.synchronize()
    dt = time.time() - t0
    sps = samples / dt  # per-GPU samples/s; multiply by world_size for the whole node
    if rank == 0:
        print(f'WorldSize={world_size} Iters={iters} Batch/GPU={batch} -> Samples/s per GPU: {sps:.1f}')
    dist.destroy_process_group()

if __name__ == '__main__':
    ws = torch.cuda.device_count()
    mp.spawn(train, args=(ws,), nprocs=ws, join=True)
PY"
Measure how many samples per second the system processes, and record GPU usage and temperatures during the test. Then, repeat the test using different batch sizes and numbers of GPUs.
Note: If you’re running the test on multiple machines, make sure they are connected with a fast network (for example, EFA or InfiniBand in the cloud, or 25/100 GbE or NVLink on dedicated servers). Also, set the environment variables MASTER_ADDR, MASTER_PORT, and NCCL_* correctly so that all machines can communicate properly.
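As an illustration, these are the kinds of variables each node needs before the training processes start; the address, port, and interface name below are placeholders for your own network:
# multinode_env.py - placeholder rendezvous settings, set before init_process_group
# (the same values can also be exported as shell environment variables)
import os

os.environ['MASTER_ADDR'] = '10.0.0.1'        # IP of the rank-0 node (example value)
os.environ['MASTER_PORT'] = '29500'           # any free TCP port reachable between nodes
os.environ['NCCL_SOCKET_IFNAME'] = 'eth0'     # NIC that NCCL should use (example value)
# Ranks: node 0 runs ranks 0..G-1, node 1 runs ranks G..2G-1, and world_size is the
# total number of GPUs across all nodes.
print({k: os.environ[k] for k in ('MASTER_ADDR', 'MASTER_PORT', 'NCCL_SOCKET_IFNAME')})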
Stability and Scaling Check in AI Environment
After running tests, you must verify system stability and scaling efficiency. Look for overheating, power throttling, CPU bottlenecks, or poor GPU-to-GPU communication. This ensures your results are trustworthy and that both systems are running at their full potential. While a benchmark runs, you can log power draw, clocks, and temperature once per second:
nvidia-smi --query-gpu=power.draw,power.limit,clocks.sm,clocks.max.sm,temperature.gpu --format=csv -l 1
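If you redirect that output to a file, a short script can flag throttling or overheating for you. A minimal sketch, assuming the columns from the command above were saved to a hypothetical gpu_log.csv and using example thresholds:
# check_throttle.py - scan a saved nvidia-smi CSV log for clock or thermal throttling
# Assumes the command above was redirected to gpu_log.csv (hypothetical file name).
import csv, re

def number(field):
    """Return the numeric part of a field such as '250.00 W' or '1410 MHz'."""
    match = re.search(r'[\d.]+', field)
    return float(match.group()) if match else 0.0

with open('gpu_log.csv') as f:
    for i, row in enumerate(csv.reader(f)):
        if len(row) < 5 or 'power.draw' in row[0]:
            continue  # skip header lines
        power, limit, sm, sm_max, temp = (number(x) for x in row[:5])
        # Example thresholds: SM clock below 90% of max, or GPU hotter than 85 C.
        if sm < 0.9 * sm_max or temp >= 85:
            print(f'row {i}: possible throttling (sm={sm:.0f}/{sm_max:.0f} MHz, '
                  f'power={power:.0f}/{limit:.0f} W, temp={temp:.0f} C)')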
Calculate AI Operating Costs and Expenses
In this step, you can translate performance results into money terms. You should calculate how much it costs to run inference or training on each setup, using real prices, energy use, and amortized hardware costs.
This helps you compare which option, cloud or dedicated server, is truly more cost-effective for your workload.
Inference Cost per 1 Million Tokens:
- C_h: Hourly cost of your machine (cloud cost or the average hourly cost for a dedicated server).
- R_toks: Number of tokens processed per second at your target speed or latency.
Formula:
$ per 1M tokens = C_h / (R_toks * 3600) * 1,000,000
Calculating Dedicated Server Hourly Cost:
- CapEx: Total server cost, including warranty.
- L: Planned service life of the server, in months.
- Hours: Total hours over that period (roughly L * 730).
- P_kW: Power used during operation in kilowatts.
- E_rate: Electricity cost per kilowatt-hour.
- Rack_mo: Monthly cost for hosting or rack space, if applicable.
Formulas:
Amortized hardware $/h = CapEx / Hours
Energy $/h = P_kW * E_rate
Colo $/h = Rack_mo / (30*24)
Total dedicated $/h = Amortized + Energy + Colo
Use the Total dedicated cost per hour (C_h) in the first formula to calculate your cost per 1 million tokens.
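As a worked example, the script below plugs placeholder numbers into these formulas; replace every value with your own measured throughput, prices, and power draw:
# cost_per_million_tokens.py - worked example of the formulas above (all values are placeholders)

# Dedicated server hourly cost
capex = 18000.0          # total server cost in $, including warranty (CapEx)
months = 36              # planned service life, L
hours = months * 730     # roughly 730 hours per month
p_kw = 1.2               # average power draw under load, in kW (P_kW)
e_rate = 0.15            # electricity cost in $/kWh (E_rate)
rack_mo = 120.0          # monthly colocation / rack fee in $ (Rack_mo)

amortized = capex / hours
energy = p_kw * e_rate
colo = rack_mo / (30 * 24)
dedicated_hourly = amortized + energy + colo   # C_h for the dedicated server

# Inference cost per 1 million tokens
r_toks = 2500.0          # measured tokens/s at your target latency (R_toks)

def cost_per_million_tokens(c_h, tokens_per_s):
    return c_h / (tokens_per_s * 3600) * 1_000_000

print(f'Dedicated $/h: {dedicated_hourly:.2f}')
print(f'Dedicated $/1M tokens: {cost_per_million_tokens(dedicated_hourly, r_toks):.3f}')
print(f'Cloud $/1M tokens at an example $4.50/h: {cost_per_million_tokens(4.50, r_toks):.3f}')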
Tip: If your GPUs are busy (over ~60% usage) and run for more than 12 hours a day, dedicated servers are usually cheaper. For smaller or occasional workloads, cloud servers are often more cost-effective.
Comparing AI Performance on Dedicated Servers vs Cloud
After benchmarking everything, you must decide which environment is best for your needs. You can score each environment based on performance, cost, flexibility, and operational factors.
Dedicated servers give you full control, consistent performance, and strong data security, but they have fixed capacity, take longer to set up, and require more management. Cloud servers, on the other hand, are quick to start, easy to scale, and come with managed services, though they can cost more over time and show more performance variability.
Tip: Use dedicated servers if you have steady workloads, strict data rules, or need custom hardware. Choose the cloud if you need flexibility, global access, or are still testing and scaling your AI projects.
FAQs
Why do AI benchmarks differ between cloud and dedicated servers?
Performance differences often come from hardware isolation and consistency. Dedicated servers provide direct, exclusive access to hardware, while cloud GPUs may share underlying resources or experience network and storage variability. Even with identical specs, latency and throughput can differ due to hypervisor overhead or noisy neighbors.
What metrics matter most for AI infrastructure decisions?
The most relevant metrics are:
- Latency and throughput (for inference).
- Samples/sec or time-to-train (for training).
- Scaling efficiency (multi-GPU or multi-node).
- Stability and utilization.
- Cost per token/sample/training hour.
- Energy and thermal performance.
Which AI workloads benefit most from dedicated servers?
Training and long-running inference jobs such as chatbots, RAG pipelines, or video generation usually benefit from dedicated servers.
Conclusion
When comparing AI performance on dedicated servers vs cloud, there’s no single winner. The best option depends on your workload patterns, scale, and budget. The most effective strategy for many AI teams is a hybrid approach:
Use cloud GPUs for development, research, and burst capacity, while dedicated servers handle steady training and production inference.
By following the benchmarking steps in this guide, you can make data-driven infrastructure decisions that optimize both AI performance and cost efficiency.
If you plan to power your AI projects, you can check out PerLod’s high-performance GPU dedicated servers, designed for maximum speed, stability, and flexibility.
We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest updates and tips.