How to Benchmark Your GPU for AI Training and Inference

Benchmarking your GPU shows how fast it really is for AI training and inference, using clear metrics such as samples per second and tokens per second. This guide walks you through running repeatable GPU benchmarks for both workloads.

When you have real benchmark numbers, it becomes much easier to choose between local hardware and cloud GPUs, plan capacity for new projects, and explain performance trade‑offs to your team or clients.

The goal is to help you run repeatable, scriptable benchmarks on any modern NVIDIA GPU and, with small changes, on AMD or Apple GPUs too. By the end, you will know how to measure training throughput with PyTorch and LLM inference throughput with vLLM, then log the results for later comparison.

This guide assumes a Linux server or workstation with a CUDA‑capable GPU and Python 3.10+.

If you prefer not to manage hardware yourself, you can run these same benchmarks on cloud GPUs from Perlod Hosting, using SSH access and the exact commands shown in this guide.

Prerequisites for GPU Benchmarks for AI Training and Inference

Before starting the benchmark, make sure these requirements are in place:

  • GPU drivers and CUDA: Install the latest NVIDIA driver and CUDA runtime that match your GPU and PyTorch version.
  • Python, PyTorch, and vLLM: Use a recent Python and a GPU‑enabled PyTorch build; vLLM needs a modern CUDA stack and is designed for high‑speed LLM inference.
  • Basic tools: Have nvidia-smi on your PATH to check GPU load and memory, and optionally the watch command for live updates.

You should also make sure your CUDA toolkit and cuDNN versions are supported by the PyTorch build you install.

Tip: For teams that want to test different GPU types without buying servers, AI hosting plans are a good option.

If you plan to test AMD or Apple GPUs later, check which backend applies first (ROCm for AMD, Metal/MPS for Apple Silicon) and keep a note of which backend you used for each test.
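
Once PyTorch is installed (see the setup below), a quick check like the sketch that follows can tell you which backend a machine exposes. It relies only on standard PyTorch attributes: torch.version.hip is set on ROCm builds, and torch.backends.mps reports Apple Silicon support on recent PyTorch versions.

python - << 'PY'
import torch

# Identify the GPU backend: CUDA (NVIDIA), ROCm (AMD), or MPS (Apple Silicon).
if torch.cuda.is_available():
    backend = "ROCm" if torch.version.hip else "CUDA"
    print(f"{backend} device:", torch.cuda.get_device_name(0))
elif torch.backends.mps.is_available():
    print("Apple MPS backend available")
else:
    print("No GPU backend detected; PyTorch will fall back to CPU")
PY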

An example setup for PyTorch and vLLM inside a Python virtual environment looks like this:

python -m venv venv
source venv/bin/activate

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install vllm

Check that PyTorch sees your GPU:

python - << 'PY'
import torch
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
PY

This quick check verifies PyTorch GPU access.

If the script prints CUDA available: False, check your driver installation, CUDA toolkit, and that you are inside the right Python environment before you continue.
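
If you need to dig deeper, the same environment can also report which CUDA and cuDNN versions your PyTorch build was compiled against. These are standard PyTorch attributes, so the check below should work on any recent build:

python - << 'PY'
import torch

# Print the versions PyTorch was built against, useful when matching drivers.
print("PyTorch:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
PY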

How to Measure AI GPU Performance?

This section walks through a step‑by‑step GPU benchmark that you can run on any compatible machine. You will check GPU health, measure training speed with PyTorch, measure LLM inference speed with vLLM, and log all results so you can compare different GPUs and settings over time.

Follow the steps in order, keep the same scripts and parameters for each test run, and you will build your own small, reliable benchmark suite for AI workloads.

If you later add more complex models or larger datasets, you can still use this same flow as a quick “smoke test” before running long production jobs.

Check GPU Health and Load

You must verify that the OS and drivers see your GPU and that it is idle before you run tests.

Check your NVIDIA drivers with the command below:

nvidia-smi

Watch GPU usage in real time with the following command:

watch -n 1 nvidia-smi

For more detailed live stats such as power draw, memory usage, and clock speeds, you can run:

nvidia-smi dmon

These commands are standard for monitoring GPU health and utilization on NVIDIA systems. Run them once before and during your benchmarks so you can spot throttling or memory issues.

If you notice the GPU is already busy or pinned close to 100% memory usage, stop background jobs or move the benchmark to a quieter machine; otherwise, your numbers will not be reliable.
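
If you want a record of GPU load during a run rather than just a live view, a small helper like the sketch below can sample nvidia-smi once per second and append the readings to a CSV file. The script name gpu_log.py and the sampling window are suggestions, not required steps; the query fields are standard nvidia-smi options.

# gpu_log.py -- sample nvidia-smi once per second and append readings to a CSV.
import subprocess
import time

FIELDS = "timestamp,utilization.gpu,memory.used,power.draw,temperature.gpu"

with open("gpu_log.csv", "a") as f:
    for _ in range(120):  # roughly two minutes of samples
        line = subprocess.run(
            ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        f.write(line)
        f.flush()
        time.sleep(1)

Run it in a second terminal while a benchmark executes and keep the resulting CSV next to your benchmark results.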

Benchmark Training Throughput with PyTorch

Next, measure how many samples per second your GPU can process in a simple training loop. This is not a full MLPerf run, but it gives you a clear relative score across GPUs.

Create a benchmark file named train_bench.py with this script:

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)

# Simple model and input to stress the GPU
batch_size = 1024
features = 4096

model = torch.nn.Sequential(
    torch.nn.Linear(features, features),
    torch.nn.ReLU(),
    torch.nn.Linear(features, features),
).to(device)

x = torch.randn(batch_size, features, device=device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Warmup to let CUDA and cuDNN tune kernels
for _ in range(10):
    optimizer.zero_grad(set_to_none=True)
    y = model(x)
    loss = y.sum()
    loss.backward()
    optimizer.step()

# Timed run (guard the CUDA sync so the script still works on a CPU fallback)
iters = 50
if device == "cuda":
    torch.cuda.synchronize()
start = time.time()
for _ in range(iters):
    optimizer.zero_grad(set_to_none=True)
    y = model(x)
    loss = y.sum()
    loss.backward()
    optimizer.step()
if device == "cuda":
    torch.cuda.synchronize()
end = time.time()

elapsed = end - start
samples = iters * batch_size
print(f"Elapsed: {elapsed:.2f} s")
print(f"Throughput: {samples / elapsed:.1f} samples/s")

Then run the script with the command below:

python train_bench.py

Record the samples per second value in a table so you can compare GPUs later.

To get a deeper picture, you can repeat the same script with a smaller and a larger batch size and see how throughput and GPU memory usage change.
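
One way to do that without editing the script by hand is a small sweep wrapper like the sketch below. It is a standalone variation on train_bench.py (the file name batch_sweep.py is only a suggestion) that reruns the same two-layer model at several batch sizes and also reports peak GPU memory:

# batch_sweep.py -- rerun the same layers at several batch sizes and report
# throughput plus peak GPU memory for each size.
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
features = 4096

model = torch.nn.Sequential(
    torch.nn.Linear(features, features),
    torch.nn.ReLU(),
    torch.nn.Linear(features, features),
).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

def bench(batch_size, warmup=10, iters=50):
    x = torch.randn(batch_size, features, device=device)
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
    start = None
    for step in range(warmup + iters):
        if step == warmup:
            if device == "cuda":
                torch.cuda.synchronize()
            start = time.time()
        optimizer.zero_grad(set_to_none=True)
        loss = model(x).sum()
        loss.backward()
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.time() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if device == "cuda" else 0.0
    return iters * batch_size / elapsed, peak_gb

for bs in (256, 1024, 4096):
    throughput, peak = bench(bs)
    print(f"batch={bs:5d}  {throughput:10.1f} samples/s  peak memory {peak:.2f} GB")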

While the script runs, keep nvidia-smi open in another terminal so you can confirm that the GPU is close to full utilization and not hitting memory limits.

Benchmark LLM Inference Throughput with vLLM

Now test how fast your GPU can serve an LLM, focusing on tokens per second and latency. vLLM has built‑in benchmark tools, so you don’t need to write your own load generator.

Start a vLLM server with your desired model:

python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--dtype bfloat16 \
--tensor-parallel-size 1

This launches an OpenAI‑compatible HTTP server powered by vLLM on your GPU. Keep this terminal open while you run the benchmark client.
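
Before you start the load test, it helps to send a single request and confirm the server is answering. A minimal check with the requests library (install it with pip install requests if needed), assuming the default vLLM port 8000 on localhost, looks like this:

# quick_check.py -- send one chat request to the local vLLM server.
# Assumes the server is listening on localhost:8000, the vLLM default.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])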

In a second terminal, run the vLLM benchmark client. The exact entrypoint can vary between vLLM releases; if the module below is not available in your installed version, the vLLM repository ships a benchmarks/benchmark_serving.py script with a similar set of flags:

python -m vllm.entrypoints.benchmark \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--backend vllm \
--num-prompts 80 \
--random-input-len 512 \
--random-output-len 128 \
--max-concurrency 4

This command prints metrics such as tokens per second, time to first token, and average latency, which are standard for LLM inference benchmarks.

If the model does not fit in your GPU memory or the server fails to start, switch to a smaller model or reduce sequence lengths so the benchmark can complete without out‑of‑memory errors.

You can also try a higher --max-concurrency value on larger GPUs to see how throughput scales as more requests are processed in parallel.

Log and Compare GPU Benchmark Results

To make your results useful, you should write them down in a clear, structured way. A small table for each test run lets you track how different GPUs and settings perform, so you can quickly see which setup gives you the best speed for your AI workloads.

Create a simple table for each GPU and test:

  • Training test: Batch size, model size (hidden units), throughput (samples/s), and GPU name.
  • Inference test: Model name, max concurrency, tokens per second, average latency, and GPU name.

Over time, you can compare GPUs and different settings using the same scripts.
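
A few lines of Python are enough to keep those rows consistent across runs. The sketch below appends one row per benchmark run; the file name results.csv and the column names are only suggestions, so adjust them to match the tables described above.

# log_result.py -- append one benchmark result per run to a shared CSV file.
import csv
import datetime
import os

def log_result(path, row):
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if new_file:
            writer.writeheader()
        writer.writerow(row)

log_result("results.csv", {
    "date": datetime.date.today().isoformat(),
    "gpu": "example-gpu-name",   # replace with the GPU you tested
    "test": "train_bench",       # or "vllm_inference"
    "batch_size": 1024,
    "throughput": 0.0,           # samples/s or tokens/s from your run
    "notes": "driver and CUDA versions",
})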

Tip: If you need stable, 24/7 performance for training and long inference runs, consider using GPU dedicated servers, then apply these benchmarks to compare different GPU models and memory sizes before you lock in a configuration.

FAQs

How long should a GPU benchmark run?

Short runs (30–60 seconds) are fine for quick checks, but longer runs give more stable numbers because clock speeds and temperatures have time to settle.

What metrics should I focus on during a GPU benchmark run?

For training, the key metric is usually samples per second at a fixed batch size and model, with total time to finish a full job. For inference, focus on tokens per second, time to first token, and how performance changes as you increase concurrency or batch size.

What is a standard AI benchmark?

MLPerf is the main open benchmark suite for training and inference, with detailed rules so results are comparable across systems. Running full MLPerf is more complex than this tutorial, but the basic ideas are the same.

Final Words

At this point, you have a repeatable set of GPU benchmarks for AI training and inference. Keep your benchmark scripts under version control so you can reuse them when you upgrade drivers, CUDA, or GPUs.

When you publish or share results, always note the exact software versions, batch sizes, and models, because small changes can shift throughput a lot.

We hope you enjoyed this guide. Subscribe to our X and Facebook channels to get the latest articles on AI hosting.
