Host Your PyTorch Model on a VPS: Full Tutorial for FastAPI & GPU/CPU Hosting

Deploying a PyTorch model for inference on a Virtual Private Server (VPS) lets you serve real-time predictions from anywhere. This tutorial walks you through a complete PyTorch model inference setup on a VPS.

In this guide, you will learn to set up a clean Ubuntu environment, build a Docker-based alternative, and optimize performance for real-world use. You’ll also learn how to productionize your deployment using FastAPI with Uvicorn or Gunicorn, manage it as a systemd service, and configure Nginx as a reverse proxy for secure and efficient request handling.

If you want a reliable and stable VPS environment to set up a PyTorch inference model, you can check PerLod Hosting.

Let’s dive into the details.

Complete Guide To PyTorch Model Inference Setup on VPS

This guide walks you through the complete process of setting up a production-ready PyTorch model on an Ubuntu VPS. We will package a model, using ResNet18 as the example, into a FastAPI web service, then place Nginx in front of it as a reverse proxy to handle public traffic securely. The application will listen on 127.0.0.1:8000, and Nginx will proxy requests to it from ports 80 and 443.

The setup is designed to work on a CPU by default, but includes optional steps to leverage an NVIDIA GPU for accelerated inference.

We assume that you have an Ubuntu 22.04 or 24.04 VPS with root or sudo access. Now, follow the steps below to complete the guide.

Prepare VPS for Deploying PyTorch Model Inference

The first step is to prepare your system. Run the system update and upgrade with the following command:

sudo apt update && sudo apt upgrade -y

Then, install the required packages on your server:

sudo apt install build-essential git curl wget ca-certificates pkg-config -y

Also, you must install Python and its dependencies with the command below:

sudo apt install python3 python3-venv python3-dev -y

Verify your Python installation by checking its version:

python3 --version

You can also create a dedicated user, which is optional:

sudo adduser --disabled-password --gecos "" inferuser
sudo usermod -aG sudo inferuser

Enabling GPU Acceleration with NVIDIA CUDA for PyTorch

You can skip this section if you are deploying on a CPU-only server. These steps are only necessary to enable GPU-accelerated inference for a performance boost.

First, you need to install the proper drivers for your NVIDIA GPU. You can identify available drivers with the command below:

ubuntu-drivers devices

This command will list the recommended driver for your GPU. For example, nvidia-driver-535. Install the recommended driver with the command below:

sudo apt install nvidia-driver-535 -y

Reboot your system to load the new drivers:

sudo reboot

After rebooting, log back in and run:

nvidia-smi

You should see a table with your GPU details, confirming that the driver is loaded and the GPU is ready.

Now, you must install a PyTorch version built to leverage your GPU’s CUDA cores. First, create and activate a Python virtual environment:

python3 -m venv ~/venvs/infer
source ~/venvs/infer/bin/activate

Upgrade your Pip:

pip install --upgrade pip

Then, install PyTorch and TorchVision with CUDA 12.1 support:

Note: This command is for CUDA 12.x. Ensure the version matches your CUDA driver from nvidia-smi.

pip install "torch==2.4." "torchvision==0.19." --index-url https://download.pytorch.org/whl/cu121

If you skipped the GPU steps and you are using CPU-only, use these commands instead:

python3 -m venv ~/venvs/infer
source ~/venvs/infer/bin/activate
pip install --upgrade pip
pip install "torch==2.4." "torchvision==0.19."

Finally, run the following script to confirm everything is working:

python - <<'PY'
import torch
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
PY

PyTorch Model Inference Project Structure and Setup

At this point, you must create a clean and organized project directory to hold your application code, model, and configuration files. First, create the main project folder and navigate into it with:

mkdir -p ~/apps/pytorch-infer && cd ~/apps/pytorch-infer

Then, create the following files and directories. This structure keeps the project modular and easy to manage.

pytorch-infer/
├── app.py            # Main FastAPI application
├── model.py          # PyTorch model loading & inference logic
├── requirements.txt  # Python dependencies
├── run.sh            # Script to launch the application
└── tests/sample.jpg  # Sample image for testing

The requirements.txt lists all the Python packages required to run the application:

fastapi==0.115.0
uvicorn[standard]==0.30.6
pillow==10.4.0
torch==2.4.*
torchvision==0.19.*

Note on GPU/CUDA: If you followed the optional GPU setup and already installed the CUDA-enabled torch and torchvision wheels, remove those two lines from this file so that pip install -r requirements.txt does not overwrite them with CPU-only builds.

Next, install requirements with the following command:

source ~/venvs/infer/bin/activate
pip install -r requirements.txt

Minimal PyTorch Model Inference Code

Here is a streamlined model class (model.py) for loading ResNet18 and running predictions. You can replace it with your model as needed.

import torch
import torchvision.models as models
import torchvision.transforms as T

from PIL import Image
from typing import Tuple

class ResNet18Classifier:
    def __init__(self, device: str | None = None):
        self.device = device or ("cuda:0" if torch.cuda.is_available() else "cpu")
        self.model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.model.eval().to(self.device)

        # Preprocessing pipeline
        self.preprocess = models.ResNet18_Weights.DEFAULT.transforms()

        # Optional performance tweaks
        torch.set_grad_enabled(False)
        if torch.cuda.is_available():
            torch.backends.cudnn.benchmark = True

    def predict(self, img: Image.Image) -> Tuple[int, float, str]:
        x = self.preprocess(img).unsqueeze(0).to(self.device, non_blocking=True)
        with torch.inference_mode():
            logits = self.model(x)
            probs = torch.softmax(logits, dim=1)
            conf, idx = torch.max(probs, dim=1)
        # Human-readable label
        label = models.ResNet18_Weights.DEFAULT.meta["categories"][idx.item()]
        return idx.item(), conf.item(), label
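Before wiring this into the API, you can sanity-check model.py from an interactive Python session in the project directory (with the virtual environment active and a sample image already placed at tests/sample.jpg). A minimal sketch:

# Quick local test of model.py (run from ~/apps/pytorch-infer with the venv active)
from PIL import Image

from model import ResNet18Classifier

clf = ResNet18Classifier()
img = Image.open("tests/sample.jpg").convert("RGB")
idx, conf, label = clf.predict(img)
print(f"class={idx} confidence={conf:.3f} label={label}")

If this prints a sensible label for your test image, the model and preprocessing are working and the remaining steps are purely about serving it.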

Building the Model Inference API: FastAPI App

At this point, you can create the web server that exposes your PyTorch model as a REST API endpoint.

The app.py:

from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.responses import JSONResponse
from PIL import Image
import io

from model import ResNet18Classifier

app = FastAPI(title="PyTorch Inference API", version="1.0")

# Model is loaded once per worker process at import time
classifier = ResNet18Classifier()

@app.get("/health")
def health():
    return {"status": "ok", "device": classifier.device}

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    try:
        content = await file.read()
        img = Image.open(io.BytesIO(content)).convert("RGB")
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Invalid image: {e}")

    idx, conf, label = classifier.predict(img)
    return JSONResponse({"class_index": idx, "confidence": conf, "label": label})

Running and Testing the API

At this point, you can build the run.sh:

#!/usr/bin/env bash
set -euo pipefail
source ~/venvs/infer/bin/activate
exec uvicorn app:app --host 127.0.0.1 --port 8000 --workers 1

Then, you can start the server with the following commands:

chmod +x run.sh
./run.sh

Test health endpoint with:

curl -s http://127.0.0.1:8000/health | jq

Test prediction endpoint with:

curl -s -X POST \
  -F "file=@tests/sample.jpg" \
  http://127.0.0.1:8000/predict | jq
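If you prefer testing from Python instead of curl, a minimal client sketch, assuming the requests library is installed (pip install requests), looks like this:

# Simple Python client for the /predict endpoint (assumes `pip install requests`)
import requests

with open("tests/sample.jpg", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8000/predict",
        files={"file": ("sample.jpg", f, "image/jpeg")},
    )

resp.raise_for_status()
print(resp.json())  # {"class_index": ..., "confidence": ..., "label": ...}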

Worker Configuration Notes:

  • Single GPU: Use --workers 1
  • CPU only: Tune with --workers 2-4 based on available cores (see the run.sh variant below)
  • Multi-GPU: Use one process per GPU (see the scaling section below)
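For the CPU-only case mentioned above, one way to derive the worker count from the available cores is sketched below; treat the cores/2 heuristic as a starting point and tune it under real load:

#!/usr/bin/env bash
# CPU-only run.sh variant: start with half the available cores as workers (at least 1)
set -euo pipefail
source ~/venvs/infer/bin/activate
WORKERS=$(( $(nproc) / 2 ))
if [ "$WORKERS" -lt 1 ]; then WORKERS=1; fi
exec uvicorn app:app --host 127.0.0.1 --port 8000 --workers "$WORKERS"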

Set up Systemd PyTorch Inference Service for Production

Next, create a systemd service so the API restarts automatically and can be managed like any other system process. To do this, run:

sudo tee /etc/systemd/system/pytorch-infer.service >/dev/null <<'UNIT'
[Unit]
Description=PyTorch Inference API
After=network.target

[Service]
User=inferuser
Group=inferuser
WorkingDirectory=/home/inferuser/apps/pytorch-infer
Environment="PATH=/home/inferuser/venvs/infer/bin"
ExecStart=/home/inferuser/apps/pytorch-infer/run.sh
Restart=always
RestartSec=3
StandardOutput=journal
StandardError=journal
# For GPU: uncomment if you need nvidia libs on PATH
# Environment="LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda/lib64"

[Install]
WantedBy=multi-user.target
UNIT

Then, start and enable the service with:

sudo systemctl daemon-reload
sudo systemctl enable --now pytorch-infer

Check your service status with:

sudo systemctl status pytorch-infer --no-pager

Configure Nginx as a Reverse Proxy for PyTorch Model Inference

Nginx acts as the public-facing web server that forwards requests to your FastAPI application, improving security and performance and making it easy to enable HTTPS. First, install Nginx with the command below:

sudo apt install nginx -y

Then, create a configuration file with:

sudo nano /etc/nginx/sites-available/pytorch-infer

Add the following content to it:

server {
    listen 80;
    server_name _;

    client_max_body_size 20M;

    location / {
        proxy_pass         http://127.0.0.1:8000;
        proxy_http_version 1.1;
        proxy_set_header   Host $host;
        proxy_set_header   X-Real-IP $remote_addr;
        proxy_set_header   X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header   X-Forwarded-Proto $scheme;
        proxy_read_timeout 300;
    }
}

Once you are done, enable the configuration and test it:

sudo ln -s /etc/nginx/sites-available/pytorch-infer /etc/nginx/sites-enabled/pytorch-infer
sudo nginx -t
sudo systemctl reload nginx

For production with a real domain, set server_name to your domain in the Nginx config and enable HTTPS (TLS with Let’s Encrypt):

sudo apt install certbot python3-certbot-nginx -y
sudo certbot --nginx -d api.example.com

Key benefits of using Nginx as a reverse proxy:

  • Security: Nginx handles public traffic; your app runs locally.
  • Performance: Static file serving and load balancing.
  • Reliability: Better timeout and request handling.
  • HTTPS: Easy SSL/TLS setup with Certbot.

Setting Up UFW Firewall Rules

Also, you must configure the UFW to secure your server by allowing only necessary network traffic. Install UFW with:

sudo apt install ufw -y

Then, allow the required ports and services and enable the UFW:

sudo ufw allow OpenSSH
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp # if using TLS
sudo ufw enable

Check your UFW firewall rules with:

sudo ufw status

Scaling Strategies for PyTorch Inference Model

This step will show you how to scale your inference service for maximum performance across different hardware configurations, from multi-core CPUs to single and multi-GPU setups.

For CPU Scaling (No GPU):

  • Increase Uvicorn workers: --workers N (start with cores/2, up to the number of cores).
  • Keep one model per process (Uvicorn workers handle this automatically).

For Single GPU:

  • Use --workers 1 to avoid GPU memory contention.
  • If CPU preprocessing becomes the bottleneck, either move preprocessing to a thread pool while keeping a single worker, or run a second worker and ensure only one of them loads the model onto the GPU.

For Multi-GPU Setup:

  • Run one service instance per GPU using CUDA_VISIBLE_DEVICES in separate processes, then load balance via Nginx.

Example override for GPU0:

sudo systemctl edit pytorch-infer

Add this:

[Service]
Environment="CUDA_VISIBLE_DEVICES=0"

Repeat for GPU1 with a second unit listening on a different port (for example, 8001), then load balance across the instances in Nginx as shown below.
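To spread traffic across the per-GPU instances, point Nginx at an upstream group instead of a single port. A sketch, assuming the GPU0 service keeps 127.0.0.1:8000 and the GPU1 unit listens on 127.0.0.1:8001:

upstream pytorch_infer {
    least_conn;                # route each request to the least busy instance
    server 127.0.0.1:8000;     # instance pinned to GPU0
    server 127.0.0.1:8001;     # instance pinned to GPU1
}

server {
    listen 80;
    server_name _;
    client_max_body_size 20M;

    location / {
        proxy_pass http://pytorch_infer;
        # ... same proxy_set_header and timeout lines as in the earlier config ...
    }
}

With least_conn, a long-running prediction on one GPU does not stop new requests from being routed to the other instance.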

Optimizing PyTorch Inference Performance Safely and Effectively

To ensure your PyTorch model runs as efficiently as possible in production, you can apply several safe and verifiable optimization techniques. These methods improve inference speed, reduce latency, and make better use of your hardware without changing the model’s predictions.

Performance Tips To Consider:

  • Use inference mode (already in code): torch.inference_mode().
  • cudnn.benchmark for static-sized inputs (already set on GPU).
  • Batching: consider accepting multiple images to increase GPU utilization.
  • Channels-last tensor format for CNNs on GPU:
self.model = self.model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)
  • Half precision on GPU (Amp) if the model supports it:
with torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = self.model(x)
  • JIT/TorchScript (stable across 1.x/2.x):
example = torch.randn(1, 3, 224, 224).to(classifier.device)
scripted = torch.jit.trace(classifier.model, example).eval()
classifier.model = scripted
  • Pin threads on CPU:
import os, torch
torch.set_num_threads(max(1, (os.cpu_count() or 2) // 2))
  • Warmup: run a few dummy inferences at startup so the first real request is not slow (see the sketch below).
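As an illustration of the last tip, here is a minimal warmup sketch that assumes the ResNet18Classifier from model.py; you could call it once right after creating the classifier in app.py:

# Warmup: run a few dummy forward passes so CUDA kernels, cuDNN autotuning,
# and memory pools are initialized before the first real request arrives.
import torch

def warmup(classifier, runs: int = 3):
    dummy = torch.randn(1, 3, 224, 224, device=classifier.device)
    with torch.inference_mode():
        for _ in range(runs):
            classifier.model(dummy)
    if classifier.device.startswith("cuda"):
        torch.cuda.synchronize()

# Example: warmup(classifier) right after the classifier is created in app.py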

Health Check and Monitoring PyTorch Model Server

Once your model server is running, it’s important to keep track of its health and performance. Adding a /health endpoint, monitoring logs, and checking Nginx access logs help ensure your system is running smoothly and make debugging easier when issues arise.

For journald logs:

sudo journalctl -u pytorch-infer -f

For Nginx access logs:

sudo tail -f /var/log/nginx/access.log /var/log/nginx/error.log

Benchmarking PyTorch Inference API

Before going live, it’s a good idea to benchmark your setup. Simple tools like hey can simulate user traffic and measure how fast your API responds under load. This helps you identify bottlenecks and confirm that your VPS and model can handle real-world usage.

Install a load tool (hey) with:

sudo apt install golang-go -y
go install github.com/rakyll/hey@latest
sudo cp ~/go/bin/hey /usr/local/bin/

For a lightweight health check, run:

hey -n 2000 -c 50 http://127.0.0.1/health

Benchmarking /predict is less straightforward because it requires a multipart file upload. hey can replay a raw request body from a file with -D and set the Content-Type with -T, but the multipart boundary in the header must match the body exactly, so hand-building that payload is error-prone.

For realistic tests, you can write a small script to POST images in a loop.
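For example, a small script along those lines, assuming the requests library is installed (pip install requests), sends the sample image in a loop and reports latency percentiles:

# bench_predict.py - rough latency check for /predict (assumes `pip install requests`)
import statistics
import time

import requests

URL = "http://127.0.0.1/predict"   # goes through Nginx; use :8000 to hit Uvicorn directly
N = 100

with open("tests/sample.jpg", "rb") as f:
    payload = f.read()

latencies = []
for _ in range(N):
    start = time.perf_counter()
    r = requests.post(URL, files={"file": ("sample.jpg", payload, "image/jpeg")})
    r.raise_for_status()
    latencies.append(time.perf_counter() - start)

latencies.sort()
print(f"requests: {N}")
print(f"mean: {statistics.mean(latencies) * 1000:.1f} ms")
print(f"p50:  {latencies[int(N * 0.50)] * 1000:.1f} ms")
print(f"p95:  {latencies[int(N * 0.95)] * 1000:.1f} ms")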

Containerizing PyTorch Inference Server with Docker

Docker provides an easy, consistent way to package and deploy your PyTorch inference server. It simplifies environment setup, ensures reproducibility, and allows you to run your model in isolated containers, whether on CPU or GPU, with just a few commands.

Create a Dockerfile in the project directory:

FROM python:3.11-slim

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential wget curl && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt /app/

# Install dependencies from requirements.txt (for a smaller CPU-only image,
# point pip at the CPU wheel index: https://download.pytorch.org/whl/cpu)
RUN pip install -U pip && pip install -r requirements.txt

COPY . /app
EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

Build and run the container:

docker build -t pytorch-infer:cpu .
docker run -d --name infer -p 127.0.0.1:8000:8000 pytorch-infer:cpu

Note: Running GPU workloads in Docker requires the NVIDIA Container Toolkit:

# Install NVIDIA container toolkit (Ubuntu)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update && sudo apt -y install nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Build a GPU image (see the Dockerfile.gpu sketch below), then run it with GPU access
docker build -f Dockerfile.gpu -t pytorch-infer:gpu .
docker run -d --gpus all -p 127.0.0.1:8000:8000 pytorch-infer:gpu
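For reference, the GPU image could look like the sketch below. The Dockerfile.gpu filename, the CUDA 12.1 base image tag, and keeping the torch/torchvision pins out of requirements.txt for this build are assumptions you should adapt to your own driver and setup:

# Dockerfile.gpu - sketch of a CUDA-enabled image (assumes the host driver supports CUDA 12.1)
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt /app/

# CUDA wheels for torch/torchvision first, then the remaining requirements
RUN pip3 install -U pip && \
    pip3 install "torch==2.4.*" "torchvision==0.19.*" --index-url https://download.pytorch.org/whl/cu121 && \
    pip3 install -r requirements.txt

COPY . /app
EXPOSE 8000

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]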

Protecting PyTorch API in Production

Security is essential when exposing your model to the internet. Restricting network access, adding API keys, limiting request sizes, and keeping your software updated will protect your VPS and ensure your model is only accessible to authorized users.

Add a simple API key check to the /predict route in app.py:

from fastapi import Header

@app.post("/predict")
async def predict(file: UploadFile = File(...), x_api_key: str | None = Header(None)):
    if x_api_key != "YOUR_SECRET":
        raise HTTPException(status_code=401, detail="Unauthorized")
    # ... existing prediction logic from the original /predict handler ...
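Clients must then send the key in the x-api-key header; for example:

curl -s -X POST \
  -H "x-api-key: YOUR_SECRET" \
  -F "file=@tests/sample.jpg" \
  http://127.0.0.1:8000/predict | jq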

Deploying Your Custom PyTorch Model

Once the base setup works, you can easily replace the example model with your own. Just upload your trained weights, adjust the model definition and preprocessing steps, and make sure the input/output formats match your use case.

You can replace ResNet18Classifier with your model:

  • Copy your .pt/.pth weights to the project folder.
  • In model.py, define your architecture and load_state_dict.
  • Ensure preprocessing matches training.
  • Adjust the return payload.

For example:

self.model.load_state_dict(torch.load("weights.pth", map_location=self.device))
self.model.eval().to(self.device)
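Putting these pieces together, a custom classifier in model.py might look like the sketch below. It assumes a ResNet18 backbone fine-tuned on your own classes; the class names, the weights.pth file, and the reuse of the ImageNet-style preprocessing are placeholders to replace with whatever matches your training:

import torch
import torchvision.models as models
from PIL import Image

class MyClassifier:
    # Placeholder class names; replace with the labels your model was trained on
    CLASSES = ["cat", "dog", "other"]

    def __init__(self, weights_path: str = "weights.pth", device: str | None = None):
        self.device = device or ("cuda:0" if torch.cuda.is_available() else "cpu")
        # Same architecture as training, with the head resized to your classes
        self.model = models.resnet18(weights=None, num_classes=len(self.CLASSES))
        self.model.load_state_dict(torch.load(weights_path, map_location=self.device))
        self.model.eval().to(self.device)
        # Reuse the ImageNet-style preprocessing; swap in your own if training differed
        self.preprocess = models.ResNet18_Weights.DEFAULT.transforms()

    def predict(self, img: Image.Image):
        x = self.preprocess(img).unsqueeze(0).to(self.device)
        with torch.inference_mode():
            probs = torch.softmax(self.model(x), dim=1)
            conf, idx = torch.max(probs, dim=1)
        return idx.item(), conf.item(), self.CLASSES[idx.item()]

In app.py, you would then import and instantiate MyClassifier instead of ResNet18Classifier; the /predict route itself does not need to change.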

FAQs

Do I need a GPU to run PyTorch inference on a VPS?

No. PyTorch supports both CPU and GPU inference. If your model is lightweight or your traffic volume is moderate, CPU-only VPS instances perform well. For heavy workloads or deep learning models, choose a VPS with NVIDIA GPU support.

How much RAM and CPU do I need for PyTorch inference?

This depends on your model size and concurrency. For small models like ResNet18, 2–4 GB RAM and 2 vCPUs are enough.

Can I use Docker for PyTorch deployment?

Yes. Docker provides consistent environments and easier scaling. The tutorial includes both native and Docker-based setups.

Conclusion

PyTorch model inference setup on VPS gives you full control, flexibility, and cost efficiency. With the combination of FastAPI, Uvicorn, and systemd, you can create a reliable, production-grade inference API that runs smoothly on CPU or GPU environments.

If you’re looking for high-performance and affordable infrastructure to host your deep learning models, PerLod VPS Hosting Plans provides optimized VPS instances for AI workloads.

We hope you enjoyed this guide. Follow our X and Facebook channels to get the latest updates and articles.

For further reading:

AI-Driven VPS Resource Optimization Guide

Dedicated Server Vs Cloud: Which one is better for AI Workloads?

AI Automation in the Hosting Industry
