Containerized AI Serving with Docker on Bare Metal
Containerized AI serving with Docker means packaging an AI model and all its dependencies into a Docker container so it can run consistently on any machine. The container exposes an API so applications can send data to the model and get predictions. Deploying AI Serving with Docker on a Dedicated Server requires a powerful, scalable, and secure environment.
This guide will teach you how to set up a fresh Ubuntu server as a powerful AI inference platform, which is capable of serving both Large Language Models (LLMs) and multi-framework vision models using industry-standard tools like vLLM and NVIDIA Triton.
We will also cover both GPU-accelerated and CPU-only setups.
If you are looking for pre-configured hosting solutions optimized for AI workloads, explore our services at PerLod Hosting.
Architecture of AI Serving with Docker on a Dedicated Server
Building an AI server needs a strong foundation, the right tools, and a clear design. In this guide, we will set up a secure host system that runs Dockerized services that provide OpenAI-compatible endpoints for LLMs and a high-performance inference server for many model types, all accessible through a secure reverse proxy.
Instead of manually managing Python environments and model files, we will use Docker containers to create isolated, reproducible, and portable environments for each service.
The core components include:
- Docker Engine and Docker Compose: The runtime and automation layer that will manage your AI applications.
- NVIDIA Stack for GPU setups: The drivers and container toolkit for GPU setups.
- vLLM: A highly optimized library and server for LLM inference, which provides an OpenAI-compatible API.
- NVIDIA Triton Inference Server: A powerful server that can host models from multiple frameworks like PyTorch, TensorFlow, ONNX, and TensorRT in one place. This is ideal for vision models, embeddings, or any non-LLM task.
- Caddy: A modern reverse proxy that will handle HTTPS termination, routing, and security.
Now that you know what we are going to build, proceed to the next step to prepare your system.
Prepare the Host Operating System for an AI Serving Environment
Make sure your Ubuntu server is patched, up to date, and ready for the AI inference setup. Update and upgrade the system with the command below:
sudo apt update && sudo apt upgrade -y
Reboot the system to load new settings:
sudo reboot
After rebooting, verify the system is running the latest kernel version:
uname -a
Also, list all block storage devices like SSDs and NVMe drives with their filesystems by running the command below:
lsblk -f
Containerize The Workflow: Installing Docker and Compose
In this step, you will install Docker, the cornerstone of modern application deployment, along with the Compose plugin, which lets you define and manage multi-container applications with a simple YAML file.
Use the following commands to add Docker’s official GPG key and install Docker and Compose plugin on your system:
sudo apt update
sudo apt install ca-certificates curl gnupg lsb-release -y
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(. /etc/os-release; echo $UBUNTU_CODENAME) stable" \
| sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
sudo systemctl enable --now docker
To verify your Docker and Compose installation, check the version:
docker --version
docker compose version
Also, you can use the command below to verify that Docker is working correctly:
docker run --rm hello-world
Security note: Avoid adding the user to the Docker group on multi-tenant servers. While convenient, adding your user to the Docker group grants root-equivalent privileges, as Docker can mount any host directory. Using sudo for every Docker command is safer.
Integrate the NVIDIA Software Stack – Skip If CPU-only
At this point, you must set up the NVIDIA drivers and NVIDIA container toolkit on your system, which lets Docker containers seamlessly use the GPU resources.
Use Ubuntu's automated driver tool to detect your NVIDIA GPU and install the recommended driver:
sudo apt update
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers install
Reboot the system to load the new driver kernel module:
sudo reboot
After rebooting, confirm the driver is loaded; the output shows the GPU model, utilization, temperature, and memory usage:
nvidia-smi
Then, install the NVIDIA container toolkit with the commands below, which enables --gpus all in Docker:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install nvidia-container-toolkit -y
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Use the command below to run a minimal CUDA container and execute nvidia-smi inside the container:
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
If the output is the same GPU info as the host command, your containerized applications now have full GPU access.
Secure and Tune the Server
Now you must secure the server’s network access and implement a key OS tuning parameter to prevent common container failures.
If your containers need to bind-mount directories from the host, create a dedicated non-login system user to own those directories instead of exposing your own account. If you don't mount any host directories, you can simply leave ownership to root, which is safer and simpler.
Install UFW and allow the necessary rules and ports with the commands below:
sudo apt install ufw -y
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 8000/tcp # vLLM example API
sudo ufw allow 8001/tcp # Triton example HTTP
sudo ufw allow 80/tcp # Caddy HTTP (used later for Let's Encrypt and redirects)
sudo ufw allow 443/tcp # Caddy HTTPS
Enable the UFW firewall and check its status with:
sudo ufw enable
sudo ufw status
Tip: While basic firewall setup provides essential network security, you can consider implementing Multi-Factor Authentication (MFA) and Role-Based Access Control (RBAC) to secure server access. For a complete setup, you can check the MFA and RBAC Server Setup Guide.
Note: If your AI model or tokenizer crashes because it runs out of shared memory, give the container more shared memory. In this guide, we will fix this by setting --shm-size=2g, which means 2 gigabytes of shared memory for each container.
Deploying an OpenAI-Compatible LLM API
In this step, we want to deploy vLLM as a container, which provides a fully compliant OpenAI API endpoint. This means any application that works with OpenAI’s API can be pointed to your own private and high-performance server.
vLLM is GPU-oriented; for CPU-only setups, consider Ollama or TGI with quantized models, which we will cover in a later step.
You must choose a model first, and your choice depends on your GPU's VRAM. For example, the "meta-llama/Llama-3.1-8B-Instruct" model needs roughly 16GB of VRAM just for its FP16 weights, so a 24GB GPU gives comfortable headroom for the KV cache.
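As a rough sanity check, you can estimate VRAM needs from the parameter count: FP16/BF16 weights take about 2 bytes per parameter, plus headroom for the KV cache and CUDA overhead. A small back-of-the-envelope sketch in Python (the 30% overhead factor is an assumption, not an exact rule):

# vram_estimate.py - rough sizing sketch; real usage depends on context length,
# batch size, and quantization.
def rough_vram_gb(params_billions: float, bytes_per_param: float = 2.0,
                  overhead_fraction: float = 0.3) -> float:
    weights_gb = params_billions * bytes_per_param   # 8B params * 2 bytes ~= 16 GB
    return weights_gb * (1 + overhead_fraction)      # headroom for KV cache and runtime

print(f"Llama-3.1-8B in FP16: ~{rough_vram_gb(8):.0f} GB VRAM recommended")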
The HF_TOKEN environment variable is only needed if you are pulling a gated model from the Hugging Face Hub:
export HF_TOKEN="hf_xxx"
To serve an OpenAI-compatible endpoint on port 8000, you can run the following command:
docker run --rm -d \
--name vllm \
--gpus all \
-p 8000:8000 \
-e VLLM_WORKER_MULTIPROC_METHOD=spawn \
-e HF_TOKEN="${HF_TOKEN:-}" \
--shm-size=2g \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 8192 \
--host 0.0.0.0
Notes:
- Swap --model with the name (ID) of the Hugging Face model you want to use.
- In production, don’t use “latest” for your Docker image; use a specific version tag or digest so the image doesn’t change unexpectedly.
Now you can run a quick test with curl, sending a Chat Completions request and a simpler Completions request.
For Chat Completions, you can run:
curl http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer dummy" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Write a haiku about fast GPUs."}],
"temperature": 0.7
}'
For Completions, you can run:
curl http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer dummy" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain what KV cache is in 2 sentences.",
"max_tokens": 128
}'
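Because the endpoint is OpenAI-compatible, you can also test it from Python with the official openai package (pip install openai). A minimal sketch, assuming the vLLM container above is listening on port 8000 and no API key has been configured on the server:

# openai_client_test.py - minimal sketch against the local vLLM endpoint
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",  # your private vLLM server
    api_key="dummy",                      # placeholder; no key is enforced on the server
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about fast GPUs."}],
    temperature=0.7,
)
print(response.choices[0].message.content)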
Performance tips you can consider:
- Start with --gpu-memory-utilization 0.85, which safely uses about 85% of your GPU memory.
- You can also set limits like --max-model-len, --max-num-seqs, and --tensor-parallel-size.
- Make sure you’re using CUDA version 12 or higher.
You can create a Docker Compose YAML file for vLLM, which is better for version control and reproducibility.
Create the Compose file with the command below:
sudo nano vllm.compose.yml
Add the following configuration to the file:
services:
  vllm:
    image: vllm/vllm-openai:latest
    command:
      [
        "--model", "meta-llama/Llama-3.1-8B-Instruct",
        "--max-model-len", "8192",
        "--host", "0.0.0.0"
      ]
    environment:
      HF_TOKEN: "${HF_TOKEN:-}"
      VLLM_WORKER_MULTIPROC_METHOD: "spawn"
    ports:
      - "8000:8000"
    shm_size: "2g"
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: ["gpu"]
Then, launch the compose file with the following commands:
HF_TOKEN=hf_xxx docker compose -f vllm.compose.yml up -d
docker compose -f vllm.compose.yml logs -f
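Model download and weight loading can take a few minutes after the container starts. If you script deployments, a small readiness poll is a simple way to wait before sending traffic; the sketch below uses the requests package and the OpenAI-compatible /v1/models endpoint, so adjust the host and timeouts to your setup:

# wait_for_vllm.py - readiness poll sketch for the OpenAI-compatible endpoint
import time
import requests

URL = "http://127.0.0.1:8000/v1/models"

for attempt in range(60):                      # up to ~5 minutes
    try:
        r = requests.get(URL, timeout=2)
        if r.status_code == 200:
            models = [m["id"] for m in r.json().get("data", [])]
            print("vLLM is ready, serving:", models)
            break
    except requests.RequestException:
        pass                                   # server not up yet
    time.sleep(5)
else:
    raise SystemExit("vLLM did not become ready in time")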
Tip: For a comprehensive tutorial on deploying general PyTorch models in production environments, including VPS setups, you can check this guide on PyTorch Model Inference Setup on VPS.
Building a Multi-Framework Inference Hub with Triton
The AI ecosystem is multi-framework: models are trained in PyTorch or TensorFlow and often exported to ONNX. NVIDIA Triton Inference Server can serve them all simultaneously, with advanced features like dynamic batching and concurrent model execution. In this step, we will set up Triton with a sample model repository.
Triton requires a specific filesystem layout. You can create a host directory and set the correct permissions for it with the following commands:
sudo mkdir -p /opt/triton/models
sudo chown -R $USER:$USER /opt/triton
To add an ONNX model, for example ResNet50, use the following directory layout:
/opt/triton/models/
  resnet50/
    1/
      model.onnx
    config.pbtxt
Create a minimal config.pbtxt file with:
name: "resnet50"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
{ name: "input_tensor" data_type: TYPE_FP32 dims: [ 3, 224, 224 ] }
]
output [
{ name: "output_tensor" data_type: TYPE_FP32 dims: [ 1000 ] }
]
You can adjust the names to your model. Also, you can export models from PyTorch to ONNX with torch.onnx.export.
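For reference, here is a minimal export sketch with torch.onnx.export, assuming torch and torchvision are installed and reusing the input/output tensor names from the config.pbtxt above (adapt it to your own model):

# export_resnet50.py - sketch: export torchvision's ResNet50 to ONNX for Triton
import torch
import torchvision

model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
model.eval()

dummy = torch.randn(1, 3, 224, 224)  # NCHW input; Triton adds the batch dim via max_batch_size

torch.onnx.export(
    model,
    dummy,
    "model.onnx",                     # place this at /opt/triton/models/resnet50/1/model.onnx
    input_names=["input_tensor"],     # must match config.pbtxt
    output_names=["output_tensor"],   # must match config.pbtxt
    dynamic_axes={"input_tensor": {0: "batch"}, "output_tensor": {0: "batch"}},
    opset_version=17,
)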
Run Triton with GPU support using the command below. Host ports 8001, 8002, and 8003 map to Triton's HTTP, gRPC, and Prometheus metrics endpoints, respectively:
docker run --rm -d \
  --name triton \
  --gpus all \
  -p 8001:8000 \
  -p 8002:8001 \
  -p 8003:8002 \
  -v /opt/triton/models:/models \
  --shm-size=2g \
  nvcr.io/nvidia/tritonserver:24.09-py3 \
  tritonserver --model-repository=/models --http-port=8000 --grpc-port=8001 --metrics-port=8002
Run a health check by using the command below:
curl -s http://127.0.0.1:8001/v2/health/ready
And list the loaded models with the command below (Triton exposes the model list through a POST to the repository index endpoint):
curl -s -X POST http://127.0.0.1:8001/v2/repository/index
Notes:
- You can send inference requests over HTTP or gRPC using Triton's KServe v2 protocol; official Python and C++ client libraries exist (a minimal Python example follows these notes).
- For CPU-only setups, choose an image variant that uses ONNX Runtime CPU and omit --gpus all.
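As mentioned in the notes, here is a minimal Python client sketch using the tritonclient package (pip install "tritonclient[http]" numpy); it assumes the resnet50 example model and the tensor names from the config above:

# triton_client_test.py - sketch: send one HTTP inference request to Triton
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="127.0.0.1:8001")  # host port mapped to Triton HTTP

batch = np.random.rand(1, 3, 224, 224).astype(np.float32)  # replace with real preprocessed images

inp = httpclient.InferInput("input_tensor", batch.shape, "FP32")
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("output_tensor")

result = client.infer(model_name="resnet50", inputs=[inp], outputs=[out])
scores = result.as_numpy("output_tensor")
print("Top-1 class index:", int(scores[0].argmax()))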
Tip: For a focused guide on production-ready TensorFlow Serving with performance optimization techniques, check this guide on Setting up TensorFlow Serving on Dedicated Servers.
Deploying Alternative Runtimes for CPU-Based Deployment
At this point, you can set up lightweight alternatives like Ollama and custom FastAPI services that are optimized for CPU execution.
Ollama is a simple tool that manages both the model files and the inference backend, and it provides a clean REST API. To deploy it, run the commands below:
docker run --rm -d --name ollama -p 11434:11434 ollama/ollama
# Pull a small model (e.g., Llama 3.2 3B or Qwen 2.5 3B)
curl http://127.0.0.1:11434/api/pull -d '{"name":"qwen2.5:3b"}'
# Generate:
curl http://127.0.0.1:11434/api/generate -d '{"model":"qwen2.5:3b","prompt":"Hello from CPU-only!"}'
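You can call the same API from Python with the requests package; a small sketch, assuming the qwen2.5:3b model pulled above (setting "stream": False returns a single JSON response instead of streamed lines):

# ollama_client_test.py - sketch: one non-streaming generation request to Ollama
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/generate",
    json={"model": "qwen2.5:3b", "prompt": "Hello from CPU-only!", "stream": False},
    timeout=300,  # CPU generation can be slow
)
resp.raise_for_status()
print(resp.json()["response"])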
Text Generation Inference (TGI): While vLLM is GPU-first, TGI has better support for CPU and quantization backends, which makes it a powerful alternative for CPU-based LLM serving.
FastAPI custom microservice for classic ML: You can wrap any Python model, for example from scikit-learn, PyTorch, or XGBoost, in a REST API. You can use the following commands to build and test such a service:
mkdir -p ~/fastapi-ml && cd ~/fastapi-ml
cat > app.py << 'PY'
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
clf = joblib.load("model.joblib")  # your trained model

class Item(BaseModel):
    x: float
    y: float

@app.post("/predict")
def predict(item: Item):
    pred = clf.predict([[item.x, item.y]])[0]
    return {"prediction": float(pred)}
PY
cat > requirements.txt << 'REQ'
fastapi
uvicorn[standard]
joblib
scikit-learn
REQ
cat > Dockerfile << 'DF'
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py model.joblib ./
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
DF
# Train a trivial model (example):
python - << 'PY'
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
import joblib
X,y = make_classification(n_samples=500, n_features=2, random_state=0)
clf=LogisticRegression().fit(X,y)
joblib.dump(clf,"model.joblib")
PY
docker build -t fastapi-ml:latest .
docker run --rm -d -p 8080:8080 --name fastapi-ml fastapi-ml:latest
curl -X POST http://127.0.0.1:8080/predict -H 'Content-Type: application/json' -d '{"x":0.1,"y":-0.2}'
Routing Traffic and Enabling HTTPS with Caddy
A reverse proxy acts as a secure gateway, managing incoming traffic, terminating TLS for HTTPS, and routing requests to the correct backend service. Caddy simplifies this process by automatically handling TLS certificates and renewal from Let’s Encrypt.
We will expose vLLM at llm.example.com and Triton at triton.example.com; both DNS records must point to your server's IP address.
Create the Caddyfile first, so the file exists before the container mounts it:
sudo nano /etc/caddy/Caddyfile
Add the following config to the file:
llm.example.com {
reverse_proxy 127.0.0.1:8000
}
triton.example.com {
reverse_proxy 127.0.0.1:8001
}
Now start Caddy. Running it with host networking lets the container reach the vLLM and Triton services listening on 127.0.0.1:
docker run -d --name caddy \
  --network host \
  -v caddy_data:/data -v caddy_config:/config \
  -v /etc/caddy/Caddyfile:/etc/caddy/Caddyfile \
  caddy:2
Whenever you change the Caddyfile, reload Caddy to apply the changes:
docker exec caddy caddy reload --config /etc/caddy/Caddyfile --adapter caddyfile
That’s it, you are done with deploying AI Serving with Docker on a Dedicated Server.
FAQs
How to fix CUDA Out of Memory (OOM) error?
This is the most common issue in GPU inference. Consider:
Use a smaller or quantized model, lower --gpu-memory-utilization and --max-model-len, or reduce the batch size.
Is it safe to use the latest tag in production for deploying AI serving with Docker?
Generally, no. Using the latest tag can lead to unpredictable behavior when the image is updated. For production, it is best practice to use a specific and versioned tag.
How to fix container failures with Tokenizer or shared memory errors?
This is because of insufficient shared memory inside the container. The Docker default (64MB) is too small for many AI models. The solution is to increase it using the --shm-size=2g flag in your Docker run command or the shm_size: "2g" directive in your Docker Compose file, as shown in the guide.
Conclusion
At this point, you have learned to deploy AI Serving with Docker on a Dedicated Server, which is capable of serving both LLMs and multi-framework vision models using tools like vLLM and NVIDIA Triton for both GPU-accelerated and CPU-only setups.
PerLod Hosting offers pre-configured GPU servers with Docker, the NVIDIA stack, and optional model deployment services, which let you focus on your AI applications rather than the infrastructure.
We hope you enjoyed this guide. Follow our X and Facebook channels to get the latest articles on AI serving and machine learning.