Edge AI Deployments on Private Bare Metal Machines


Deploying Edge AI on dedicated servers means running AI models directly on servers located close to the users or data sources, which brings faster responses, better privacy, and less internet dependence. Benefits of AI deployments on dedicated servers include low latency, data privacy and security, and the ability to keep working even with a limited internet connection.

This tutorial assumes access to a dedicated server with root privileges. If you need ready-to-use infrastructure, PerLod GPU Dedicated Servers include options with NVIDIA GPUs, NVMe storage, and high-bandwidth networking, which makes them ideal for local AI inference and model serving.

Deploy Edge AI on Dedicated Servers Architecture

To deploy Edge AI, we use a dedicated server that can be CPU-only or include NVIDIA, AMD, or Intel accelerators. Make sure your dedicated server has a fast NVMe SSD for storage. The AI services will run inside Docker containers.

The inference server to use depends on your hardware:

  • NVIDIA GPUs: Use NVIDIA Triton Inference Server, which works with TensorRT, ONNX, PyTorch, and TensorFlow.
  • Intel CPUs/iGPUs/NPUs: Use OpenVINO Model Server (OVMS) or ONNX Runtime with the OpenVINO plugin.
  • AMD GPUs: Use ROCm with ONNX Runtime, vLLM, or other ROCm-supported backends.

This guide uses an Ubuntu LTS OS with a sudo user.

Initial Server Setup for Edge AI Deployment

Run a system update and install the required tools to prepare your server for Edge AI deployments:

sudo apt update && sudo apt upgrade -y
sudo apt install build-essential curl gnupg2 ca-certificates lsb-release jq unzip ufw -y

For more security, it is recommended to create a dedicated user to manage the node. Note that the docker group is only created when Docker is installed later in this section, so re-run the usermod command afterwards if it fails here:

sudo useradd -m -s /bin/bash edge && sudo usermod -aG sudo,docker edge

Then, implement some basic system hardening with the commands below:

echo -e "net.ipv4.ip_forward=1\nnet.ipv4.conf.all.rp_filter=1\nnet.ipv4.tcp_syncookies=1" | sudo tee /etc/sysctl.d/99-edge.conf

sudo sysctl --system

Configure firewall rules to deny all incoming traffic by default, and only allow necessary ports:

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
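
If you plan to expose the inference endpoints through the Nginx reverse proxy configured later in this guide, also open the HTTPS port:

sudo ufw allow 443/tcp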

Enable the firewall with the command below:

sudo ufw enable

Also, you must install Docker on your server. You can follow the official Docker documentation to install Docker on Ubuntu via the APT repository; a condensed version of those steps is sketched below.
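
The commands below are a condensed sketch of the official APT-repository installation; double-check them against the current Docker documentation before running them:

# Add Docker's official GPG key and APT repository
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo $VERSION_CODENAME) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install the Docker engine and plugins
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y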

Once you are done with the Docker installation, verify it with the command below:

docker run --rm hello-world

Configuring Hardware Acceleration

To increase your server performance, you must configure hardware acceleration appropriate to your setup. You can choose one of the following paths that match your hardware, including NVIDIA GPUs, Intel processors, or AMD graphics cards.

Path 1: NVIDIA GPU Acceleration

To use NVIDIA hardware acceleration, you need the NVIDIA driver on the host plus the NVIDIA Container Toolkit, which allows Docker containers to access your GPU for optimal performance.
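
If the NVIDIA driver is not yet installed on the host, one common way to add it on Ubuntu is the ubuntu-drivers tool shown below; this is a sketch, and the exact driver package depends on your GPU model:

# Install the recommended NVIDIA driver for the detected GPU, then reboot
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers autoinstall
sudo reboot

# After the reboot, confirm the driver is loaded
nvidia-smi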

With the driver in place, install the NVIDIA Container Toolkit and enable it for Docker with the commands below:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

curl -fsSL https://nvidia.github.io/libnvidia-container/$(. /etc/os-release; \
  echo $ID$VERSION_ID)/libnvidia-container.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update
sudo apt install nvidia-container-toolkit -y
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

To verify that Docker containers can successfully detect and use your NVIDIA GPU, you can run:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Path 2: Intel Acceleration (OpenVINO)

For Intel-based acceleration, OpenVINO provides optimized performance across CPUs, integrated graphics, and AI accelerators without complex driver setups; the deployment steps are covered in the OpenVINO serving section below.
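
As an optional sanity check (package names can vary by Ubuntu release), confirm the Intel GPU render node is visible and install the OpenCL compute runtime used for iGPU offload; plain CPU inference works without it:

# List the DRM render nodes exposed by the Intel iGPU
ls -l /dev/dri

# Install the Intel OpenCL runtime for iGPU offload
sudo apt install intel-opencl-icd -y

If you later want the OpenVINO Model Server container to use the iGPU, remember to pass --device /dev/dri to docker run.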

Path 3: AMD GPU Acceleration (ROCm)

AMD users can utilize the ROCm platform for GPU acceleration, which enables high-performance computing on supported AMD graphics cards.
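
After installing ROCm by following AMD's official guide (linked later in this tutorial), a minimal sanity check could look like this; the group names follow AMD's documented setup:

# Allow the current user to access the GPU device nodes
sudo usermod -aG render,video $USER

# Verify the GPU is visible to the ROCm stack
rocm-smi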

Prepare and Optimize AI Model Inference

Before deployment, you need to export and optimize your trained model for production inference, which ensures maximum performance on your specific hardware.

1. Export to ONNX: Convert your model from its original training framework, such as PyTorch or TensorFlow, into the ONNX format. This allows the model to run on many different types of hardware and software, making it ready for production deployment.

2. Hardware-Specific Optimization: Different accelerators require different optimized model formats. Choose the conversion path below that matches your target hardware to achieve the best possible speed and efficiency (example conversion commands are sketched after this list):

  • NVIDIA GPUs: Convert your models to TensorRT (FP16 or INT8) for faster performance, or let Triton automatically run ONNX models using the TensorRT backend.
  • Intel CPUs/iGPUs/NPUs: Convert models to OpenVINO IR format, and optionally quantize to INT8 for better speed and efficiency.
  • AMD GPUs: Keep models in ONNX format and use ROCm-enabled runtimes for inference.
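
As an illustration of the first two paths, the commands below sketch a FP16 TensorRT engine build with trtexec (run from NVIDIA's TensorRT container) and an OpenVINO IR conversion with the ovc tool. Image tags and tool availability depend on the versions you use, so treat these as templates rather than exact commands:

# NVIDIA: build a FP16 TensorRT engine from an ONNX model inside the TensorRT container
docker run --rm --gpus all -v /opt/models:/models nvcr.io/nvidia/tensorrt:24.10-py3 \
  trtexec --onnx=/models/my_model/1/model.onnx \
  --saveEngine=/models/my_model/1/model.plan --fp16

# Intel: convert ONNX to OpenVINO IR (produces model.xml and model.bin)
pip install openvino
ovc /opt/models/my_model/1/model.onnx --output_model /opt/models/my_model/1/model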

3. Creating the Model Repository Structure: Both Triton and OVMS load models from a versioned directory layout. Note that config.pbtxt sits next to the version folder, not inside it:

/opt/models/
  my_model/
    config.pbtxt        # Triton only
    1/
      model.onnx        # or model.plan (TensorRT) or model.xml/.bin (OpenVINO IR)

Using versioned model directories enables seamless rollbacks and zero-downtime updates, which is essential for maintaining reliable production services.

Deploying an AI Model for Inference

In this step, you must deploy your optimized model to a production-ready inference server. Choose the path below that matches your hardware to start serving predictions.

NVIDIA Triton GPU Serving

For NVIDIA GPU users, Triton Inference Server provides high-performance model serving with support for multiple frameworks and concurrent execution.

Use the commands below to create a model directory and run Triton:

sudo mkdir -p /opt/models/my_model/1
sudo chown -R $USER:$USER /opt/models

# (Put your model artifact in /opt/models/my_model/1/)

# Start Triton
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /opt/models:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 \
  tritonserver --model-repository=/models

Ports used:

  • 8000: HTTP
  • 8001: gRPC
  • 8002: Metrics

To check that the server is up and running, you can run:

curl -s http://localhost:8000/v2/health/ready

Now you can customize how your model runs by defining its input/output specifications, batch size, and GPU instance allocation in a configuration file.

Create the following example file for an ONNX model at /opt/models/my_model/config.pbtxt (next to the version folder, not inside it):

sudo nano /opt/models/my_model/config.pbtxt

Add the following configuration to the file:

name: "my_model"
platform: "onnxruntime_onnx"
max_batch_size: 16
input [
  { name: "input", data_type: TYPE_FP32, dims: [3, 224, 224] }
]
output [
  { name: "logits", data_type: TYPE_FP32, dims: [1000] }
]
instance_group [{ kind: KIND_GPU, count: 1 }]
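
Once the model files and config.pbtxt are in place and Triton has loaded them, you can verify the configuration through the KServe-style REST API (my_model is the example name used throughout this guide):

# Server metadata
curl -s http://localhost:8000/v2 | jq

# Model metadata: inputs, outputs, and available versions
curl -s http://localhost:8000/v2/models/my_model | jq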

Use Triton’s built-in performance analyzer to test your model’s throughput and latency under various load conditions:

docker run --rm --net=host nvcr.io/nvidia/tritonserver:24.10-py3-sdk \
  perf_analyzer -m my_model -u localhost:8000 -i HTTP --concurrency-range 1:64

OpenVINO Intel Acceleration Serving

For Intel hardware, OVMS delivers optimized inference across CPUs, iGPUs, and NPUs with minimal configuration. You can start the OpenVINO Model Server container to serve your optimized IR or ONNX models through gRPC and REST endpoints:

# Directory layout:
# /opt/models/my_model/1/model.xml
# /opt/models/my_model/1/model.bin
# or /opt/models/my_model/1/model.onnx

docker run --rm \
  -p 9000:9000 -p 8001:8001 \
  -v /opt/models:/models \
  openvino/model_server:latest \
  --model_name my_model --model_path /models/my_model \
  --port 9000 --rest_port 8001 --target_device AUTO
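
To confirm OVMS has loaded the model and marked it AVAILABLE, query the REST status endpoint that the server exposes on the rest_port (this uses the my_model name from the command above):

curl -s http://localhost:8001/v1/models/my_model | jq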

Client example:

# pip install ovmsclient
from ovmsclient import make_grpc_client
cli = make_grpc_client("localhost:9000")
outputs = cli.predict({"input": your_numpy_batch}, "my_model")

ONNX Runtime with OpenVINO

For lightweight Python deployments, use ONNX Runtime with the OpenVINO execution provider for flexible CPU/iGPU/NPU inference.

Install ONNX Runtime in a Python virtual environment:

python -m venv venv && source venv/bin/activate
pip install onnxruntime-openvino onnx

Load and run the AI model with:

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=[("OpenVINOExecutionProvider", {"device_type":"AUTO"})]
)  # AUTO chooses from the available CPU/iGPU/NPU devices
# Run inference
res = sess.run(None, {"input": your_numpy_batch})

AMD ROCm Runtime

AMD users can use ROCm-supported runtimes like ONNX Runtime or PyTorch for GPU-accelerated inference on compatible hardware.
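
As a minimal sketch, the ROCm build of PyTorch can be installed from the PyTorch wheel index; the rocm6.1 index shown here is only an example and should be replaced with the index that matches your installed ROCm version:

python -m venv rocm-venv && source rocm-venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/rocm6.1

# On ROCm builds, torch.cuda.* maps to HIP, so this confirms the GPU is usable
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"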

You can follow AMD’s ROCm installation guide for the correct version pairing with your kernel and OS.

Secure AI Inference Service with Nginx Reverse Proxy

To safely expose your AI service to the network, you can set up Nginx as a secure gateway that provides TLS encryption and controlled access to your inference endpoints.

Install Nginx and generate TLS certificates with the commands below:

sudo apt install nginx -y
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/nginx/self.key -out /etc/nginx/self.crt -subj "/CN=edge.local"
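
The self-signed certificate above is fine for private edge networks. If the server has a publicly resolvable domain name, you can obtain a trusted certificate with certbot instead; edge.example.com below is a placeholder for your own domain:

sudo apt install certbot python3-certbot-nginx -y
sudo certbot --nginx -d edge.example.com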

Create the Nginx configuration file with the following command:

sudo nano /etc/nginx/sites-available/edge-ai.conf

Add the following config to the file:

server {
  listen 443 ssl;
  ssl_certificate     /etc/nginx/self.crt;
  ssl_certificate_key /etc/nginx/self.key;

  location /triton/ {
    proxy_pass http://127.0.0.1:8000/;               # Triton HTTP
    proxy_set_header Host $host;
  }
  location /ovms/ {
    proxy_pass http://127.0.0.1:8001/;               # OVMS REST
    proxy_set_header Host $host;
  }
}

Enable your configuration, check for syntax errors, and reload Nginx to apply the changes:

sudo ln -s /etc/nginx/sites-available/edge-ai.conf /etc/nginx/sites-enabled/
sudo nginx -t 
sudo systemctl reload nginx

Tip: For production environments, consider adding authentication layers like JWT tokens or API keys at the proxy level for more restricted access to your inference services.
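
As a lightweight example of such a layer (a sketch, not a complete authentication solution), HTTP basic authentication can be enforced at the proxy; edgeuser is only an example account name:

sudo apt install apache2-utils -y
sudo htpasswd -c /etc/nginx/.htpasswd edgeuser

Then add these two directives inside each location block of edge-ai.conf and reload Nginx:

auth_basic           "Edge AI";
auth_basic_user_file /etc/nginx/.htpasswd;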

FAQs

What is Edge AI, and how is it different from cloud AI?

Edge AI runs AI models directly on local devices or servers, close to where data is generated. Unlike cloud AI, it doesn’t depend on constant internet connectivity and offers faster responses, improved privacy, and reduced bandwidth usage.

Why use dedicated servers for Edge AI?

Dedicated servers offer more processing power, memory, and storage than typical edge devices, along with full hardware control and predictable performance for running larger models locally.

Which hardware platform is best for Edge AI?

It depends on your workload:

  • NVIDIA GPUs: Best for deep learning inference.
  • Intel CPUs/iGPUs: Optimized for OpenVINO-based workloads.
  • AMD GPUs: Ideal for open ecosystems with ROCm support.

Can I integrate Edge AI with IoT systems?

Yes. You can connect edge inference servers to IoT platforms via MQTT, Kafka, or REST APIs.

Conclusion

Deploying Edge AI on dedicated servers means using powerful enterprise hardware to run AI models locally, close to where data is created. By picking the right hardware, optimizing models for your platform, and using modern tools like Triton or OpenVINO for deployment, organizations can get ultra-low latency, keep their data private, reduce internet dependence, and build scalable, easy-to-manage systems.

We hope you enjoyed this guide. Subscribe to our X and Facebook channels to get the latest Edge AI articles.

For readers who would like a more detailed guide on packaging and deploying AI inference services via containers, Containerized AI Serving with Docker on Bare Metal provides step-by-step instructions.

Also, if you are interested in the upstream part of the AI workflow, you can check out the Automated ML Workflow via Dedicated Infrastructure tutorial.
