Automated ML Workflow via Dedicated Infrastructure
Running automated ML workflows on dedicated servers provides a reliable, high-performance environment for training and deploying machine learning models. Dedicated servers are a strong choice for ML workloads because they deliver consistent computing power without contention from other tenants, and they scale better for large datasets and complex models.
In this guide, we will set up an automated ML workflow with Docker and Docker Compose on a dedicated server running Ubuntu 22.04 or Ubuntu 24.04.
System Prerequisites for Automated ML Workflows on Dedicated Servers
The automated ML workflow we want to set up in this guide has several key components, including:
- MinIO (S3) for storing datasets and artifacts.
- PostgreSQL for metadata management.
- MLflow for experiment tracking and model registry.
- Apache Airflow for automation.
- DVC for data versioning.
- Optionally, NVIDIA Triton can be added for model serving.
- Prometheus, Grafana, and cAdvisor handle monitoring.
Before setting up this workflow, you need to prepare your operating system by configuring firewall rules, installing Docker, and other necessary steps.
Install Required Packages and Configure Firewall Rules
Run the system upgrade and install the required packages on your system with the following commands:
sudo apt update && sudo apt upgrade -y
sudo apt install curl ufw ca-certificates gnupg lsb-release jq unzip git -y
Allow SSH and the ports used by the stack through your firewall, then enable it. Allowing SSH first is important so you do not lock yourself out of the remote server:
sudo ufw allow OpenSSH
sudo ufw allow 9000,9001,5000,8080,5432,9090,3000,5555/tcp
sudo ufw enable
Note: In a production environment, restrict these ports to your LAN or VPN instead of exposing them publicly.
Set up Docker and Docker Compose
Use the following commands to install Docker and Docker Compose on your system:
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
| sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
sudo usermod -aG docker $USER
Log out and log back in (or run newgrp docker) so the new group membership takes effect.
Install NVIDIA Driver and Container Toolkit for GPU Servers
If your server has GPUs, install the NVIDIA driver together with the NVIDIA Container Toolkit. Install the appropriate driver with the commands below:
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers autoinstall
Reboot your system, then verify that the NVIDIA driver is loaded with the following commands:
sudo reboot
nvidia-smi
To set up the container toolkit, you can run the following commands:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install nvidia-container-toolkit -y
Configure Docker to recognize and use NVIDIA GPUs through the NVIDIA Container Toolkit with the command below:
sudo nvidia-ctk runtime configure --runtime=docker
Restart Docker to apply the changes:
sudo systemctl restart docker
You can run a quick GPU test with the command below:
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
Tip: For users who prefer ready-to-deploy GPU infrastructure rather than configuring drivers manually, consider using PerLod’s High-performance GPU Dedicated Servers.
Create Main Services With Docker Compose for ML Workflow
You need to create an Env file that sets environment variables to configure connections between the services, including PostgreSQL, MLflow, and MinIO. This allows MLflow to store experiments in PostgreSQL and artifacts in MinIO.
Create the working directory tree with the command below:
sudo mkdir -p ml-platform/{compose,ml,airflow/dags}
Create the env file in the project root by using the command below:
sudo nano ml-platform/.env
Add the following variables to the file:
# .env
POSTGRES_USER=platform
POSTGRES_PASSWORD=platformpass
POSTGRES_DB=platformdb
MLFLOW_BACKEND=postgresql://platform:platformpass@postgres:5432/platformdb
MLFLOW_S3_ENDPOINT_URL=http://minio:9000
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadminsecret
MLFLOW_ARTIFACT_BUCKET=mlflow
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadminsecret
Then, create a Docker compose file for PostgreSQL, MinIO, and MLflow services with the command below:
sudo nano ml-platform/compose/docker-compose.core.yml
Add the following config to the file:
version: "3.9"
services:
postgres:
image: postgres:16
restart: unless-stopped
environment:
POSTGRES_USER: ${POSTGRES_USER}
POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
POSTGRES_DB: ${POSTGRES_DB}
volumes:
- pgdata:/var/lib/postgresql/data
networks: [core]
ports:
- "5432:5432"
minio:
image: minio/minio:latest
command: server /data --console-address ":9001"
environment:
MINIO_ROOT_USER: ${MINIO_ROOT_USER}
MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
volumes:
- minio:/data
networks: [core]
ports:
- "9000:9000"
- "9001:9001"
mlflow:
image: python:3.11-slim
restart: unless-stopped
depends_on: [postgres, minio]
working_dir: /app
environment:
MLFLOW_BACKEND_STORE_URI: ${MLFLOW_BACKEND}
MLFLOW_S3_ENDPOINT_URL: ${MLFLOW_S3_ENDPOINT_URL}
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
BACKEND_BUCKET: ${MLFLOW_ARTIFACT_BUCKET}
command: >
sh -c "pip install --no-cache-dir mlflow[boto3]==2.14.1 psycopg2-binary==2.9.9 &&
mlflow server
--backend-store-uri ${MLFLOW_BACKEND_STORE_URI}
--default-artifact-root s3://${BACKEND_BUCKET}
--host 0.0.0.0 --port 5000"
networks: [core]
ports:
- "5000:5000"
volumes:
pgdata:
minio:
networks:
core:
This Docker compose file creates a complete ML platform with experiment tracking and artifact storage. You can check the official Docker Compose documentation for detailed configuration syntax, service options, and advanced networking setups.
Switch to the compose directory and start the stack, pointing Compose at the .env file in the project root:
cd ml-platform/compose
docker compose --env-file ../.env -f docker-compose.core.yml up -d
Initialize MLflow Artifact Storage in MinIO
At this point, you must configure MinIO object storage to store MLflow artifacts by creating a dedicated bucket where MLflow will save experiment results, models, and other machine learning artifacts for tracking and versioning.
Download and install the MinIO client system-wide on your server with the following commands:
sudo wget https://dl.min.io/client/mc/release/linux-amd64/mc -O mc
sudo chmod +x mc && sudo mv mc /usr/local/bin/
Load the MinIO credentials from your .env file into the current shell (run this from the directory that contains ml-platform, adjusting the path if needed), then configure the connection to the MinIO server:
set -a; . ml-platform/.env; set +a
mc alias set local http://127.0.0.1:9000 "${MINIO_ROOT_USER}" "${MINIO_ROOT_PASSWORD}"
Create the bucket by using the command below:
mc mb local/${MLFLOW_ARTIFACT_BUCKET}
You can list the buckets with the following command:
mc ls local
You can access the MinIO console by navigating to the URL below:
http://server_ip:9001
Also, access MLflow Web UI with:
http://server_ip:5000
Set up DVC with MinIO for Data Versioning
Now you must configure DVC (Data Version Control) to use MinIO as remote storage for datasets and models. This will enable version control for large data files while storing them efficiently in S3-compatible storage.
Run the following commands from your repo root to set up DVC and configure MinIO as its remote storage. The ${...} values are expanded from the .env file you loaded into your shell earlier. On Ubuntu 24.04, install pipx first with sudo apt install pipx, since the system pip refuses to install packages outside a virtual environment:
git init
pipx install "dvc[s3]" 2>/dev/null || python3 -m pip install --user "dvc[s3]"
dvc init
# Configure S3 remote (MinIO)
dvc remote add -d s3remote s3://${MLFLOW_ARTIFACT_BUCKET}/dvc
dvc remote modify s3remote endpointurl http://127.0.0.1:9000
dvc remote modify s3remote use_ssl false
# Store the credentials in the local (gitignored) config so they are not committed
dvc remote modify --local s3remote access_key_id ${MINIO_ROOT_USER}
dvc remote modify --local s3remote secret_access_key ${MINIO_ROOT_PASSWORD}
git add .dvc
git commit -m "Init DVC with MinIO remote"
Then, track and version a dataset with DVC using the following commands:
mkdir -p data/raw && cp /path/to/your.csv data/raw/
dvc add data/raw/your.csv
git add data/raw/your.csv.dvc data/raw/.gitignore
git commit -m "Track dataset with DVC"
dvc push
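Later, your training code can pull a specific, versioned copy of the dataset straight from the DVC remote instead of relying on whatever happens to be in the working tree. Below is a minimal sketch using DVC's Python API; the file path, revision, and remote name mirror the examples above and are assumptions you should adjust to your repository:
import dvc.api
import pandas as pd

# Read the dataset exactly as it was committed at a given Git revision.
# Replace "HEAD" with any tag or commit hash that tracked the file.
with dvc.api.open("data/raw/your.csv", repo=".", rev="HEAD", remote="s3remote") as f:
    df = pd.read_csv(f)

print(df.shape)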
ML Model Training with MLflow Experiment Tracking
In this step, we will train a machine learning model with MLflow experiment tracking. The example below trains a simple classifier while automatically logging parameters, metrics, and the model itself to the MLflow server for experiment management and reproducibility.
Create the requirements file with the following command:
sudo nano ml-platform/ml/requirements.txt
Add the following packages to the file:
numpy
pandas
scikit-learn
mlflow==2.14.1
boto3
Then, create the train file with the following command:
sudo nano ml-platform/ml/train.py
Add the following training script to the file:
import os
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

MLFLOW_TRACKING_URI = os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000")
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment("demo-exp")

def main():
    # Load data (could be pulled with DVC)
    df = pd.read_csv("data/raw/your.csv")
    X = df.drop(columns=["label"])
    y = df["label"]
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

    with mlflow.start_run():
        params = {"C": 1.0, "max_iter": 200}
        model = LogisticRegression(**params).fit(Xtr, ytr)
        preds = model.predict(Xte)
        f1 = f1_score(yte, preds, average="macro")
        mlflow.log_params(params)
        mlflow.log_metric("f1_macro", f1)
        mlflow.sklearn.log_model(model, artifact_path="model")
        print(f"F1 (macro): {f1:.4f}")

if __name__ == "__main__":
    main()
Now you can run the training locally in a Python virtual environment from the ml-platform directory (your repository root). Export the MinIO endpoint and credentials as well, because the MLflow client uploads artifacts directly to the S3 endpoint; the values come from the .env file you sourced earlier:
python3 -m venv .venv && . .venv/bin/activate
pip install -r ml/requirements.txt
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
export MLFLOW_S3_ENDPOINT_URL=http://127.0.0.1:9000
export AWS_ACCESS_KEY_ID=${MINIO_ROOT_USER}
export AWS_SECRET_ACCESS_KEY=${MINIO_ROOT_PASSWORD}
python ml/train.py
From the MLflow Web UI, you can verify that the run appears in the demo-exp experiment and that the model is stored under the run's artifacts.
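You can also check the logged model programmatically. The following is a small sketch that loads the model back from the tracking server; the run ID is a placeholder you must copy from the MLflow UI, and the MinIO environment variables exported above need to be set in the same shell:
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Replace <RUN_ID> with the run ID shown in the MLflow UI.
model = mlflow.sklearn.load_model("runs:/<RUN_ID>/model")
print(type(model))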
ML Pipeline Automation with Apache Airflow
At this point, you can automate machine learning workflows using Apache Airflow. This setup will create a complete automation system that can pull data with DVC, run training jobs, and log experiments to MLflow, all managed through scheduled pipelines with monitoring and retry capabilities.
Create the Airflow requirements file with the command below:
sudo nano ml-platform/airflow/requirements.txt
Add the following packages to the file:
apache-airflow-providers-http
apache-airflow-providers-cncf-kubernetes
boto3
mlflow==2.14.1
dvc[s3]
Then, create the Airflow Docker Compose file with the command below:
sudo nano ml-platform/compose/docker-compose.airflow.yml
Add the following configuration to the file:
version: "3.9"
x-airflow-common: &airflow-common
image: apache/airflow:2.9.3
environment:
AIRFLOW__CORE__LOAD_EXAMPLES: "False"
AIRFLOW__CORE__EXECUTOR: LocalExecutor
AIRFLOW__CORE__FERNET_KEY: "generate_a_fernet_key_and_put_here"
AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://platform:platformpass@postgres:5432/airflow
_PIP_ADDITIONAL_REQUIREMENTS: "apache-airflow-providers-http apache-airflow-providers-cncf-kubernetes boto3 mlflow==2.14.1 dvc[s3]"
# For MLflow & MinIO access inside tasks:
MLFLOW_TRACKING_URI: "http://mlflow:5000"
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
MLFLOW_S3_ENDPOINT_URL: ${MLFLOW_S3_ENDPOINT_URL}
volumes:
- ../airflow/dags:/opt/airflow/dags
- ../ml:/opt/airflow/ml
depends_on:
- postgres
- minio
- mlflow
networks: [core]
services:
airflow-webserver:
<<: *airflow-common
command: webserver
ports: ["8080:8080"]
airflow-scheduler:
<<: *airflow-common
command: scheduler
airflow-init:
<<: *airflow-common
entrypoint: /bin/bash
command: -c "airflow db migrate && airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email ad***@*****le.com"
networks:
core:
external: true
Once you are done, navigate to the compose directory, create the separate Airflow metadata database in PostgreSQL (the connection string above points at a database named airflow), and then run the Airflow Compose file:
cd ml-platform/compose
docker compose --env-file ../.env -f docker-compose.core.yml exec postgres createdb -U platform airflow
docker compose --env-file ../.env -f docker-compose.airflow.yml run --rm airflow-init
docker compose --env-file ../.env -f docker-compose.airflow.yml up -d
You can access the Airflow Web UI at the URL below. Log in with the admin credentials created by the init command and change them:
http://server_ip:8080
Now, create the Airflow DAG (Directed Acyclic Graph) that defines a machine learning pipeline with the command below:
sudo nano ml-platform/airflow/dags/train_example.py
Add the following ML pipeline to the file:
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"owner": "you", "retries": 0}

with DAG(
    dag_id="train_example",
    default_args=default_args,
    start_date=datetime(2025, 1, 1),
    schedule_interval=None,
    catchup=False
) as dag:

    # Pull latest data with DVC (optional)
    dvc_pull = BashOperator(
        task_id="dvc_pull",
        bash_command="""
        cd /opt/airflow && \
        dvc pull -v
        """
    )

    # Install ML deps (in-task; or bake them into a custom image)
    pip_install = BashOperator(
        task_id="pip_install",
        bash_command="pip install -r /opt/airflow/ml/requirements.txt --no-cache-dir"
    )

    # Run training & log to MLflow
    train = BashOperator(
        task_id="train",
        # Keep the container's own environment and only add these variables
        append_env=True,
        env={
            "MLFLOW_TRACKING_URI": "http://mlflow:5000",
            "AWS_ACCESS_KEY_ID": "{{ var.value.get('AWS_ACCESS_KEY_ID', '') }}",
            "AWS_SECRET_ACCESS_KEY": "{{ var.value.get('AWS_SECRET_ACCESS_KEY', '') }}",
            "MLFLOW_S3_ENDPOINT_URL": "http://minio:9000",
        },
        bash_command="""
        cd /opt/airflow && \
        python ml/train.py
        """
    )

    dvc_pull >> pip_install >> train
New DAGs start paused, so unpause train_example in the Airflow UI first. Then you can run the DAG using the Airflow REST API:
curl -X POST "http://localhost:8080/api/v1/dags/train_example/dagRuns" \
-u admin:admin \
-H "Content-Type: application/json" \
-d '{"conf": {"run_id": "manual_1"}}'
Tip: To expand automation beyond ML workflows and include infrastructure and hosting operations, check out the AI Automated Hosting Operations tutorial.
CI/CD Pipeline for ML Platform with GitHub Actions
You can also set up a GitHub Actions workflow, which automates testing, containerization, and deployment of machine learning code. It runs tests, builds Docker images, pushes to a container registry, and can optionally trigger Airflow pipelines.
Create the GitHub Actions workflow file with the following commands:
sudo mkdir -p .github/workflows
sudo nano .github/workflows/ci.yml
Add the following config to the file:
name: ci
on:
push:
branches: [ "main" ]
jobs:
test-build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with: { python-version: "3.11" }
- run: python -m pip install -r ml/requirements.txt
- run: python -m pip install pytest
- run: pytest -q || true # put real tests here
# Build and push image with your registry (example: GHCR)
- name: Log in to GHCR
run: echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
- name: Build
run: docker build -t ghcr.io/${{ github.repository }}/ml-app:latest -f Dockerfile .
- name: Push
run: docker push ghcr.io/${{ github.repository }}/ml-app:latest
# Optional: trigger Airflow DAG
- name: Trigger Airflow
run: |
curl -X POST "http://YOUR_AIRFLOW_URL/api/v1/dags/train_example/dagRuns" \
-u "${{ secrets.AIRFLOW_USER }}:${{ secrets.AIRFLOW_PASS }}" \
-H "Content-Type: application/json" \
-d '{"conf":{"source":"ci"}}'
If you prefer containerized tasks, you can create a simple Dockerfile to package your training code:
FROM python:3.11-slim
WORKDIR /app
COPY ml/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ml/ /app/ml/
CMD ["python", "ml/train.py"]
ML Model Serving with NVIDIA Triton Inference Server
You can deploy machine learning models for high-performance inference using NVIDIA Triton. It provides a production-ready serving solution with support for multiple frameworks such as TensorFlow, PyTorch, ONNX, and optimized GPU inference through a scalable inference server.
You can start an NVIDIA Triton Inference Server with GPU support by using the following commands:
mkdir -p serving/models
# Put your model repo under serving/models/<model_name>/ (config.pbtxt inside)
docker run -d --name triton \
--gpus all \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v $PWD/serving/models:/models \
nvcr.io/nvidia/tritonserver:24.05-py3 \
tritonserver --model-repository=/models
An alternative is to serve a simple model with FastAPI and Docker. This exposes a RESTful inference API that can be containerized and deployed behind NGINX, a straightforward option for production deployment without complex model conversion, as sketched below.
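As a rough illustration of that approach, the sketch below wraps an MLflow-logged scikit-learn model in a FastAPI app. It assumes fastapi, uvicorn, mlflow, and boto3 are installed, and the MODEL_URI value is a placeholder for one of your own runs or registered models:
import os
import mlflow.sklearn
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Placeholder: point this at a run or registered model on your MLflow server.
MODEL_URI = os.environ.get("MODEL_URI", "runs:/<RUN_ID>/model")
model = mlflow.sklearn.load_model(MODEL_URI)

class PredictRequest(BaseModel):
    # One row of features, keyed by column name
    features: dict

@app.post("/predict")
def predict(req: PredictRequest):
    df = pd.DataFrame([req.features])
    prediction = model.predict(df)
    return {"prediction": prediction.tolist()}
You could run this with, for example, uvicorn serve:app --host 0.0.0.0 --port 8088, containerize it with a Dockerfile similar to the one shown earlier, and place NGINX in front of it.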
Tip: For a deeper guide on configuring and optimizing PyTorch inference environments on virtual private servers, check PyTorch Model Inference Setup on VPS.
ML Platform Monitoring with Prometheus and Grafana
Monitoring the ML workflow is an essential task. This setup collects metrics from containers, hosts, and services with Prometheus, visualizes them in Grafana dashboards, and tracks resource usage, giving you full visibility into system performance and model-serving metrics.
Create the monitoring Compose file with the command below:
sudo nano ml-platform/compose/docker-compose.monitoring.yml
Add the following configuration to the file:
version: "3.9"
services:
prometheus:
image: prom/prometheus:latest
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
networks: [core]
ports: ["9090:9090"]
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
privileged: true
networks: [core]
ports: ["5555:8080"]
volumes:
- /:/rootfs:ro
- /var/run:/var/run:rw
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
grafana:
image: grafana/grafana:latest
networks: [core]
ports: ["3000:3000"]
volumes:
- grafana:/var/lib/grafana
volumes:
grafana:
networks:
core:
external: true
Then, create the Prometheus file with the command below:
sudo nano ml-platform/compose/prometheus.yml
Add the configuration to the file:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs: [{ targets: ['prometheus:9090'] }]

  - job_name: 'cadvisor'
    static_configs: [{ targets: ['cadvisor:8080'] }]

  - job_name: 'triton'
    static_configs: [{ targets: ['HOST_IP_OR_TRITON:8002'] }]
Navigate to the compose directory and bring up the monitoring:
cd ml-platform/compose
docker compose -f docker-compose.monitoring.yml up -d
Access the Grafana dashboard with the following URL and change the default admin credentials:
http://server_ip:3000
In Grafana, add Prometheus as a data source (URL: http://prometheus:9090, since both services share the core network) and import pre-built dashboards to monitor Docker containers, host resources, and system performance in real time.
For additional insights on maximizing performance and resource utilization, you can refer to Optimizing VPS Resource Allocation with AI.
That’s it. You have finished setting up an automated ML workflow on a dedicated server.
FAQs
What is an automated ML workflow?
An automated ML workflow uses tools like Airflow, MLflow, and DVC to create repeatable processes from data to deployment, which reduces errors and speeds up experimentation.
Why should I use dedicated infrastructure for ML services?
A dedicated server gives you full control over resource allocation, data governance, and cost optimization, which is perfect for ML services.
What’s the difference between MLflow and DVC?
MLflow manages experiments, model metrics, and artifact storage, while DVC tracks dataset and model versions linked to Git commits.
Conclusion
Building automated ML workflows on dedicated servers gives you the power to manage the entire ML platform from data collection to production deployment, with full visibility, reproducibility, and scalability. By using tools like Docker, Airflow, MLflow, DVC, and Prometheus, you create a powerful foundation for continuous experimentation and model delivery.
This tutorial was tested and deployed on PerLod Hosting, which offers optimized VPS and dedicated GPU servers for machine learning, AI automation, and high-performance computing.
We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest updates and articles in machine learning.