Automated ML Workflow via Dedicated Infrastructure

Running automated ML workflows on dedicated servers provides a reliable, high-performance environment for training and deploying machine learning models. Dedicated servers deliver consistent compute power without noisy-neighbor slowdowns, and they scale more predictably for large datasets and complex models.

In this guide, we use Ubuntu 22.04 or Ubuntu 24.04 on a dedicated server and set up an automated ML workflow with Docker and Docker Compose.

System Prerequisites for Automated ML Workflows on Dedicated Servers

The automated ML workflow we want to set up in this guide has several key components, including:

  • MinIO (S3) for storing datasets and artifacts.
  • PostgreSQL for metadata management.
  • MLflow for experiment tracking and model registry.
  • Apache Airflow for automation.
  • DVC for data versioning.
  • Optionally, NVIDIA Triton can be added for model serving.
  • Prometheus, Grafana, and cAdvisor handle monitoring.

Before setting up this workflow, you need to prepare your operating system by configuring firewall rules, installing Docker, and other necessary steps.
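
For orientation, this is the directory layout the guide builds up, assuming the Git repository root is the ml-platform directory (the serving/ and .github/ paths are only needed if you follow the optional Triton and CI/CD sections):

ml-platform/
├── .env
├── compose/
│   ├── docker-compose.core.yml
│   ├── docker-compose.airflow.yml
│   ├── docker-compose.monitoring.yml
│   └── prometheus.yml
├── ml/
│   ├── requirements.txt
│   └── train.py
├── airflow/
│   ├── requirements.txt
│   └── dags/train_example.py
├── data/raw/
├── serving/models/
└── .github/workflows/ci.yml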

Install Required Packages and Configure Firewall Rules

Run the system upgrade and install the required packages on your system with the following commands:

sudo apt update && sudo apt upgrade -y
sudo apt install curl ufw ca-certificates gnupg lsb-release jq unzip git -y

Allow the necessary ports through your firewall and enable the firewall with the commands below:

sudo ufw allow 9000,9001,5000,8080,5432,9090,3000,5555/tcp
sudo ufw enable

Note: In a production environment, restrict these ports to your LAN or VPN instead of exposing them publicly, for example with rules like sudo ufw allow from 10.0.0.0/24 to any port 9000 proto tcp.

Set up Docker and Docker Compose

Use the following commands to install Docker and Docker Compose on your system:

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
 | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" \
 | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y

sudo usermod -aG docker $USER

Log out and back in (or run newgrp docker) so the group change takes effect.

Install NVIDIA Driver and Container Toolkit for GPU Servers

If your server has GPUs, you must install the NVIDIA driver together with the NVIDIA Container Toolkit. Install the appropriate driver on your system with the commands below:

sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers autoinstall

Reboot your system, and after that, check for NVIDIA drivers with the command below:

sudo reboot
nvidia-smi

To set up the container toolkit, you can run the following commands:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
 | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
 | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#' \
 | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt update && sudo apt install nvidia-container-toolkit -y

Configure Docker to recognize and use NVIDIA GPUs through the NVIDIA Container Toolkit with the command below:

sudo nvidia-ctk runtime configure --runtime=docker

Restart Docker to apply the changes:

sudo systemctl restart docker

You can run a quick GPU test with the command below:

docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi

Tip: For users who prefer ready-to-deploy GPU infrastructure rather than configuring drivers manually, consider using PerLod’s High-performance GPU Dedicated Servers.

Create Main Services With Docker Compose for ML Workflow

You need to create a .env file that sets environment variables to configure connections between the services, including PostgreSQL, MLflow, and MinIO. This allows MLflow to store experiment metadata in PostgreSQL and artifacts in MinIO.

Create the working directory and the subdirectories used later in this guide with the command below:

sudo mkdir -p ml-platform/compose ml-platform/ml ml-platform/airflow/dags

Create the env file in your working directory by using the command below:

sudo nano ml-platform/.env

Add the following variables to the file:

# .env
POSTGRES_USER=platform
POSTGRES_PASSWORD=platformpass
POSTGRES_DB=platformdb

MLFLOW_BACKEND=postgresql://platform:platformpass@postgres:5432/platformdb
MLFLOW_S3_ENDPOINT_URL=http://minio:9000
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadminsecret
MLFLOW_ARTIFACT_BUCKET=mlflow

MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadminsecret

Then, create a Docker compose file for PostgreSQL, MinIO, and MLflow services with the command below:

sudo nano ml-platform/compose/docker-compose.core.yml

Add the following config to the file:

version: "3.9"
services:
  postgres:
    image: postgres:16
    restart: unless-stopped
    environment:
      POSTGRES_USER: ${POSTGRES_USER}
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DB}
    volumes:
      - pgdata:/var/lib/postgresql/data
    networks: [core]
    ports:
      - "5432:5432"

  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${MINIO_ROOT_USER}
      MINIO_ROOT_PASSWORD: ${MINIO_ROOT_PASSWORD}
    volumes:
      - minio:/data
    networks: [core]
    ports:
      - "9000:9000"
      - "9001:9001"

  mlflow:
    image: python:3.11-slim
    restart: unless-stopped
    depends_on: [postgres, minio]
    working_dir: /app
    environment:
      MLFLOW_BACKEND_STORE_URI: ${MLFLOW_BACKEND}
      MLFLOW_S3_ENDPOINT_URL: ${MLFLOW_S3_ENDPOINT_URL}
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      BACKEND_BUCKET: ${MLFLOW_ARTIFACT_BUCKET}
    command: >
      sh -c "pip install --no-cache-dir mlflow==2.14.1 boto3 psycopg2-binary==2.9.9 &&
      mlflow server --backend-store-uri $${MLFLOW_BACKEND_STORE_URI}
      --default-artifact-root s3://$${BACKEND_BUCKET}
      --host 0.0.0.0 --port 5000"
    networks: [core]
    ports:
      - "5000:5000"

volumes:
  pgdata:
  minio:

networks:
  core:
    name: core

This Docker compose file creates a complete ML platform with experiment tracking and artifact storage. You can check the official Docker Compose documentation for detailed configuration syntax, service options, and advanced networking setups.

Switch to the Compose directory and bring up the core stack. Because the .env file lives one directory up, pass it to Docker Compose explicitly with --env-file:

cd ml-platform/compose
docker compose --env-file ../.env -f docker-compose.core.yml up -d

Initialize MLflow Artifact Storage in MinIO

At this point, you must configure MinIO object storage for MLflow artifacts by creating a dedicated bucket where MLflow will save experiment results, models, and other machine learning artifacts for tracking and versioning.

Download and install the MinIO client system-wide on your server with the following commands:

sudo wget https://dl.min.io/client/mc/release/linux-amd64/mc -O mc
sudo chmod +x mc && sudo mv mc /usr/local/bin/

Load the MinIO credentials from the .env file into your shell and configure the connection to the MinIO server with the following commands:

set -a; . ml-platform/.env; set +a
mc alias set local http://127.0.0.1:9000 ${MINIO_ROOT_USER} ${MINIO_ROOT_PASSWORD}

Create the bucket by using the command below:

mc mb local/${MLFLOW_ARTIFACT_BUCKET}

You can list the bucket with the following command:

mc ls local
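
If you prefer to verify the setup from code, a short boto3 check against the MinIO endpoint looks like this (a sketch using the credentials from the .env file above):

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadminsecret",
)
print([b["Name"] for b in s3.list_buckets()["Buckets"]])  # expect ['mlflow']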

You can access the MinIO console by navigating to the URL below:

http://server_ip:9001

Also, access MLflow Web UI with:

http://server_ip:5000

Set up DVC with MinIO for Data Versioning

Now you must configure DVC (Data Version Control) to use MinIO as remote storage for datasets and models. This will enable version control for large data files while storing them efficiently in S3-compatible storage.

Run the following commands from your repo root to set up DVC and configure MinIO as remote storage for DVC files:

git init
pipx install "dvc[s3]" 2>/dev/null || python3 -m pip install --user "dvc[s3]"
dvc init

# Configure S3 remote (MinIO)
dvc remote add -d s3remote s3://${MLFLOW_ARTIFACT_BUCKET}/dvc
dvc remote modify s3remote endpointurl http://127.0.0.1:9000
# --local keeps the credentials out of Git (they are written to .dvc/config.local)
dvc remote modify --local s3remote access_key_id ${MINIO_ROOT_USER}
dvc remote modify --local s3remote secret_access_key ${MINIO_ROOT_PASSWORD}
dvc remote modify s3remote use_ssl false
git add .dvc .gitignore
git commit -m "Init DVC with MinIO remote"

Then, track and version a dataset with DVC with the following commands:

mkdir -p data/raw && cp /path/to/your.csv data/raw/
dvc add data/raw/your.csv
git add data/raw/your.csv.dvc data/raw/.gitignore
git commit -m "Track dataset with DVC"
dvc push
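
Later, training code can read a specific version of the dataset straight from the repository. A small sketch using the dvc.api module (assuming the commands above were run at the repo root and the file is named your.csv):

import dvc.api
import pandas as pd

# rev can be any Git commit, branch, or tag; omit it to read the current workspace.
with dvc.api.open("data/raw/your.csv", repo=".", rev="HEAD") as f:
    df = pd.read_csv(f)

print(df.shape)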

ML Model Training with MLflow Experiment Tracking

In this step, we will show you machine learning training with comprehensive MLflow logging. It includes a complete workflow for training a classifier while automatically tracking parameters, metrics, and models to the MLflow server for experiment management and reproducibility.

Create the requirements file with the following command:

sudo nano ml-platform/ml/requirements.txt

Add the following packages to the file:

numpy
pandas
scikit-learn
mlflow==2.14.1
boto3

Then, create the train file with the following command:

sudo nano ml-platform/ml/train.py

Add the following code to the file:

import os
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

MLFLOW_TRACKING_URI = os.environ.get("MLFLOW_TRACKING_URI", "http://localhost:5000")
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
mlflow.set_experiment("demo-exp")

def main():
    # Load data (could be pulled with DVC)
    df = pd.read_csv("data/raw/your.csv")
    X = df.drop(columns=["label"])
    y = df["label"]

    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=42)

    with mlflow.start_run():
        params = {"C": 1.0, "max_iter": 200}
        model = LogisticRegression(**params).fit(Xtr, ytr)
        preds = model.predict(Xte)
        f1 = f1_score(yte, preds, average="macro")

        mlflow.log_params(params)
        mlflow.log_metric("f1_macro", f1)
        mlflow.sklearn.log_model(model, artifact_path="model")

        print(f"F1 (macro): {f1:.4f}")

if __name__ == "__main__":
    main()

Now you can run the training script locally in a Python virtual environment from the repo root. The S3-related variables are needed so the MLflow client can upload artifacts to MinIO:

python3 -m venv .venv && . .venv/bin/activate
pip install -r ml/requirements.txt
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
export MLFLOW_S3_ENDPOINT_URL=http://127.0.0.1:9000
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadminsecret
python ml/train.py

From the MLflow Web UI, you can verify that the model appears under the run’s artifacts.
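
You can also inspect runs programmatically and push the trained model into the MLflow Model Registry mentioned earlier. A minimal sketch, run with the same environment variables exported as above so the artifacts in MinIO are reachable (the registered name demo-model is just an example):

import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://127.0.0.1:5000")

# The most recent run of the demo experiment (search_runs returns a DataFrame,
# ordered by start time by default).
runs = mlflow.search_runs(experiment_names=["demo-exp"], max_results=1)
run_id = runs.iloc[0]["run_id"]
print("latest run:", run_id, "f1_macro:", runs.iloc[0]["metrics.f1_macro"])

# Register the logged model under a name in the Model Registry.
mlflow.register_model(f"runs:/{run_id}/model", "demo-model")

# Load the model back for a quick local sanity check.
model = mlflow.sklearn.load_model(f"runs:/{run_id}/model")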

ML Pipeline Automation with Apache Airflow

At this point, you can automate machine learning workflows using Apache Airflow. This setup will create a complete automation system that can pull data with DVC, run training jobs, and log experiments to MLflow, all managed through scheduled pipelines with monitoring and retry capabilities.

Create the Airflow requirements file with the command below:

sudo nano ml-platform/airflow/requirements.txt

Add the following packages to the file:

apache-airflow-providers-http
apache-airflow-providers-cncf-kubernetes
boto3
mlflow==2.14.1
dvc[s3]

Then, create the Airflow Docker Compose file with the command below:

sudo nano ml-platform/compose/docker-compose.airflow.yml

Add the following configuration to the file:

version: "3.9"
x-airflow-common: &airflow-common
  image: apache/airflow:2.9.3
  environment:
    AIRFLOW__CORE__LOAD_EXAMPLES: "False"
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__FERNET_KEY: "generate_a_fernet_key_and_put_here"
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://platform:platformpass@postgres:5432/airflow
    _PIP_ADDITIONAL_REQUIREMENTS: "apache-airflow-providers-http apache-airflow-providers-cncf-kubernetes boto3 mlflow==2.14.1 dvc[s3]"
    # For MLflow & MinIO access inside tasks:
    MLFLOW_TRACKING_URI: "http://mlflow:5000"
    AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
    AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
    MLFLOW_S3_ENDPOINT_URL: ${MLFLOW_S3_ENDPOINT_URL}
  volumes:
    - ../airflow/dags:/opt/airflow/dags
    - ../ml:/opt/airflow/ml
  # The core stack (postgres, minio, mlflow) must already be running; it is
  # reached over the shared "core" network, since depends_on cannot reference
  # services defined in another Compose file.
  networks: [core]
services:
  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports: ["8080:8080"]
  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command: -c "airflow db migrate && airflow users create --username admin --password admin --firstname Admin --lastname User --role Admin --email ad***@*****le.com"

networks:
  core:
    external: true
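
Replace the generate_a_fernet_key_and_put_here placeholder with a real key before starting the stack. One way to generate it, using the cryptography library that ships with the Airflow image, is:

docker run --rm apache/airflow:2.9.3 python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"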

Once you are done, create the airflow metadata database in PostgreSQL (the core stack only created platformdb), then run the Airflow Compose file from the Compose directory, again passing the .env file explicitly:

cd ml-platform/compose
docker compose --env-file ../.env -f docker-compose.core.yml exec postgres createdb -U platform airflow
docker compose --env-file ../.env -f docker-compose.airflow.yml run --rm airflow-init
docker compose --env-file ../.env -f docker-compose.airflow.yml up -d

You can access the Airflow Web UI at the URL below and log in with the admin credentials created by the init command (change them afterward):

http://server_ip:8080

Now, create the Airflow DAG (Directed Acyclic Graph) that defines a machine learning pipeline with the command below:

sudo nano ml-platform/airflow/dags/train_example.py

Add the following ML pipeline to the file:

import os
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {"owner": "you", "retries": 0}

with DAG(
    dag_id="train_example",
    default_args=default_args,
    start_date=datetime(2025, 1, 1),
    schedule_interval=None,
    catchup=False
) as dag:

    # Pull latest data with DVC (optional; assumes the DVC repo, including .dvc/ and data/, is available under /opt/airflow)
    dvc_pull = BashOperator(
        task_id="dvc_pull",
        bash_command="""
        cd /opt/airflow && \
        dvc pull -v
        """
    )

    # Install ML deps (in-Task; or bake into custom image)
    pip_install = BashOperator(
        task_id="pip_install",
        bash_command="pip install -r /opt/airflow/ml/requirements.txt --no-cache-dir"
    )

    # Run training & log to MLflow
    train = BashOperator(
        task_id="train",
        env={
          "MLFLOW_TRACKING_URI": "http://mlflow:5000",
          # Credentials come from the container environment set in the Compose file.
          "AWS_ACCESS_KEY_ID": os.environ.get("AWS_ACCESS_KEY_ID", ""),
          "AWS_SECRET_ACCESS_KEY": os.environ.get("AWS_SECRET_ACCESS_KEY", ""),
          "MLFLOW_S3_ENDPOINT_URL": "http://minio:9000",
        },
        bash_command="""
        cd /opt/airflow && \
        python ml/train.py
        """
    )

    dvc_pull >> pip_install >> train

After unpausing the DAG in the Airflow UI (or with airflow dags unpause train_example), you can trigger it through the Airflow REST API:

curl -X POST "http://localhost:8080/api/v1/dags/train_example/dagRuns" \
-u admin:admin \
-H "Content-Type: application/json" \
-d '{"conf": {"run_id": "manual_1"}}'

Tip: To expand automation beyond ML workflows and include infrastructure and hosting operations, check out the AI Automated Hosting Operations tutorial.

CI/CD Pipeline for ML Platform with GitHub Actions

You can also set up a GitHub Actions workflow, which automates testing, containerization, and deployment of machine learning code. It runs tests, builds Docker images, pushes to a container registry, and can optionally trigger Airflow pipelines.

Create the GitHub Actions workflow directory and file with the following commands:

mkdir -p .github/workflows
sudo nano .github/workflows/ci.yml

Add the following config to the file:

name: ci
on:
  push:
    branches: [ "main" ]
jobs:
  test-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: python -m pip install -r ml/requirements.txt
      - run: python -m pip install pytest
      - run: pytest -q || true  # put real tests here

      # Build and push image with your registry (example: GHCR)
      - name: Log in to GHCR
        run: echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
      - name: Build
        run: docker build -t ghcr.io/${{ github.repository }}/ml-app:latest -f Dockerfile .
      - name: Push
        run: docker push ghcr.io/${{ github.repository }}/ml-app:latest

      # Optional: trigger Airflow DAG
      - name: Trigger Airflow
        run: |
          curl -X POST "http://YOUR_AIRFLOW_URL/api/v1/dags/train_example/dagRuns" \
            -u "${{ secrets.AIRFLOW_USER }}:${{ secrets.AIRFLOW_PASS }}" \
            -H "Content-Type: application/json" \
            -d '{"conf":{"source":"ci"}}'

If you prefer containerized tasks, you can create a simple Dockerfile to package your training code:

FROM python:3.11-slim
WORKDIR /app
COPY ml/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY ml/ /app/ml/
CMD ["python", "ml/train.py"]

ML Model Serving with NVIDIA Triton Inference Server

You can deploy machine learning models for high-performance inference using NVIDIA Triton. It provides a production-ready serving solution with support for multiple frameworks such as TensorFlow, PyTorch, ONNX, and optimized GPU inference through a scalable inference server.

You can start an NVIDIA Triton Inference Server with GPU support by using the following commands:

mkdir -p serving/models
# Put your model repo under serving/models/<model_name>/ (config.pbtxt inside)

docker run -d --name triton \
  --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $PWD/serving/models:/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver --model-repository=/models
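
Once the container is up, you can check readiness from Python with the tritonclient package (install it with pip install "tritonclient[http]"; my_model below is a placeholder for whatever model name you placed in the repository):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready: ", client.is_model_ready("my_model"))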

An alternative is to serve the model with a simple FastAPI application in Docker. It exposes RESTful endpoints for inference that can be containerized and deployed behind NGINX, offering a straightforward path to production without complex model conversion.
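
A minimal sketch of such a service is shown below. It assumes the model was logged by ml/train.py, that RUN_ID holds an MLflow run ID, and that the MLflow/MinIO environment variables from earlier are set so the artifact can be downloaded; the feature names must match your dataset.

# serve_fastapi.py
import os

import mlflow.sklearn
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.sklearn.load_model(f"runs:/{os.environ['RUN_ID']}/model")

class Features(BaseModel):
    values: dict  # e.g. {"feature_a": 1.2, "feature_b": 0.4}

@app.post("/predict")
def predict(payload: Features):
    df = pd.DataFrame([payload.values])
    return {"prediction": str(model.predict(df)[0])}

# Run with: uvicorn serve_fastapi:app --host 0.0.0.0 --port 8080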

Tip: For a deeper guide on configuring and optimizing PyTorch inference environments on virtual private servers, check PyTorch Model Inference Setup on VPS.

ML Platform Monitoring with Prometheus and Grafana

Monitoring the ML workflow is an essential task. This setup collects metrics from containers, hosts, and services using Prometheus, visualizes them in Grafana dashboards, and monitors resource usage, which gives you full visibility into system performance and model serving metrics.

Create the monitoring Compose file with the command below:

sudo nano ml-platform/compose/docker-compose.monitoring.yml

Add the following configuration to the file:

version: "3.9"
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    networks: [core]
    ports: ["9090:9090"]

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    privileged: true
    networks: [core]
    ports: ["5555:8080"]
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  grafana:
    image: grafana/grafana:latest
    networks: [core]
    ports: ["3000:3000"]
    volumes:
      - grafana:/var/lib/grafana

volumes:
  grafana:

networks:
  core:
    external: true

Then, create the Prometheus file with the command below:

sudo nano ml-platform/compose/prometheus.yml

Add the configuration to the file:

global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs: [{ targets: ['prometheus:9090'] }]
  - job_name: 'cadvisor'
    static_configs: [{ targets: ['cadvisor:8080'] }]
  - job_name: 'triton'
    static_configs: [{ targets: ['HOST_IP_OR_TRITON:8002'] }]

Navigate to the compose directory and bring up the monitoring:

cd ml-platform/compose
docker compose -f docker-compose.monitoring.yml up -d

Access the Grafana dashboard at the following URL and change the default admin credentials (the initial login is admin / admin):

http://server_ip:3000

You can connect Prometheus as your data source, and import pre-built dashboards to monitor Docker containers, host resources, and system performance in real-time.
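
As a starting point for programmatic checks, the sketch below queries the Prometheus HTTP API for per-container CPU usage reported by cAdvisor (adjust the host if you query remotely):

import requests

PROM_URL = "http://localhost:9090/api/v1/query"
query = "rate(container_cpu_usage_seconds_total[5m])"

resp = requests.get(PROM_URL, params={"query": query}, timeout=10)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    name = result["metric"].get("name") or result["metric"].get("id", "unknown")
    print(f"{name}: {float(result['value'][1]):.4f} CPU cores")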

For additional insights on maximizing performance and resource utilization, you can refer to Optimizing VPS Resource Allocation with AI.

That’s it. You have finished setting up an automated ML workflow on a dedicated server.

FAQs

What is an automated ML workflow?

An automated ML workflow uses tools like Airflow, MLflow, and DVC to create repeatable processes from data to deployment, which reduces errors and speeds up experimentation.

Why should I use dedicated infrastructure for ML services?

A dedicated server gives you full control over resource allocation, data governance, and cost optimization, which is perfect for ML services.

What’s the difference between MLflow and DVC?

MLflow manages experiments, model metrics, and artifact storage, while DVC tracks dataset and model versions linked to Git commits.

Conclusion

Building automated ML workflows on dedicated servers gives you the power to manage the entire ML platform from data collection to production deployment, with full visibility, reproducibility, and scalability. By using tools like Docker, Airflow, MLflow, DVC, and Prometheus, you create a powerful foundation for continuous experimentation and model delivery.

This tutorial was tested and deployed on PerLod Hosting, which offers optimized VPS and dedicated GPU servers for machine learning, AI automation, and high-performance computing.

We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest updates and articles in machine learning.
