Setting Up AIOps Tools for Server Monitoring
In this tutorial, we will set up an AIOps stack for server monitoring. An AIOps stack is a combination of tools that use AI and automation to monitor, analyze, and manage servers, which makes it well suited for detecting issues, predicting failures, and automating responses. In this guide, we will use the following tools for the AIOps stack:
- Metrics: Prometheus, Node Exporter, and Blackbox Exporter.
- Logs: Loki and Promtail.
- Visualization and alerting UI: Grafana.
- Alert routing: Alertmanager.
- AIOps: A small Python service that fetches Prometheus metrics, detects anomalies, and triggers Alertmanager via webhook.
In this guide, we use an Ubuntu 24.04 server from PerLod Hosting, where you can find support for server monitoring with AIOps.
Requirements for the AIOps Server Monitoring Stack
To complete the guide steps, you need a fresh Ubuntu 24.04 VM with 4 vCPU, 8 GB RAM, and 100 GB disk. Also, you must open the required firewall ports, including:
- Prometheus: 9090
- Alertmanager: 9093
- Loki: 3100
- Grafana: 3000
- SSH: 22
- HTTP/HTTPS: 80/443, if you use a reverse proxy.
Remember to set the correct timezone on your server:
sudo timedatectl set-timezone Asia/Dubai
Once you are done, proceed to the next step to install Docker and Docker Compose, which are used for setting up an AIOps stack.
Install Docker and Docker Compose For Setting up an AIOps Stack
Run the system update and install the required packages with the commands below:
sudo apt update
sudo apt install ca-certificates curl gnupg -y
Add Docker GPG key and repository to your server with the following commands:
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu noble stable" \
| sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
Again, run the system update and install Docker and Docker Compose with:
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y
You can add your user to the Docker group with the command below:
sudo usermod -aG docker $USER
Log out and log in again to apply the changes.
Create AIOps Stack Directory
You must create the AIOps stack directory and a subdirectory for each tool inside it. To do this, run the command below:
sudo mkdir -p /opt/aiops/{prometheus,alertmanager,grafana-provisioning/{datasources,dashboards},loki,promtail,blackbox,anomaly-detector}
Set the correct ownership for the AIOps stack directory with:
sudo chown -R $USER:$USER /opt/aiops
Switch to the AIOps stack directory:
cd /opt/aiops
Prometheus Configuration for AIOps Stack
Prometheus is the main monitoring engine. The first step is to create the Prometheus configuration file for server monitoring with the following command:
sudo nano /opt/aiops/prometheus/prometheus.yml
Add the following configuration to the file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["prometheus:9090"]

  - job_name: "node-exporters"
    static_configs:
      # Add your monitored servers' IPs (with :9100) here
      - targets: ["10.0.0.11:9100","10.0.0.12:9100"]

  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://your-api.example
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
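The relabeling in the blackbox-http job rewrites each target so Prometheus actually scrapes the Blackbox Exporter while keeping the probed URL as the instance label. A rough Python sketch of what those three rules do to one target's labels (illustrative only, not Prometheus's real relabeling engine):

```python
def relabel_blackbox(labels):
    """Mimic the three relabel_configs rules of the blackbox-http job."""
    labels = dict(labels)
    # 1) copy the scrape address into the ?target= query parameter
    labels["__param_target"] = labels["__address__"]
    # 2) expose the probed URL as the 'instance' label
    labels["instance"] = labels["__param_target"]
    # 3) point the actual scrape at the Blackbox Exporter container
    labels["__address__"] = "blackbox:9115"
    return labels

result = relabel_blackbox({"__address__": "https://example.com"})
print(result["instance"])     # https://example.com
print(result["__address__"])  # blackbox:9115
```

So Prometheus ends up calling blackbox:9115/probe?target=https://example.com, while alerts and graphs still show the probed URL as the instance.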
Also, you can create a simple alert rule in the Prometheus rule file:
sudo nano /opt/aiops/prometheus/rules/infra.yml
Add the following alert rule to the file:
groups:
  - name: infra
    rules:
      - alert: NodeDown
        expr: up{job="node-exporters"} == 0
        for: 2m
        labels: {severity: critical}
        annotations:
          summary: "Node down: {{ $labels.instance }}"
          description: "No scrape data for 2m."

      - alert: HighCPU
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels: {severity: warning}
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage >85% for 5m."

      - alert: DiskFilling
        expr: |
          (node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} - node_filesystem_free_bytes{fstype!~"tmpfs|overlay"})
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} > 0.9
        for: 10m
        labels: {severity: warning}
        annotations:
          summary: "Disk >90% on {{ $labels.instance }}"
          description: "Filesystem filling up."
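The HighCPU expression simply inverts the idle-CPU rate: if cores spend, on average, a fraction idle of each second idle, total usage is 100 minus idle times 100. A quick sanity check of that arithmetic:

```python
def cpu_usage_percent(avg_idle_rate):
    """avg_idle_rate: average per-second idle fraction, i.e. the value of
    avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))."""
    return 100 - avg_idle_rate * 100

# 10% idle means 90% busy, which would fire the >85% HighCPU rule after 5m
assert cpu_usage_percent(0.10) == 90.0
# 90% idle means 10% busy, well below the threshold
assert cpu_usage_percent(0.90) == 10.0
print(cpu_usage_percent(0.10))  # 90.0
```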
Tip: To learn more about Prometheus Linux server monitoring, you can check this guide on Monitoring a Linux host using Prometheus.
Alertmanager Configuration for AIOps Stack
The next step is to set up the Alertmanager for server monitoring, which handles notifications from Prometheus. Create the Alertmanager YAML file with:
sudo nano /opt/aiops/alertmanager/alertmanager.yml
Add the following configuration to the file. Replace the SMTP settings with real credentials, or remove the email receiver if you do not need it yet:
route:
  receiver: default
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

receivers:
  - name: default
    email_configs:
      - to: "[email protected]"
        from: "[email protected]"
        smarthost: "smtp.example.com:587"
        auth_username: "[email protected]"
        auth_identity: "[email protected]"
        auth_password: "CHANGE_ME"
    webhook_configs:
      - url: "http://anomaly-detector:8080/alertmanager" # our AIOps webhook
        send_resolved: true
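With group_by set to alertname and instance, Alertmanager batches alerts that share both labels into a single notification instead of sending one message per alert. A minimal sketch of that grouping idea (illustrative only, not Alertmanager's implementation):

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("alertname", "instance")):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return groups

alerts = [
    {"labels": {"alertname": "HighCPU", "instance": "10.0.0.11:9100"}},
    {"labels": {"alertname": "HighCPU", "instance": "10.0.0.11:9100",
                "severity": "warning"}},
    {"labels": {"alertname": "NodeDown", "instance": "10.0.0.12:9100"}},
]
groups = group_alerts(alerts)
print(len(groups))  # 2 -> two notifications instead of three
```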
Loki and Promtail Log Configuration for AIOps Stack
Loki is a log aggregation system, and Promtail is its log collector. Promtail gathers logs from your server and sends them to Loki, which makes it easy to search logs in Grafana alongside metrics.
Create the Loki YAML file with the following command:
sudo nano /opt/aiops/loki/config.yml
Add the following configuration to it:
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/boltdb-cache
  filesystem:
    directory: /loki/chunks

limits_config:
  ingestion_burst_size_mb: 64
  ingestion_rate_mb: 32
  max_cache_freshness_per_query: 10m

chunk_store_config:
  max_look_back_period: 720h

table_manager:
  retention_deletes_enabled: true
  retention_period: 720h
Also, create the Promtail file with the command below:
sudo nano /opt/aiops/promtail/config.yml
Add the following config to the file:
server:
  http_listen_port: 9080

positions:
  filename: /positions/positions.yml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: system-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: ${HOSTNAME}
          __path__: /var/log/*.log
Configure Blackbox Exporter for AIOps Stack
Blackbox Exporter checks external endpoints like APIs or websites using HTTP probes, which helps you monitor the availability and response of your web services.
At this point, you can create the Blackbox exporter YAML file with the following command:
sudo nano /opt/aiops/blackbox/blackbox.yml
Add the following configuration to the file:
modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: ip4
      follow_redirects: true
      fail_if_not_ssl: false
      fail_if_body_not_matches_regexp:
        - ".+"
Grafana Configuration for AIOps Stack
You can connect Grafana to Prometheus and Loki as data sources, and load dashboards to visualize server performance, logs, and alerts in one place.
To create the Grafana data sources YAML file, run the command below:
sudo nano /opt/aiops/grafana-provisioning/datasources/datasource.yml
Add the following configuration to the file:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    access: proxy
    isDefault: true

  - name: Loki
    type: loki
    url: http://loki:3100
    access: proxy
    jsonData:
      maxLines: 1000
Optional Note: You can drop JSON dashboards in grafana-provisioning/dashboards/ and add a dashboards.yml provisioning file.
Create the Grafana dashboards YAML file with:
sudo nano /opt/aiops/grafana-provisioning/dashboards/dashboards.yml
Add the following config to the file:
apiVersion: 1

providers:
  - name: 'Default'
    orgId: 1
    folder: 'AIOps'
    type: file
    options:
      path: /etc/grafana/dashboards
AIOps Microservice: Anomaly Detector Configuration
This custom Python microservice analyzes Prometheus metrics in real time using STL decomposition and z-score detection to identify anomalies. When it detects unusual behavior, it sends a synthetic alert to Alertmanager automatically.
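The core idea is simple: after STL removes trend and seasonality, the residuals should hover around zero, so a point whose residual z-score exceeds a threshold (3.5 in the service below) is flagged as anomalous. A dependency-free sketch of just the z-score step, assuming the residuals have already been detrended (the real service uses STL residuals):

```python
def zscore_outlier(residuals, threshold=3.5):
    """Flag the last residual if it sits more than `threshold`
    standard deviations from the mean of all residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    std = (sum((r - mean) ** 2 for r in residuals) / n) ** 0.5 + 1e-9
    z = (residuals[-1] - mean) / std
    return abs(z) > threshold, z

noise = [0.1, -0.1] * 15                      # well-behaved residuals
flagged, _ = zscore_outlier(noise + [5.0])    # sudden spike at the end
ok, _ = zscore_outlier(noise + [0.05])        # normal final point
print(flagged, ok)  # True False
```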
Create the requirements file with the following command:
sudo nano /opt/aiops/anomaly-detector/requirements.txt
Add the following requirements to the file:
flask==3.0.3
requests==2.32.3
pandas==2.2.3
numpy==2.1.1
statsmodels==0.14.3
Then, create the Anomaly detector script file with:
sudo nano /opt/aiops/anomaly-detector/app.py
Add the following script to the file:
from flask import Flask, request, jsonify
import os, time, requests, json
import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import STL

PROM = os.getenv("PROM_URL", "http://prometheus:9090")
ALERTMAN = os.getenv("ALERTMAN_URL", "http://alertmanager:9093/api/v2/alerts")
QUERY = os.getenv("PROM_QUERY", '100 - (avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))*100)')

app = Flask(__name__)

def fetch_series(minutes=120, step="30s"):
    end = int(time.time())
    start = end - minutes * 60
    url = f"{PROM}/api/v1/query_range"
    r = requests.get(url, params={"query": QUERY, "start": start, "end": end, "step": step}, timeout=30)
    r.raise_for_status()
    return r.json()

def stl_anomaly(values):
    # values: list of [timestamp, value]
    if len(values) < 60:
        return None
    ts = pd.Series([float(v[1]) for v in values])
    # STL requires a period; 60 points ~ 30 min if step=30s; tune as needed
    stl = STL(ts, period=60, robust=True).fit()
    resid = stl.resid
    z = (resid - resid.mean()) / (resid.std() + 1e-9)
    # flag the last point if it is an outlier
    if abs(z.iloc[-1]) > 3.5:
        return float(ts.iloc[-1]), float(z.iloc[-1])
    return None

def push_alert(instance, value, zscore):
    payload = [{
        "labels": {
            "alertname": "AIOpsDetectedAnomaly",
            "severity": "warning",
            "instance": instance
        },
        "annotations": {
            "summary": f"AIOps anomaly on {instance}",
            "description": f"Value={value:.2f}, z-score={zscore:.2f} on query: {QUERY}"
        }
    }]
    rr = requests.post(ALERTMAN, data=json.dumps(payload), headers={"Content-Type": "application/json"}, timeout=10)
    rr.raise_for_status()

@app.route("/run", methods=["POST", "GET"])
def run():
    data = fetch_series()
    if data.get("status") != "success":
        return jsonify({"status": "error", "msg": data}), 500
    result = data["data"]["result"]
    anomalies = []
    for series in result:
        metric = series.get("metric", {})
        inst = metric.get("instance", "unknown")
        values = series.get("values", [])
        res = stl_anomaly(values)
        if res:
            val, z = res
            push_alert(inst, val, z)
            anomalies.append({"instance": inst, "value": val, "z": z})
    return jsonify({"status": "ok", "anomalies": anomalies})

# webhook endpoint (optional: receive resolved alerts or enrich)
@app.route("/alertmanager", methods=["POST"])
def inbound():
    _ = request.json
    return jsonify({"status": "received"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
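Before wiring the container into the stack, you can sanity-check the alert payload shape offline. The snippet below mirrors the payload construction in push_alert above and verifies that it serializes to the list-of-alerts JSON that Alertmanager's v2 API expects:

```python
import json

def build_payload(instance, value, zscore, query):
    # Mirrors push_alert() in app.py: Alertmanager v2 expects a JSON array of alerts
    return [{
        "labels": {
            "alertname": "AIOpsDetectedAnomaly",
            "severity": "warning",
            "instance": instance,
        },
        "annotations": {
            "summary": f"AIOps anomaly on {instance}",
            "description": f"Value={value:.2f}, z-score={zscore:.2f} on query: {query}",
        },
    }]

payload = build_payload("10.0.0.11:9100", 93.4, 4.2, "cpu_query")
body = json.dumps(payload)
print(json.loads(body)[0]["labels"]["alertname"])  # AIOpsDetectedAnomaly
```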
Now create the Dockerfile for the anomaly detector:
sudo nano /opt/aiops/anomaly-detector/Dockerfile
Add the following config to the file:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
ENV PROM_URL=http://prometheus:9090
ENV ALERTMAN_URL=http://alertmanager:9093/api/v2/alerts
ENV PROM_QUERY='100 - (avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))*100)'
EXPOSE 8080
CMD ["python","/app/app.py"]
Set up AIOps Stack Docker Compose File
At this point, you can easily create the whole stack in a Docker Compose file. Create the file with the command below:
sudo nano /opt/aiops/docker-compose.yml
Add the following config to the file:
services:
  prometheus:
    image: prom/prometheus:v2.55.1
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.enable-lifecycle"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prom_data:/prometheus
    ports: ["9090:9090"]
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports: ["9093:9093"]
    restart: unless-stopped

  loki:
    image: grafana/loki:2.9.8
    command: ["-config.file=/etc/loki/config.yml"]
    volumes:
      - ./loki/config.yml:/etc/loki/config.yml:ro
      - loki_data:/loki
    ports: ["3100:3100"]
    restart: unless-stopped

  promtail:
    image: grafana/promtail:2.9.8
    command: ["-config.file=/etc/promtail/config.yml", "-config.expand-env=true"]
    volumes:
      - ./promtail/config.yml:/etc/promtail/config.yml:ro
      - /var/log:/var/log:ro
      - promtail_positions:/positions
    restart: unless-stopped

  blackbox:
    image: prom/blackbox-exporter:v0.25.0
    command: ["--config.file=/etc/blackbox/blackbox.yml"]
    volumes:
      - ./blackbox/blackbox.yml:/etc/blackbox/blackbox.yml:ro
    ports: ["9115:9115"]
    restart: unless-stopped

  grafana:
    image: grafana/grafana:11.2.0
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=ChangeMe!
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana-provisioning/datasources:/etc/grafana/provisioning/datasources:ro
      - ./grafana-provisioning/dashboards:/etc/grafana/dashboards
      - ./grafana-provisioning/dashboards/dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml:ro
    ports: ["3000:3000"]
    restart: unless-stopped

  anomaly-detector:
    build: ./anomaly-detector
    environment:
      - PROM_URL=http://prometheus:9090
      - ALERTMAN_URL=http://alertmanager:9093/api/v2/alerts
      # You can override the query to any time series you want to monitor with AIOps
      - PROM_QUERY=100 - (avg by (instance)(rate(node_cpu_seconds_total{mode="idle"}[5m]))*100)
    ports: ["8080:8080"]
    restart: unless-stopped

volumes:
  prom_data:
  grafana_data:
  loki_data:
  promtail_positions:
Once you are done, build the anomaly detector image and start the whole stack with the following commands:
docker compose build anomaly-detector
docker compose up -d
Check if the service is up and running with the command below:
docker compose ps
Install Node Exporter on Each Server To Monitor
At this point, you must set up Node exporter on each server target you want to monitor. It collects CPU, memory, disk, and network metrics and exposes them on port 9100, and then Prometheus scrapes these metrics for analysis and alerting.
To do this, you can run the following commands:
NODE_EXPORTER_VERSION=1.8.2
cd /tmp
curl -LO https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo useradd --no-create-home --shell /usr/sbin/nologin nodeexp || true
Then, create the systemd service for Node exporter with the command below:
sudo tee /etc/systemd/system/node_exporter.service >/dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
User=nodeexp
Group=nodeexp
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
Enable and start the service with the command below:
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
Verify the 9100 port:
sudo ss -lntp | grep 9100
Make sure the Prometheus server can reach server_ip:9100.
Verifying AIOps Stack and Creating Dashboards
It is recommended to confirm that every component of the AIOps stack is running correctly. To verify your setup, open the Prometheus Web UI by navigating to the URL below:
http://your-server-ip:9090
From there, go to Status and then Targets, and make sure all targets show as UP. Also, check the Blackbox Exporter targets to confirm that your probed pages are listed and show as UP.
For Loki, navigate to the following URL:
http://your-server-ip:3100/ready
It must return ready, which means the service is active.
Then, navigate to the Grafana UI and log in as admin with the password set via GF_SECURITY_ADMIN_PASSWORD in the Compose file:
http://your-server-ip:3000
In Grafana's Explore section, select Prometheus and run the query up to confirm metrics are available, then switch to Loki and run the query {job="varlogs"} to verify logs are being collected correctly.
To quickly visualize your metrics, you can create a dashboard in Grafana.
Open Grafana, go to Dashboards and then Import, and enter the dashboard ID 1860 (Node Exporter Full) from Grafana.com to get a detailed view of your servers’ performance.
Alternatively, you can place your own JSON dashboard files in the /opt/aiops/grafana-provisioning/dashboards/ directory, and they’ll automatically appear under the AIOps folder when Grafana starts.
Test Alerts and Validate AIOps Anomalies
After setting up the AIOps stack, it’s important to test that alerts and anomaly detection are working correctly. You can simulate different scenarios to confirm that Prometheus, Alertmanager, and the AIOps anomaly detector respond as expected.
For example, stop the Node Exporter service on a monitored server to test a NodeDown alert after about two minutes.
You can also manually test the AIOps anomaly detection by running the command below:
curl -X POST http://your-server-ip:8080/run
If the last metric point is an outlier, the detector pushes an AIOpsDetectedAnomaly alert to Alertmanager.
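You can also bypass the detector entirely and post a synthetic alert straight to Alertmanager's v2 API to confirm routing and email/webhook delivery. A minimal sketch using only the standard library; the host your-server-ip is a placeholder you must replace:

```python
import json
import urllib.request

# Placeholder: replace your-server-ip with your Alertmanager address
ALERTMANAGER = "http://your-server-ip:9093/api/v2/alerts"

def build_test_alert(instance="test-host"):
    # Alertmanager's v2 API accepts a JSON array of alert objects
    return [{
        "labels": {"alertname": "ManualTestAlert", "severity": "warning",
                   "instance": instance},
        "annotations": {"summary": "Manual routing test"},
    }]

def post_alerts(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status  # 200 means Alertmanager accepted the alert

# post_alerts(ALERTMANAGER, build_test_alert())  # uncomment to send
```

After sending, the alert should appear in the Alertmanager UI on port 9093 and be delivered to your configured receiver.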
You can automate this process with a cron job, sidecar container, or Grafana alert rule to call /run periodically and keep continuous anomaly checks active.
That’s it, you are done with setting up the AIOps Stack for server monitoring.
FAQs
What is AIOps, and why is it important?
AIOps uses machine learning and data analysis to automatically detect problems, find their causes, and connect related events. It helps reduce unnecessary alerts, fix issues faster, and predict potential system problems before they happen.
How does the AIOps anomaly detector work?
The detector regularly checks Prometheus metrics, uses STL and z-score analysis to find unusual patterns, and flags any outliers. When it detects an anomaly, it automatically sends an alert to Alertmanager, which can notify your team through email, Slack, or a webhook.
How can I automate the AIOps stack setup for multiple environments?
You can turn this stack into Ansible roles to automate deployments, Helm charts to run it on Kubernetes, and Terraform modules to manage infrastructure automatically with code.
Conclusion
Setting up an AIOps stack for server monitoring provides proactive insights, automatic anomaly detection, and clear visibility across your entire infrastructure. By combining Prometheus, Grafana, Loki, Alertmanager, and a lightweight Python anomaly detector, you can monitor system health and performance in real time and receive smart alerts.
We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest server monitoring guides and tips.
For optimal performance and reliability of your AIOps monitoring stack, it is recommended to use Reliable Dedicated Servers or Flexible VPS hosting.
For further reading:
Learn Server Memory Disaggregation