
Linux Server Performance Monitoring with Prometheus
Monitoring server performance is an essential step to ensure reliability, stability, and efficient resource usage. Prometheus is an open-source monitoring and alerting toolkit widely used for this purpose. This step-by-step guide covers Linux Server Performance Monitoring with Prometheus, Node Exporter, and Alertmanager.
Prometheus can track key metrics such as CPU usage, memory consumption, disk I/O, and network activity. These metrics are usually exposed by exporters like Node Exporter, which integrates easily with Prometheus.
You can practice this setup on any Linux server or cloud VM. If you don’t have one yet, you can quickly spin up a server from PerLod Hosting, which provides ready-to-use Linux environments ideal for monitoring labs.
What You Will Build: Linux Server Performance Monitoring with Prometheus
Before we dive into the steps, here is what you will build in this guide.
- Prometheus: The time-series database that collects and stores metrics.
- Node Exporter: Runs on servers to expose CPU, memory, disk, and network stats.
- Recording Rules: Save heavy queries as ready-to-use metrics.
- Alertmanager: Routes alerts to Slack, email, and other receivers, and groups, deduplicates, and silences them.
- Blackbox Exporter (optional): Checks websites, APIs, or endpoints.
- Grafana (optional): Dashboards to visualize everything.
Prometheus, Alertmanager, Grafana, and Blackbox will be set up on the Prometheus host (monitoring server). Node Exporter will be set up on every target server.
Prerequisites for Performance Monitoring with Prometheus
Before installing Prometheus, you must make sure your server is ready. If these basics are missing, the setup may fail or give wrong results.
1. A VM or host for Prometheus with at least 2 vCPUs, 4 GB RAM, and 20–50 GB of SSD/NVMe storage for the TSDB and WAL to start.
2. Root or Sudo access.
3. Open ports:
- Prometheus: 9090
- Alertmanager: 9093
- Node Exporter: 9100
- Blackbox Exporter: 9115
4. Time sync (chrony or systemd-timesyncd): Metrics are timestamped, so clocks must be accurate.
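As a quick optional sanity check before you begin, you can confirm that the clock is synchronized and that the required ports are not already in use. This is a rough sketch; the exact output depends on your distribution and on whether chrony or systemd-timesyncd is installed:
# Check that the system clock is synchronized (look for "System clock synchronized: yes")
timedatectl status
# If chrony is installed, this shows the current sync status
chronyc tracking || true
# Confirm nothing is already listening on the ports Prometheus and its exporters need
sudo ss -tlnp | grep -E ':(9090|9093|9100|9115)\b' || echo "Ports 9090/9093/9100/9115 are free"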
Set up Prometheus for Testing on One Machine
This step is for testing. You can set up Prometheus, Node Exporter, Alertmanager, and Grafana quickly using Docker Compose, all on one machine. It’s not production-ready, but it lets you learn fast.
Create a Docker Compose YAML file with your desired text editor:
nano docker-compose.yml
Add the following configuration to the file. Key parts include:
- image: prom/prometheus:latest. It tells Docker to use the Prometheus container image.
- ports: ["9090:9090"]. It makes the Prometheus web UI available on your host machine's port 9090.
- volumes: It mounts configs from your host to the container.
- command: It passes flags to the Prometheus process.
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus:/etc/prometheus
      - promdata:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle
      - --storage.tsdb.wal-compression
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    network_mode: host
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports: ["9093:9093"]
    volumes:
      - ./alertmanager:/etc/alertmanager
    command: ["--config.file=/etc/alertmanager/alertmanager.yml"]
    restart: unless-stopped

  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox
    ports: ["9115:9115"]
    volumes:
      - ./blackbox:/etc/blackbox_exporter
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

volumes:
  promdata: {}
  grafana_data: {}
After creating the minimal configs that the volumes above reference (./prometheus/prometheus.yml, ./alertmanager/alertmanager.yml, and ./blackbox/blackbox.yml), you can start the stack with Docker Compose:
docker compose up -d
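Once the containers are up, you can quickly confirm that everything started. The commands below are a minimal check, assuming the default ports from the Compose file above:
docker compose ps
curl -fsS http://localhost:9090/-/healthy
curl -fsS http://localhost:9100/metrics | head -n 5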
Tip: For production, pin image tags to specific versions after testing.
Install Prometheus Natively on Linux (Systemd)
In this step, you can install Prometheus natively on Linux, managed by systemd. This is more reliable for production than Docker.
Create a dedicated user and directories on your Prometheus host with the following commands:
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus || true
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
Then, download Prometheus with the following commands. The version below is an example; replace it with the latest stable release:
VER="2.53.0"
cd /tmp
curl -fL -O https://github.com/prometheus/prometheus/releases/download/v${VER}/prometheus-${VER}.linux-amd64.tar.gz
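Optionally, you can verify the download against the checksums published with the release. This assumes the sha256sums.txt asset that Prometheus releases normally include:
curl -fL -O https://github.com/prometheus/prometheus/releases/download/v${VER}/sha256sums.txt
sha256sum -c --ignore-missing sha256sums.txt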
Then, extract and install Prometheus with the following commands:
sudo tar -xzf prometheus-${VER}.linux-amd64.tar.gz -C /tmp
cd /tmp/prometheus-${VER}.linux-amd64
sudo install -o root -g root -m 0755 prometheus promtool /usr/local/bin/
sudo cp -r console_libraries consoles /etc/prometheus/
Next, you must configure a basic Prometheus YAML file. Create the file with your desired text editor:
sudo nano /etc/prometheus/prometheus.yml
Add the following basic configuration to the file:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

rule_files:
  - /etc/prometheus/rules.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node_exporter
    static_configs:
      # Add all your servers here (replace with real IPs/hosts)
      - targets: [
          "localhost:9100"
        ]
The global scrape_interval: 15s setting means Prometheus scrapes each target every 15 seconds.
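Before starting the service, you can validate the configuration with promtool. Note that it also checks any files listed under rule_files, so if /etc/prometheus/rules.yml does not exist yet, expect a warning and re-run the check after the recording-rules step later in this guide:
promtool check config /etc/prometheus/prometheus.yml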
To run Prometheus as a service, you must create the systemd unit file:
sudo nano /etc/systemd/system/prometheus.service
Add the following config to the file:
[Unit]
Description=Prometheus TSDB
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression \
  --web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target
Then, start and enable the service with the following commands:
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus --no-pager -l
To verify that Prometheus is up and to check the state of its scrape targets, navigate to the following URL (the node_exporter target will show as down until you install Node Exporter in the next step):
http://PROMETHEUS_HOST:9090/targets
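If you prefer the command line, the same target information is available from the HTTP API. The jq filter below is optional and assumes jq is installed:
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, instance: .labels.instance, health: .health}'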
Install Node Exporter on Every Linux Server You Want to Monitor
Now you must install Node Exporter on every Linux server you want to monitor. It collects CPU, memory, disk, and network metrics and exposes them on port 9100.
First, create a non-login user for Node exporter with the following command:
sudo useradd --no-create-home --shell /usr/sbin/nologin nodeusr || true
Then, download the latest binary package of Node Exporter. This is an example version; update it if needed.
VER="1.8.1"
cd /tmp
curl -fL -O https://github.com/prometheus/node_exporter/releases/download/v${VER}/node_exporter-${VER}.linux-amd64.tar.gz
Extract the downloaded file and install it with the following commands:
sudo tar -xzf node_exporter-${VER}.linux-amd64.tar.gz -C /usr/local/bin --strip-components=1 node_exporter-${VER}.linux-amd64/node_exporter
sudo chown root:root /usr/local/bin/node_exporter
To run Node Exporter as a service, create the systemd unit file:
sudo nano /etc/systemd/system/node_exporter.service
Add the following content to it:
[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.tcpstat \
  --collector.processes \
  --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|run|var/lib/docker/.+|snap)($$|/)"
Restart=on-failure

[Install]
WantedBy=multi-user.target
Then, start and enable Node exporter with:
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter --no-pager -l
To test the metrics endpoint locally, you can run:
curl -fsS http://localhost:9100/metrics | head
It shows the first lines of metrics to confirm it works.
Now you can add this host’s IP:9100 to the Prometheus node_exporter job’s targets list and reload Prometheus. Remember to allow TCP 9100 from the Prometheus server only.
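How you restrict port 9100 depends on your firewall. As an example sketch with ufw (replace PROMETHEUS_IP with your monitoring server's address, and use the firewalld or iptables equivalents if that is what your distribution ships):
sudo ufw allow from PROMETHEUS_IP to any port 9100 proto tcp
sudo ufw deny 9100/tcp
sudo ufw status numbered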
Set up Blackbox Exporter (Optional)
While Node Exporter shows internal health like CPU, disk, and memory, Blackbox Exporter checks from the “outside”. For example, can your website be reached? Can it be pinged?
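If you are running the stack natively rather than with Docker Compose, you can install Blackbox Exporter on the Prometheus host in the same way as the other components. The version below is an example, and you will also need a small systemd unit (analogous to the Node Exporter one) that starts /usr/local/bin/blackbox_exporter with --config.file=/etc/blackbox_exporter/blackbox.yml:
sudo useradd --no-create-home --shell /usr/sbin/nologin blackbox || true
VER="0.25.0"   # example version; update as needed
cd /tmp
curl -fL -O https://github.com/prometheus/blackbox_exporter/releases/download/v${VER}/blackbox_exporter-${VER}.linux-amd64.tar.gz
sudo tar -xzf blackbox_exporter-${VER}.linux-amd64.tar.gz -C /usr/local/bin --strip-components=1 blackbox_exporter-${VER}.linux-amd64/blackbox_exporter
sudo mkdir -p /etc/blackbox_exporter
sudo chown -R blackbox:blackbox /etc/blackbox_exporter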
Create the Blackbox config file:
sudo nano /etc/blackbox_exporter/blackbox.yml
Add the following configuration to the file:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: "ip4"
  icmp:
    prober: icmp
    timeout: 3s
- http_2xx: Checks if an HTTP endpoint returns a success code (200).
- icmp: Checks if a host responds to ping.
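Once Blackbox Exporter is running with this config, you can test a module manually before wiring it into Prometheus. For example (replace the target with a URL you actually want to probe):
curl -s "http://localhost:9115/probe?module=http_2xx&target=https://example.com/" | grep -E '^probe_(success|http_status_code|duration_seconds)'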
Then, you must add this Prometheus scrape job for Blackbox under scrape_configs in the prometheus.yml file:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com/
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # blackbox exporter address
Signals you must watch for include:
probe_success, probe_http_status_code, probe_duration_seconds, probe_dns_lookup_time_seconds, probe_tcp_connection_duration_seconds
Prometheus Recording Rules (Pre-computed Metrics)
Prometheus queries can get heavy, especially with rate() or avg by. Recording rules save results into new “pre-computed” metrics, so queries and dashboards are faster.
Create the recording rules file on the Prometheus host:
sudo nano /etc/prometheus/rules.yml
Then, add the following rules to the file:
groups:
  - name: sre-core
    interval: 15s
    rules:
      # ==== CPU UTILIZATION / SATURATION ====
      - record: node:cpu_utilization:avg5m
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
      - record: node:cpu_iowait:avg5m
        expr: 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))
      - record: node:cpu_steal:avg5m
        expr: 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m]))

      # CPU load vs core count (load > cores suggests saturation)
      - record: node:load1_per_core
        expr: node_load1 / count by (instance) (node_cpu_seconds_total{mode="system"})

      # ==== MEMORY PRESSURE ====
      - record: node:memory_used_percent
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

      # ==== DISK UTILIZATION & LATENCY ====
      - record: node:disk_util_percent
        expr: >
          100 * max by (instance, device) (
            rate(node_disk_io_time_seconds_total{device!~"loop|ram|fd|sr.*"}[5m])
          )
      - record: node:disk_read_latency_ms
        expr: >
          1000 * (rate(node_disk_read_time_seconds_total{device!~"loop|ram|fd|sr.*"}[5m])
          / rate(node_disk_reads_completed_total{device!~"loop|ram|fd|sr.*"}[5m]))
      - record: node:disk_write_latency_ms
        expr: >
          1000 * (rate(node_disk_write_time_seconds_total{device!~"loop|ram|fd|sr.*"}[5m])
          / rate(node_disk_writes_completed_total{device!~"loop|ram|fd|sr.*"}[5m]))

      # ==== FILESYSTEM CAPACITY ====
      - record: node:fs_used_percent
        expr: >
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"}
          / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"})

      # ==== NETWORK THROUGHPUT & ERRORS ====
      - record: node:net_rx_bytes_per_s
        expr: sum by (instance, device) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
      - record: node:net_tx_bytes_per_s
        expr: sum by (instance, device) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))
      - record: node:tcp_retrans_per_s
        expr: rate(node_netstat_Tcp_RetransSegs[5m])

      # ==== BLACKBOX (if used) ====
      - record: blackbox:http_availability
        expr: avg by (instance) (probe_success)
      - record: blackbox:http_duration_seconds
        expr: avg by (instance) (probe_duration_seconds)
Check if the rules file is valid with the command below:
promtool check rules /etc/prometheus/rules.yml
Next, run the following command to reload the config without restarting:
curl -X POST http://localhost:9090/-/reload
Or, you can run:
sudo systemctl reload prometheus
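After the reload, you can confirm that the recording rules were loaded and are producing data. The jq filters are optional; without them you simply get the raw JSON:
# List loaded rule groups and their rule names
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | {name: .name, rules: [.rules[].name]}'
# Query one of the pre-computed metrics
curl -s 'http://localhost:9090/api/v1/query?query=node:cpu_utilization:avg5m' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'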
Set up Alerts for Prometheus that Catch Bottlenecks Early
Alerts notify you when something’s wrong, such as high CPU, a full disk, or a host being down. Prometheus evaluates the alert rules and sends firing alerts to Alertmanager.
Create the alerts file, and then list it under rule_files in prometheus.yml alongside /etc/prometheus/rules.yml:
sudo nano /etc/prometheus/alerts.yml
Add the following alerts to the file:
groups:
  - name: node-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Instance down ({{ $labels.instance }})"
          description: "No scrape targets responding for 5m."

      - alert: HighCPU
        expr: node:cpu_utilization:avg5m > 85
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU > 85% (avg 5m) for 10m; check hot processes and scaling."

      - alert: HighCPU_IOWait
        expr: node:cpu_iowait:avg5m > 10
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "High IO wait on {{ $labels.instance }}"
          description: "CPU waiting on disk > 10% for 10m; suspect disk bottleneck."

      - alert: LoadExceedsCores
        expr: node:load1_per_core > 1.0
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "CPU saturation on {{ $labels.instance }}"
          description: "Load1 per core > 1 indicates runnable queue backlog."

      - alert: MemoryPressure
        expr: node:memory_used_percent > 90
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: "Available memory < 10% for 10m; check caches/process leaks."

      - alert: DiskUtilHigh
        expr: node:disk_util_percent > 80
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Disk busy on {{ $labels.instance }} ({{ $labels.device }})"
          description: "Disk io_time > 80% for 10m; investigate latency & queue."

      - alert: DiskLatencyHigh
        expr: (node:disk_read_latency_ms > 50) or (node:disk_write_latency_ms > 50)
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Disk latency high on {{ $labels.instance }}"
          description: "Average disk latency > 50 ms; suspect underlying storage."

      - alert: FilesystemFilling
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"}[6h], 4*3600) < 0
        for: 15m
        labels: { severity: warning }
        annotations:
          summary: "Filesystem filling soon on {{ $labels.instance }}"
          description: "Projected to fill in < 4h. Act before outage."

      - alert: NetworkErrors
        expr: rate(node_network_receive_errs_total[5m]) > 0 or rate(node_network_transmit_errs_total[5m]) > 0
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "NIC errors on {{ $labels.instance }}"
          description: "Persistent NIC errors/drops; check cabling, MTU, driver."

  - name: blackbox-alerts
    rules:
      - alert: EndpointDown
        expr: blackbox:http_availability < 1
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Endpoint down ({{ $labels.instance }})"
          description: "Blackbox probe failing."

      - alert: SlowEndpoint
        expr: blackbox:http_duration_seconds > 1
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Slow endpoint ({{ $labels.instance }})"
          description: "End-to-end latency >1s; check DNS/TCP/SSL/app."
You can reload and validate the configuration with the following commands:
promtool check rules /etc/prometheus/alerts.yml
curl -X POST http://localhost:9090/-/reload
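You can then check which alert rules Prometheus has loaded and whether any are currently firing, either on the /alerts page of the web UI or via the API (jq optional):
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state}'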
Set up Alertmanager for Prometheus
Prometheus fires alerts, but Alertmanager decides what to do with them: for example, sending them to Slack or email, grouping similar alerts, and silencing alerts temporarily.
Create the user, download, and set up Alertmanager with the following commands:
sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager || true
VER="0.27.0" #example version; update as needed
cd /tmp
curl -fL -O https://github.com/prometheus/alertmanager/releases/download/v${VER}/alertmanager-${VER}.linux-amd64.tar.gz
sudo tar -xzf alertmanager-${VER}.linux-amd64.tar.gz -C /tmp
cd /tmp/alertmanager-${VER}.linux-amd64
sudo install alertmanager amtool /usr/local/bin/
sudo mkdir -p /etc/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager
Then, create the Alertmanager YAML file:
sudo nano /etc/alertmanager/alertmanager.yml
Add the following content to the file with Slack and Email examples:
global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 3m
  repeat_interval: 4h

receivers:
  - name: default
    slack_configs:
      - send_resolved: true
        api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        title: "{{ .CommonLabels.alertname }}: {{ .CommonLabels.instance }}"
        text: "{{ range .Alerts }}*{{ .Annotations.summary }}*\n{{ .Annotations.description }}\n{{ end }}"
    # Replace the placeholder addresses and SMTP credentials with your own
    email_configs:
      - to: "[email protected]"
        from: "[email protected]"
        smarthost: "smtp.example.com:587"
        auth_username: "[email protected]"
        auth_identity: "[email protected]"
        auth_password: "REDACTED"
To run Alertmanager as a service, create a systemd unit file:
sudo nano /etc/systemd/system/alertmanager.service
Add this to the file:
[Unit]
Description=Prometheus Alertmanager
After=network-online.target
Wants=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target
Then start and enable Alertmanager with the following commands:
sudo mkdir -p /var/lib/alertmanager && sudo chown alertmanager:alertmanager /var/lib/alertmanager
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager --no-pager -l
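To make sure Alertmanager accepted the configuration and that routing works, you can validate the file with amtool and send a throwaway test alert. The alert name and labels below are just examples:
amtool check-config /etc/alertmanager/alertmanager.yml
# Fire a synthetic alert to confirm delivery to your receiver
amtool alert add TestAlert severity=warning instance=manual-test \
  --annotation=summary="Test alert from amtool" \
  --alertmanager.url=http://localhost:9093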
Add Grafana Dashboard (Optional)
Grafana gives you nice dashboards and graphs. Install Grafana from the official repository (a sample install for Debian/Ubuntu follows this list). Then:
- Point a Prometheus data source at http://PROMETHEUS_HOST:9090.
- Import a community Node Exporter dashboard and a Blackbox dashboard for quick visibility.
- Build panels with the recording rules to keep dashboards fast.
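As a sketch for Debian/Ubuntu hosts using Grafana's APT repository (other distributions use the RPM repository instead; check Grafana's documentation if the repository details have changed):
sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
sudo systemctl enable --now grafana-server
# Grafana listens on port 3000 by default; log in and change the initial admin password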
Useful PromQL Queries
PromQL is Prometheus’s query language. This step shows useful queries for CPU, disk, memory, and more. All queries are aggregated by instance, so you can see which server is hot.
CPU:
# Total CPU utilization (%): idle → busy
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
# I/O wait & steal time (%):
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m]))
# Load per core (>1 means runnable queue backlog)
node_load1 / count by (instance) (node_cpu_seconds_total{mode="system"})
Memory:
# Memory used percent (%):
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
Disk:
# Disk busy (% of time doing I/O): per device
100 * max by (instance, device) (rate(node_disk_io_time_seconds_total{device!~"loop|ram|fd|sr.*"}[5m]))
# Average latency (ms) per op:
1000 * (rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]))
1000 * (rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]))
# Filesystem usage (%): exclude tmpfs/overlay
100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"}
/ node_filesystem_size_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"})
Network:
# Throughput per NIC (bytes/s):
sum by (instance, device) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
sum by (instance, device) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))
# Errors/drops & TCP retransmits (packets/s):
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
rate(node_network_receive_drop_total[5m])
rate(node_network_transmit_drop_total[5m])
rate(node_netstat_Tcp_RetransSegs[5m])
Blackbox (E2E):
# Availability and timing
avg by (instance) (probe_success)
max by (instance, phase) (probe_duration_seconds)
Check Bottlenecks Systematically: USE And RED
Two troubleshooting frameworks, USE and RED, help you check for bottlenecks systematically instead of guessing.
USE (Utilization, Saturation, Errors): For system resources. Example: CPU usage %, load average, disk latency, and network errors.
RED (Rate, Errors, Duration): For services. Example: request rate, error rate, and response duration.
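To apply USE quickly with the recording rules from earlier, you can spot-check utilization and saturation signals for all hosts in one pass. A rough sketch (PROM is your Prometheus address; jq is optional):
PROM="http://localhost:9090"
for m in node:cpu_utilization:avg5m node:load1_per_core node:memory_used_percent node:disk_util_percent node:tcp_retrans_per_s; do
  echo "== $m =="
  curl -s "$PROM/api/v1/query" --data-urlencode "query=$m" \
    | jq -r '.data.result[] | "\(.metric.instance)\t\(.value[1])"'
done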
FAQs
What is the minimum system requirement to run Prometheus in production?
At least 2 vCPUs, 4 GB RAM, and 20–50 GB SSD/NVMe for TSDB/WAL.
What signals does Node Exporter collect?
CPU, memory, disk, network, filesystem, TCP states, and more.
Which PromQL queries are most useful for detecting bottlenecks?
CPU utilization, load per core, memory usage %, disk utilization & latency, NIC throughput/errors, Blackbox probe success and duration.
Conclusion
You now have a complete path from a local Docker lab to a production, systemd-based monitoring stack. Prometheus collects metrics, Node Exporter exposes host signals, recording rules precompute critical SRE signals, Alertmanager routes and silences alerts, and Blackbox and Grafana complete end-to-end checks and visualization.
Looking for reliable Linux hosting to run this stack in production? Try PerLod Bare Metal Hosting, which offers optimized servers for monitoring and observability workloads.
We hope you enjoy this guide on Linux Server Performance Monitoring with Prometheus. Subscribe to X and Facebook channels to get the latest articles and news.
For further reading:
Move from Shared Hosting to VPS without Downtime