Linux Server Performance Monitoring with Prometheus

Monitoring server performance is essential for reliability, stability, and efficient resource usage. Prometheus is an open-source monitoring and alerting toolkit widely used for this purpose. This step-by-step guide covers Linux Server Performance Monitoring with Prometheus, Node Exporter, and Alertmanager.

Prometheus can track key metrics such as CPU usage, memory consumption, disk I/O, and network activity. These metrics are usually exposed by exporters like Node Exporter, which integrates easily with Prometheus.

You can practice this setup on any Linux server or cloud VM. If you don’t have one yet, you can quickly spin up a server from PerLod Hosting, which provides ready-to-use Linux environments ideal for monitoring labs.

What You Will Build: Linux Server Performance Monitoring with Prometheus

Before we dive into the steps, here is what you will build in this guide.

  • Prometheus: The database and collector of metrics.
  • Node Exporter: Runs on servers to expose CPU, memory, disk, and network stats.
  • Recording Rules: Save heavy queries as ready-to-use metrics.
  • Alertmanager: Sends alerts (Slack, email, etc.), groups and silences them.
  • Blackbox Exporter (optional): Checks websites, APIs, or endpoints.
  • Grafana (optional): Dashboards to visualize everything.

Prometheus, Alertmanager, Grafana, and Blackbox will be set up on the Prometheus host (monitoring server). Node Exporter will be set up on every target server.

Prerequisites for Performance Monitoring with Prometheus

Before installing Prometheus, you must make sure your server is ready. If these basics are missing, the setup may fail or give wrong results.

1. You need a VM or host for Prometheus with at least 2 vCPUs, 4 GB RAM, and 20–50 GB of SSD or NVMe storage for the TSDB and WAL to start.

2. Root or Sudo access.

3. Open ports:

  • Prometheus: 9090
  • Alertmanager: 9093
  • Node Exporter: 9100
  • Blackbox Exporter: 9115

4. Time sync (chrony or systemd-timesyncd): Metrics are timestamped, so clocks must be accurate.
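For example, on a systemd-based distribution you can quickly confirm the clock is in sync before continuing (chronyc is only available if chrony is installed):

timedatectl status | grep -i "synchronized"   # should show "System clock synchronized: yes"
chronyc tracking                              # optional; shows offset and stratum if chrony is used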

Set up Prometheus for Testing on One Machine

This step is for testing. You can set up Prometheus, Node Exporter, Alertmanager, and Grafana quickly using Docker Compose, all on one machine. It’s not production-ready, but it lets you learn fast.

Create a Docker Compose YAML file with your desired text editor:

nano docker-compose.yml

Add the following configuration to the file. Key parts include:

  • image: prom/prometheus:latest. It tells Docker to use the Prometheus container image.
  • ports: ["9090:9090"]. It makes the Prometheus web UI available on your host machine's port 9090.
  • volumes: It mounts configs from your host to the container.
  • command: It passes flags to the Prometheus process.

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports: ["9090:9090"]
    volumes:
      - ./prometheus:/etc/prometheus
      - promdata:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle
      - --storage.tsdb.wal-compression
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    network_mode: host
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - --path.procfs=/host/proc
      - --path.sysfs=/host/sys
      - --path.rootfs=/rootfs
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports: ["9093:9093"]
    volumes:
      - ./alertmanager:/etc/alertmanager
    command: ["--config.file=/etc/alertmanager/alertmanager.yml"]
    restart: unless-stopped

  blackbox:
    image: prom/blackbox-exporter:latest
    container_name: blackbox
    ports: ["9115:9115"]
    volumes:
      - ./blackbox:/etc/blackbox_exporter
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports: ["3000:3000"]
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped

volumes:
  promdata: {}
  grafana_data: {}
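
The Compose file expects config files in ./prometheus, ./alertmanager, and ./blackbox (the Alertmanager and Blackbox configs shown later in this guide work here too). A minimal ./prometheus/prometheus.yml for this lab could look like the sketch below; HOST_IP is a placeholder for your Docker host's IP address, which the Prometheus container needs because Node Exporter runs with network_mode: host:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node_exporter
    static_configs:
      - targets: ["HOST_IP:9100"]   # replace with your Docker host's IP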

After creating the minimal configs, you can start the stack with Docker Compose:

docker compose up -d
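
To confirm the lab stack came up, you can list the containers and hit the Prometheus health endpoint:

docker compose ps
curl -fsS http://localhost:9090/-/healthy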

Tip: For production, pin image tags to specific versions after testing.

Install Prometheus Natively on Linux (Systemd)

In this step, you can install Prometheus natively on Linux, managed by systemd. This is more reliable for production than Docker.

Create a dedicated user and directories on your Prometheus host with the following commands:

sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus || true
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

Then, download Prometheus with the following commands. The version below is an example; replace it with the latest stable release:

VER="2.53.0"
cd /tmp
curl -fL -O https://github.com/prometheus/prometheus/releases/download/v${VER}/prometheus-${VER}.linux-amd64.tar.gz

Then, extract and install Prometheus with the following commands:

sudo tar -xzf prometheus-${VER}.linux-amd64.tar.gz -C /tmp
cd /tmp/prometheus-${VER}.linux-amd64
sudo install -o root -g root -m 0755 prometheus promtool /usr/local/bin/
sudo cp -r console_libraries consoles /etc/prometheus/
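
You can confirm the binaries are on the PATH:

prometheus --version
promtool --version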

Next, you must configure a basic Prometheus YAML file. Create it with your preferred text editor:

sudo nano /etc/prometheus/prometheus.yml

Add the following basic configuration to the file:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

rule_files:
  - /etc/prometheus/rules.yml

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node_exporter
    static_configs:
      # Add all your servers here (replace with real IPs/hosts)
      - targets: [
          "localhost:9100"
        ]

The global.scrape_interval: 15s setting tells Prometheus to scrape every target each 15 seconds, and evaluation_interval: 15s controls how often recording and alerting rules are evaluated.
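
Before starting the service, you can validate the file with promtool. Since rule_files points at /etc/prometheus/rules.yml, which is created later in this guide, create an empty placeholder first so the check passes:

echo "groups: []" | sudo tee /etc/prometheus/rules.yml > /dev/null
promtool check config /etc/prometheus/prometheus.yml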

To run Prometheus as a service, you must create the systemd unit file:

sudo nano /etc/systemd/system/prometheus.service

Add the following config to the file:

[Unit]
Description=Prometheus TSDB
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression \
  --web.enable-lifecycle
ExecReload=/bin/kill -HUP $MAINPID
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then, start and enable the service with the following commands:

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus
sudo systemctl status prometheus --no-pager -l

To verify that Prometheus is running and to see the status of its scrape targets (Node Exporter instances will appear here after the next step), navigate to the following URL:

http://PROMETHEUS_HOST:9090/targets
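
If the host has no browser, you can query the same target information from the HTTP API:

curl -s http://localhost:9090/api/v1/targets | head -c 500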

Install Node Exporter on Every Linux Server You Want to Monitor

Now you must install Node Exporter on every Linux server you want to monitor. It collects CPU, memory, disk, and network metrics and exposes them on port 9100.

First, create a non-login user for Node exporter with the following command:

sudo useradd --no-create-home --shell /usr/sbin/nologin nodeusr || true

Then, download the latest binary package of Node exporter. This is an example version; update it if needed.

VER="1.8.1"
cd /tmp
curl -fL -O https://github.com/prometheus/node_exporter/releases/download/v${VER}/node_exporter-${VER}.linux-amd64.tar.gz

Extract the downloaded file and install it with the following commands:

sudo tar -xzf node_exporter-${VER}.linux-amd64.tar.gz -C /usr/local/bin --strip-components=1 node_exporter-${VER}.linux-amd64/node_exporter

sudo chown root:root /usr/local/bin/node_exporter

To run Node exporter as a service, create the systemd unit file:

sudo nano /etc/systemd/system/node_exporter.service

Add the following content to it:

[Unit]
Description=Prometheus Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nodeusr
Group=nodeusr
Type=simple
ExecStart=/usr/local/bin/node_exporter \
  --web.listen-address=":9100" \
  --collector.tcpstat \
  --collector.processes \
  --collector.filesystem.mount-points-exclude="^/(sys|proc|dev|run|var/lib/docker/.+|snap)($$|/)"
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then, start and enable Node exporter with:

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
sudo systemctl status node_exporter --no-pager -l

To test the metrics endpoint locally, you can run:

curl -fsS http://localhost:9100/metrics | head

It shows the first lines of metrics to confirm it works.

Now you can add this host’s IP:9100 to the Prometheus node_exporter job’s targets list. Remember to allow TCP 9100 from the Prometheus server only.
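
For example, with ufw or firewalld you can restrict port 9100 to the Prometheus server (PROMETHEUS_IP is a placeholder for its address):

# ufw
sudo ufw allow from PROMETHEUS_IP to any port 9100 proto tcp

# firewalld
sudo firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="PROMETHEUS_IP" port port="9100" protocol="tcp" accept'
sudo firewall-cmd --reload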

Set up Blackbox Exporter (Optional)

While Node Exporter shows internal health like CPU, disk, and memory, Blackbox Exporter checks from the “outside”. For example, can your website be reached? Can it be pinged?

Blackbox Exporter is installed the same way as Node Exporter (download the blackbox_exporter release binary, create a user and a systemd unit), or it is already running if you used the Docker lab. Then create the Blackbox config file:

sudo mkdir -p /etc/blackbox_exporter
sudo nano /etc/blackbox_exporter/blackbox.yml

Add the following configuration to the file:

modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: "ip4"
  icmp:
    prober: icmp
    timeout: 3s

In this configuration:

  • http_2xx: Checks whether an HTTP endpoint returns a successful (2xx) status code.
  • icmp: Checks whether a host responds to ping.

Then, you must add this Prometheus scrape job for Blackbox to the prometheus.yml file:

- job_name: blackbox_http
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
        - https://example.com/
        - https://api.example.com/health
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: 127.0.0.1:9115  # blackbox exporter address

Signals you must watch for include:

probe_success, probe_http_status_code, probe_duration_seconds, probe_dns_lookup_time_seconds, probe_tcp_connection_duration_seconds
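
You can also hit the exporter directly to test a probe before Prometheus starts scraping it:

curl -s "http://localhost:9115/probe?module=http_2xx&target=https://example.com/" | grep probe_success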

Prometheus Recording Rules (Pre-computed Metrics)

Prometheus queries can get heavy, especially those using rate() or aggregations like avg by (...). Recording rules save the results as new, pre-computed metrics, so queries and dashboards stay fast.

Create the recording rules file on the Prometheus host:

sudo nano /etc/prometheus/rules.yml

Then, add the following rules to the file:

groups:
- name: sre-core
  interval: 15s
  rules:
    # ==== CPU UTILIZATION / SATURATION ====
    - record: node:cpu_utilization:avg5m
      expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

    - record: node:cpu_iowait:avg5m
      expr: 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))

    - record: node:cpu_steal:avg5m
      expr: 100 * avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m]))

    # CPU load vs core count (load>cores suggests saturation)
    - record: node:load1_per_core
      expr: node_load1 / count by (instance) (node_cpu_seconds_total{mode="system"})

    # ==== MEMORY PRESSURE ====
    - record: node:memory_used_percent
      expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

    # ==== DISK UTILIZATION & LATENCY ====
    - record: node:disk_util_percent
      expr: 100 * max by (instance, device) (
              rate(node_disk_io_time_seconds_total{device!~"loop|ram|fd|sr.*"}[5m])
            )

    - record: node:disk_read_latency_ms
      expr: 1000 * (rate(node_disk_read_time_seconds_total{device!~"loop|ram|fd|sr.*"}[5m])
                    / rate(node_disk_reads_completed_total{device!~"loop|ram|fd|sr.*"}[5m]))

    - record: node:disk_write_latency_ms
      expr: 1000 * (rate(node_disk_write_time_seconds_total{device!~"loop|ram|fd|sr.*"}[5m])
                    / rate(node_disk_writes_completed_total{device!~"loop|ram|fd|sr.*"}[5m]))

    # ==== FILESYSTEM CAPACITY ====
    - record: node:fs_used_percent
      expr: 100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"}
                        / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"})

    # ==== NETWORK THROUGHPUT & ERRORS ====
    - record: node:net_rx_bytes_per_s
      expr: sum by (instance, device) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))

    - record: node:net_tx_bytes_per_s
      expr: sum by (instance, device) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))

    - record: node:tcp_retrans_per_s
      expr: rate(node_netstat_Tcp_RetransSegs[5m])

    # ==== BLACKBOX (if used) ====
    - record: blackbox:http_availability
      expr: avg by (instance) (probe_success)

    - record: blackbox:http_duration_seconds
      expr: avg by (instance) (probe_duration_seconds)

Check if the rules file is valid with the command below:

promtool check rules /etc/prometheus/rules.yml

Next, run the following command to reload the config without restarting:

curl -X POST http://localhost:9090/-/reload

Or, you can run:

sudo systemctl reload prometheus
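
Once the rules are loaded, you can query the pre-computed series directly in the Prometheus UI, for example to find the busiest hosts or nearly full filesystems:

topk(5, node:cpu_utilization:avg5m)
node:fs_used_percent > 80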

Set up Alerts for Prometheus that Catch Bottlenecks Early

Alerts notify you when something is wrong, such as high CPU, a full disk, or a host that is down. Prometheus evaluates the alert rules and sends firing alerts to Alertmanager.

Create the alerts file, and remember to also list it under rule_files in prometheus.yml (a snippet is shown after the validation commands below):

sudo nano /etc/prometheus/alerts.yml

Add the following alerts to the file:

groups:
- name: node-alerts
  rules:
    - alert: InstanceDown
      expr: up == 0
      for: 5m
      labels: { severity: critical }
      annotations:
        summary: "Instance down ({{ $labels.instance }})"
        description: "No scrape targets responding for 5m."

    - alert: HighCPU
      expr: node:cpu_utilization:avg5m > 85
      for: 10m
      labels: { severity: warning }
      annotations:
        summary: "High CPU on {{ $labels.instance }}"
        description: "CPU > 85% (avg 5m) for 10m; check hot processes and scaling."

    - alert: HighCPU_IOWait
      expr: node:cpu_iowait:avg5m > 10
      for: 10m
      labels: { severity: warning }
      annotations:
        summary: "High IO wait on {{ $labels.instance }}"
        description: "CPU waiting on disk > 10% for 10m; suspect disk bottleneck."

    - alert: LoadExceedsCores
      expr: node:load1_per_core > 1.0
      for: 10m
      labels: { severity: warning }
      annotations:
        summary: "CPU saturation on {{ $labels.instance }}"
        description: "Load1 per core > 1 indicates runnable queue backlog."

    - alert: MemoryPressure
      expr: node:memory_used_percent > 90
      for: 10m
      labels: { severity: warning }
      annotations:
        summary: "Memory pressure on {{ $labels.instance }}"
        description: "Available memory < 10% for 10m; check caches/process leaks."

    - alert: DiskUtilHigh
      expr: node:disk_util_percent > 80
      for: 10m
      labels: { severity: warning }
      annotations:
        summary: "Disk busy on {{ $labels.instance }} ({{ $labels.device }})"
        description: "Disk io_time > 80% for 10m; investigate latency & queue."

    - alert: DiskLatencyHigh
      expr: (node:disk_read_latency_ms > 50) or (node:disk_write_latency_ms > 50)
      for: 10m
      labels: { severity: warning }
      annotations:
        summary: "Disk latency high on {{ $labels.instance }}"
        description: "Average disk latency > 50 ms; suspect underlying storage."

    - alert: FilesystemFilling
      expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"}[6h], 4*3600) < 0
      for: 15m
      labels: { severity: warning }
      annotations:
        summary: "Filesystem filling soon on {{ $labels.instance }}"
        description: "Projected to fill in < 4h. Act before outage."

    - alert: NetworkErrors
      expr: rate(node_network_receive_errs_total[5m]) > 0 or rate(node_network_transmit_errs_total[5m]) > 0
      for: 5m
      labels: { severity: warning }
      annotations:
        summary: "NIC errors on {{ $labels.instance }}"
        description: "Persistent NIC errors/drops; check cabling, MTU, driver."

- name: blackbox-alerts
  rules:
    - alert: EndpointDown
      expr: blackbox:http_availability < 1
      for: 2m
      labels: { severity: critical }
      annotations:
        summary: "Endpoint down ({{ $labels.instance }})"
        description: "Blackbox probe failing."

    - alert: SlowEndpoint
      expr: blackbox:http_duration_seconds > 1
      for: 5m
      labels: { severity: warning }
      annotations:
        summary: "Slow endpoint ({{ $labels.instance }})"
        description: "End-to-end latency >1s; check DNS/TCP/SSL/app."

Validate the new rules file, then reload the configuration with the following commands:

promtool check rules /etc/prometheus/alerts.yml
curl -X POST http://localhost:9090/-/reload
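
The rule_files section of prometheus.yml should now list both files:

rule_files:
  - /etc/prometheus/rules.yml
  - /etc/prometheus/alerts.yml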

Set up Alertmanager for Prometheus

Prometheus fires alerts, but Alertmanager decides what to do with them: send them to Slack or email, group similar alerts, and silence them temporarily.

Create the user, download, and set up Alertmanager with the following commands:

sudo useradd --no-create-home --shell /usr/sbin/nologin alertmanager || true
VER="0.27.0" #example version; update as needed
cd /tmp
curl -fL -O https://github.com/prometheus/alertmanager/releases/download/v${VER}/alertmanager-${VER}.linux-amd64.tar.gz
sudo tar -xzf alertmanager-${VER}.linux-amd64.tar.gz -C /tmp
cd /tmp/alertmanager-${VER}.linux-amd64
sudo install alertmanager amtool /usr/local/bin/
sudo mkdir -p /etc/alertmanager
sudo chown -R alertmanager:alertmanager /etc/alertmanager

Then, create the Alertmanager YAML file:

sudo nano /etc/alertmanager/alertmanager.yml

Add the following content to the file with Slack and Email examples:

global:
  resolve_timeout: 5m

route:
  receiver: default
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 3m
  repeat_interval: 4h

receivers:
  - name: default
    slack_configs:
      - send_resolved: true
        api_url: "https://hooks.slack.com/services/XXX/YYY/ZZZ"
        channel: "#alerts"
        title: "{{ .CommonLabels.alertname }}: {{ .CommonLabels.instance }}"
        text: "{{ range .Alerts }}*{{ .Annotations.summary }}*\n{{ .Annotations.description }}\n{{ end }}"
    email_configs:
      - to: [email protected]
        from: [email protected]
        smarthost: smtp.example.com:587
        auth_username: [email protected]
        auth_identity: [email protected]
        auth_password: "REDACTED"

To run Alertmanager as a service, create a systemd unit file:

sudo nano /etc/systemd/system/alertmanager.service

Add this to the file:

[Unit]
Description=Prometheus Alertmanager
After=network-online.target
Wants=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager
Restart=on-failure

[Install]
WantedBy=multi-user.target

Then start and enable Alertmanager with the following commands:

sudo mkdir -p /var/lib/alertmanager && sudo chown alertmanager:alertmanager /var/lib/alertmanager
sudo systemctl daemon-reload
sudo systemctl enable --now alertmanager
sudo systemctl status alertmanager --no-pager -l
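
To confirm the notification pipeline end to end, you can push a test alert straight to Alertmanager; the labels below are arbitrary examples:

amtool alert add --alertmanager.url=http://localhost:9093 alertname=TestAlert severity=warning instance=test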

Add Grafana Dashboard (Optional)

Grafana gives you dashboards and graphs. Install Grafana from the official repository (a Debian/Ubuntu sketch follows the list below). Then:

  • Point a Prometheus data source at http://PROMETHEUS_HOST:9090.
  • Import a community Node Exporter dashboard and a Blackbox dashboard for quick visibility.
  • Build panels with the recording rules to keep dashboards fast.
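
A minimal install sketch for Debian/Ubuntu, based on Grafana's official APT repository (adjust for your distribution):

sudo apt-get install -y apt-transport-https software-properties-common wget
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update && sudo apt-get install -y grafana
sudo systemctl enable --now grafana-server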

Useful PromQL Queries

PromQL is Prometheus’s query language. This step lists useful queries for CPU, memory, disk, and network. All queries are aggregated by instance, so you can see which server is hot.

CPU:

# Total CPU utilization (%): idle → busy
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

# I/O wait & steal time (%):
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m]))
100 * avg by (instance) (rate(node_cpu_seconds_total{mode="steal"}[5m]))

# Load per core (>1 means runnable queue backlog)
node_load1 / count by (instance) (node_cpu_seconds_total{mode="system"})

Memory:

# Memory used percent (%):
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

Disk:

# Disk busy (% of time doing I/O), per device
100 * max by (instance, device) (rate(node_disk_io_time_seconds_total{device!~"loop|ram|fd|sr.*"}[5m]))

# Average latency (ms) per op:
1000 * (rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]))
1000 * (rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]))

# Filesystem usage (%): exclude tmpfs/overlay
100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"}
             / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|aufs|squashfs"})

Network:

# Throughput per NIC (bytes/s):
sum by (instance, device) (rate(node_network_receive_bytes_total{device!~"lo"}[5m]))
sum by (instance, device) (rate(node_network_transmit_bytes_total{device!~"lo"}[5m]))

# Errors/drops & TCP retransmits (packets/s):
rate(node_network_receive_errs_total[5m])
rate(node_network_transmit_errs_total[5m])
rate(node_network_receive_drop_total[5m])
rate(node_network_transmit_drop_total[5m])
rate(node_netstat_Tcp_RetransSegs[5m])

Blackbox (E2E):

# Availability and timing
avg by (instance) (probe_success)
max by (instance, phase) (probe_duration_seconds)

Check Bottlenecks Systematically: USE And RED

Two troubleshooting frameworks, USE and RED, help you check for bottlenecks systematically instead of guessing.

USE (Utilization, Saturation, Errors): for system resources. Examples: CPU usage %, load average, disk latency, and network errors.

RED (Rate, Errors, Duration): for services. Examples: request rate, error rate, and response duration.
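
As a quick USE pass on a host, using the recording rules defined earlier:

# Utilization: how busy is the CPU?
node:cpu_utilization:avg5m

# Saturation: is work queuing up?
node:load1_per_core

# Errors: are packets failing at the NIC?
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])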

FAQs

What is the minimum system requirement to run Prometheus in production?

At least 2 vCPUs, 4 GB RAM, and 20–50 GB SSD/NVMe for TSDB/WAL.

What signals does Node Exporter collect?

CPU, memory, disk, network, filesystem, TCP states, and more.

Which PromQL queries are most useful for detecting bottlenecks?

CPU utilization, load per core, memory usage %, disk utilization & latency, NIC throughput/errors, Blackbox probe success and duration.

Conclusion

You now have a complete path from a local Docker lab to a production, systemd-based monitoring stack. Prometheus collects metrics, Node Exporter exposes host signals, recording rules precompute critical SRE signals, Alertmanager routes and silences alerts, and Blackbox and Grafana complete end-to-end checks and visualization.

Looking for reliable Linux hosting to run this stack in production? Try PerLod Bare Metal Hosting, which offers optimized servers for monitoring and observability workloads.

We hope you enjoyed this guide on Linux Server Performance Monitoring with Prometheus. Subscribe to our X and Facebook channels to get the latest articles and news.

For further reading:

Move from Shared Hosting to VPS without Downtime

JMeter VPS Load Testing: Advanced Step-By-Step Guide

Linux kernel live patching with Zero Downtime
