Advanced Zabbix Monitoring for Dedicated NVIDIA GPU Servers

Modern AI, rendering, and HPC workloads rely on dedicated GPU servers, so deep, real-time visibility into GPU utilization, memory, temperature, and power is essential for performance and uptime. This guide walks through setting up Zabbix monitoring for dedicated GPU servers.

You will install and configure Zabbix 7.4 on Ubuntu 24.04 LTS from start to finish.

You will also integrate your NVIDIA-based GPU nodes using Zabbix agent2 and the official NVIDIA template, which lets you monitor every GPU in your infrastructure.

If you are looking for high-performance GPU servers optimized for AI, rendering, and HPC workloads, PerLod Hosting provides the best plans for you.

Requirements for Zabbix Monitoring for Dedicated GPU Servers

Before starting the setup, make sure your Zabbix server and GPU nodes meet these requirements:

  • A clean Ubuntu 24.04 LTS server (VM or bare metal) for the Zabbix server, with at least 4 vCPUs, 8 GB RAM, and 50 GB+ disk if you plan to monitor multiple GPU nodes and keep history data.
  • Root or sudo access on the Zabbix server and all dedicated GPU nodes with working internet access to download Zabbix repositories and packages.
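  • A static IP address configured on the Zabbix server and each GPU node, with firewall rules allowing TCP ports 80/8080 (HTTP) and 443 (if using HTTPS) to the Zabbix server, 10050 (Zabbix agent, passive checks) to each GPU node, and 10051 (Zabbix server trapper) to the Zabbix server if you use active checks.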
  • NVIDIA drivers and CUDA stack already installed on each dedicated GPU server, with nvidia-smi working correctly and showing all GPUs.
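
The last requirement can be checked quickly on each GPU node. The following is a minimal sketch, not part of the official setup: it assumes that nvidia-smi -L prints one line per GPU starting with "GPU <index>:", and the helper name count_gpus is made up for illustration.

```shell
# Count GPUs from `nvidia-smi -L` output read on stdin.
# Assumption: each physical GPU is listed on a line starting with "GPU <index>:".
count_gpus() {
    # grep -c exits non-zero when nothing matches, so force a zero exit status.
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Run the check only where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "GPUs visible to the driver: $(nvidia-smi -L | count_gpus)"
else
    echo "nvidia-smi not found; install the NVIDIA driver first"
fi
```

Compare the reported count with the number of GPUs physically installed in the node before you continue.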

Now, proceed to the following steps to set up Zabbix Monitoring for Dedicated GPU Servers.

Step 1. Set up Latest Zabbix Server for GPU Monitoring

The first step is to install the Zabbix server on your VM or bare metal. Add the latest Zabbix repository to your server with the commands below:

sudo -s
wget https://repo.zabbix.com/zabbix/7.4/ubuntu/pool/main/z/zabbix-release/zabbix-release_7.4-1+ubuntu24.04_all.deb
dpkg -i zabbix-release_7.4-1+ubuntu24.04_all.deb
apt update

Then, install Zabbix server, frontend, and agent with the following command:

apt install zabbix-server-mysql zabbix-frontend-php zabbix-nginx-conf zabbix-sql-scripts zabbix-agent2 -y

On Ubuntu 24.04, PHP 8.3 is installed by default, and Nginx with PHP-FPM works well with the Zabbix frontend package.

Next, install the database server. This guide uses MariaDB, although MySQL also works. Then run the security script to set a root password and remove insecure defaults:

apt install mariadb-server -y
mysql_secure_installation

Log in to your MariaDB shell with the password you have configured and create a DB and user with the commands below:

mysql -u root -p
CREATE DATABASE zabbix CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
CREATE USER 'zabbix'@'localhost' IDENTIFIED BY 'Strong_Zabbix_DB_Pass';
GRANT ALL PRIVILEGES ON zabbix.* TO 'zabbix'@'localhost';
FLUSH PRIVILEGES;
EXIT;

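Once you are done, import the initial schema and data into the zabbix database, entering the zabbix database user's password when prompted. If binary logging is enabled on your database server (the default on MySQL 8), run SET GLOBAL log_bin_trust_function_creators = 1; as the database root user before the import and set it back to 0 afterwards: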

zcat /usr/share/zabbix-sql-scripts/mysql/server.sql.gz | mysql -u zabbix -p zabbix
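
If zcat reports that the file does not exist, the package may have installed the schema in a different directory. The sketch below looks for it in two candidate locations. The second path is an assumption about newer package layouts, and the helper name first_existing is made up for illustration.

```shell
# Print the first existing file among the given candidate paths.
first_existing() {
    local p
    for p in "$@"; do
        if [ -f "$p" ]; then
            printf '%s\n' "$p"
            return 0
        fi
    done
    return 1
}

# Candidate locations; the second one is an assumption, not taken from this guide.
if schema=$(first_existing \
        /usr/share/zabbix-sql-scripts/mysql/server.sql.gz \
        /usr/share/zabbix/sql-scripts/mysql/server.sql.gz); then
    echo "Schema found at: $schema"
else
    echo "Schema not found; locate it with: dpkg -L zabbix-sql-scripts | grep server.sql.gz"
fi
```

Use the printed path in the zcat command above.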

Next, tune the Zabbix server configuration so it can handle a larger number of monitored items. Open the file with your preferred text editor:

nano /etc/zabbix/zabbix_server.conf

Key parameters to edit:

DBName=zabbix
DBUser=zabbix
DBPassword=Strong_Zabbix_DB_Pass

CacheSize=512M
HistoryCacheSize=256M
TrendCacheSize=128M
StartPollers=40
StartPollersUnreachable=20
StartTrappers=20
StartDiscoverers=10

Note: If you monitor many GPU nodes with many NVML metrics, you can raise these values further. The new values take effect the next time zabbix-server is restarted.
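
After editing, it can help to confirm which values are actually active, because the default file contains many commented-out examples. This is a rough sketch with a hypothetical helper, conf_get, that prints the last uncommented value of a parameter. It assumes parameters are written as Key=Value with no leading spaces.

```shell
# Print the last uncommented value of a parameter in a Key=Value config file.
# $1 = config file, $2 = parameter name
conf_get() {
    awk -F= -v key="$2" '
        $1 == key { val = substr($0, length(key) + 2) }   # keep everything after "key="
        END { if (val != "") print val }
    ' "$1"
}

conf=/etc/zabbix/zabbix_server.conf
if [ -r "$conf" ]; then
    for key in CacheSize HistoryCacheSize TrendCacheSize StartPollers; do
        echo "$key=$(conf_get "$conf" "$key")"
    done
fi
```

Lines that start with # are skipped because their first field does not equal the bare parameter name.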

Step 2. Configure Nginx and PHP-FPM for Zabbix GPU Monitoring

At this point, you must prepare the Zabbix web interface to run efficiently on Ubuntu 24.04 using Nginx as the web server and PHP-FPM 8.3 for processing PHP requests.

Create or edit the Zabbix Nginx config file with the command below:

nano /etc/zabbix/nginx.conf

Adjust or add the following settings to the file:

server {
    listen          8080;
    server_name     zabbix.example.com;

    root    /usr/share/zabbix;

    index   index.php;

    location = /favicon.ico {
        log_not_found   off;
    }

    location / {
        try_files       $uri $uri/ =404;
    }

    location /assets {
        access_log      off;
        expires         10d;
    }

    location ~ \.php$ {
        fastcgi_pass            unix:/run/php/php8.3-fpm.sock;
        fastcgi_index           index.php;
        fastcgi_param           SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include                 fastcgi_params;
    }
}

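Ubuntu 24.04 uses PHP 8.3 by default, so the PHP-FPM socket path is /run/php/php8.3-fpm.sock. The root path must match the directory where the zabbix-frontend-php package installed the frontend files. Some package versions may place them in a subdirectory such as /usr/share/zabbix/ui, which you can confirm with dpkg -L zabbix-frontend-php.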

Once you are done, start and enable the services with the commands below:

systemctl restart zabbix-server zabbix-agent2 nginx php8.3-fpm
systemctl enable zabbix-server zabbix-agent2 nginx php8.3-fpm

At this point, you can access the Zabbix web installer with:

http://zabbix.example.com:8080

Follow the setup wizard provided by Zabbix:

  • Check that all PHP modules and configuration pass the requirements.
  • Set DB connection: host localhost, DB zabbix, user zabbix, password Strong_Zabbix_DB_Pass.
  • Set server name and default timezone.

Then, log in with:

  • Username: Admin
  • Password: zabbix

Remember to change the password immediately.

Step 3. Install Zabbix Agent2 on GPU Nodes

After you set up the Zabbix server, install Zabbix Agent2 on each GPU node you want to monitor by running the commands below:

sudo -s
wget https://repo.zabbix.com/zabbix/7.4/ubuntu/pool/main/z/zabbix-release/zabbix-release_7.4-1+ubuntu24.04_all.deb
dpkg -i zabbix-release_7.4-1+ubuntu24.04_all.deb
apt update
apt install zabbix-agent2 'zabbix-agent2-plugin-*' -y

The plugin packages include the NVIDIA monitoring plugin required for the official template.

Then, open the Zabbix Agent2 config file with the command below:

nano /etc/zabbix/zabbix_agent2.conf

Adjust the values as shown below:

Server=10.0.0.10
ServerActive=10.0.0.10
Hostname=gpu-node-01
HostnameItem=system.hostname
ListenPort=10050
Timeout=10
Include=/etc/zabbix/zabbix_agent2.d/*.conf

  • Server and ServerActive: The IP address or DNS name of the Zabbix server.
  • Hostname: Must match the host name configured in the Zabbix UI.
  • HostnameItem: Only used when Hostname is not set, so it has no effect here and can be omitted.
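
When you roll this out to several GPU nodes, generating the settings from two inputs keeps them consistent. The following is a minimal sketch. The function name render_agent_conf is hypothetical, and the output is only printed so you can review it before copying the values into /etc/zabbix/zabbix_agent2.conf.

```shell
# Print the agent settings used in this guide for one GPU node.
# $1 = Zabbix server address, $2 = host name as registered in the Zabbix UI
render_agent_conf() {
    cat <<EOF
Server=$1
ServerActive=$1
Hostname=$2
ListenPort=10050
Timeout=10
EOF
}

# Example values taken from this guide.
render_agent_conf 10.0.0.10 gpu-node-01
```

Only the host name changes from node to node, so a loop over your node names would produce one settings block per server.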

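Enable Zabbix Agent2 and restart it so the new configuration is loaded (the package usually starts the service during installation, so starting it alone would not pick up your changes):

systemctl enable zabbix-agent2
systemctl restart zabbix-agent2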

Check if it is up and running:

systemctl status zabbix-agent2 --no-pager

Step 4. Enable the Official NVIDIA GPU Template by Zabbix Agent2

You can directly connect Zabbix to NVIDIA’s NVML layer using the official NVIDIA GPU Template by Zabbix Agent2.

By linking this template to your GPU hosts, Zabbix can automatically discover all installed NVIDIA GPUs and start collecting detailed statistics such as utilization, memory usage, temperature, power draw, and driver version.

This gives you deep visibility into the health and performance of every GPU in your cluster.

This only works for Zabbix server 7.4+ and agent2 with the NVIDIA plugin. Be sure NVIDIA drivers are installed on the GPU node.

To link this template to a GPU host, follow the steps below:

From the Zabbix UI, navigate to the following path:

Data collection → Hosts → Create host

Set the following values:

  • Host name: gpu-node-01
  • Groups: create or select GPU Servers
  • Interfaces: Agent, IP 10.0.0.21, port 10050

In the Templates tab, link:

  • Linux by Zabbix agent (OS baseline).
  • Nvidia by Zabbix agent 2 for passive checks, or Nvidia by Zabbix agent 2 active for active checks.

Save the host; Zabbix will begin polling the node and discovering GPUs.

From the Zabbix server, test availability with an NVML key (zabbix_get is provided by the zabbix-get package, which you may need to install first):

zabbix_get -s 10.0.0.21 -k nvml.system.driver.version

If the configuration is correct, this returns the NVIDIA driver version from the GPU node.
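
With more than one GPU node, the same check can be looped. The sketch below makes an assumption: a successful reply looks like a dotted numeric version (for example 550.54.14), and anything else is treated as an error message. The helper names are made up for illustration.

```shell
# Return success if the argument looks like a dotted numeric driver version.
looks_like_driver_version() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(\.[0-9]+)+$'
}

# Query one node and report whether the NVML key returned a version.
# $1 = GPU node IP address
check_node() {
    local out
    out=$(zabbix_get -s "$1" -k nvml.system.driver.version 2>&1)
    if looks_like_driver_version "$out"; then
        echo "$1 OK, driver $out"
    else
        echo "$1 FAILED: $out"
    fi
}

# Add the IP addresses of your other GPU nodes to this list.
if command -v zabbix_get >/dev/null 2>&1; then
    for ip in 10.0.0.21; do
        check_node "$ip"
    done
fi
```

A FAILED line usually points to a firewall rule, a Server value in the agent configuration that does not include the Zabbix server, or a missing NVIDIA plugin.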

Visualizing NVIDIA GPU Metrics and Alerts in Zabbix

After you start collecting NVIDIA GPU data in Zabbix, you need a simple way to see what is happening and get alerts when something goes wrong.

In this step, you will check live GPU metrics per host, build easy‑to‑read dashboards, and set basic triggers so Zabbix can notify you about overheating GPUs, sustained high utilization, or memory pressure before they impact your workloads.

1. View latest GPU data:

From the Zabbix UI, navigate to the following path:

Monitoring → Hosts → gpu-node-01 → Latest data

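You can filter by tag or search for item keys with the nvml.device prefix, such as:

nvml.device.utilization
nvml.device.memory.fb.get
nvml.device.temperature
nvml.device.power.usage

Exact key names and parameters depend on the plugin and template version, so confirm them in the template's item list.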

You should see periodically updated numeric values for each discovered GPU.

2. Create GPU-focused dashboards:

Navigate to the following path:

Monitoring → Dashboards → Create dashboard

Add widgets:

  • A Graph widget showing GPU utilization across GPU nodes.
  • A Top hosts widget filtered by the GPU Servers group to highlight the nodes with the highest GPU load.

These widgets turn the raw NVML numbers into an easy‑to‑read overview of your whole GPU cluster.

3. Trigger tuning for GPU workloads:

With the NVIDIA template, you can create new triggers or change the existing ones:

  • High GPU temperature, for example, above 80°C for 5 minutes.
  • GPU utilization above 90% for 15 minutes on training nodes.
  • GPU memory utilization above 95%, which indicates out-of-memory (OOM) risk.

Configure these under:

Data collection → Templates → [Nvidia template] → Triggers
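
A condition such as "above 80°C for 5 minutes" maps to a Zabbix trigger expression of the form min(/host/key,5m)>80, which fires only when the lowest value in the window is still above the threshold. The sketch below just assembles such strings so you can paste them into the trigger form. The function name trigger_expr and the item key shown are placeholders, so substitute the real key from your template.

```shell
# Build a "stayed above threshold for the whole window" trigger expression.
# $1 = host or template name, $2 = item key, $3 = time window, $4 = threshold
trigger_expr() {
    printf 'min(/%s/%s,%s)>%s\n' "$1" "$2" "$3" "$4"
}

# Placeholder key: replace it with the temperature item key from your template.
trigger_expr 'gpu-node-01' 'nvml.device.temperature[<gpu-uuid>]' 5m 80
# prints: min(/gpu-node-01/nvml.device.temperature[<gpu-uuid>],5m)>80
```

Using min() over the window, rather than comparing only the latest value, avoids alerts on short spikes.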

Bind actions like email or webhook under:

Alerts → Actions → Trigger actions

That’s it, you are done. You can now monitor every GPU in your infrastructure.

FAQs

Do I need Zabbix agent2 for GPU monitoring?

For NVIDIA GPU monitoring with the official template, you must use Zabbix agent2 because the NVIDIA GPU plugin is only available for agent2, not the classic agent.

How many GPU servers can a single Zabbix server handle?

With proper tuning, a single Zabbix server can monitor dozens of GPU nodes and hundreds of GPUs, but very large clusters may benefit from Zabbix proxies and a more powerful database backend.

Is it safe to run Zabbix on the same server as my GPU workloads?

It is technically possible in small environments, but it is recommended to run the Zabbix server on a separate VM or host, so monitoring is not impacted by GPU workloads.

Conclusion

By combining Zabbix 7.4 on Ubuntu 24.04 with Zabbix agent2 and the official NVIDIA GPU template, you get a powerful, centralized monitoring stack tailored for dedicated GPU servers.

This setup lets you automatically discover all GPUs, track key metrics such as utilization, memory, temperature, and power, and build dashboards and alerts that help you detect problems early, protect your hardware, and keep AI, rendering, and HPC workloads running smoothly.

We hope you enjoy this Zabbix GPU Monitoring guide. Subscribe to our X and Facebook channels to get the latest updates and articles on GPU monitoring.
