Troubleshooting NUMA Latency on Multi-Socket Servers
Modern high-performance infrastructure usually relies on multi-socket hardware to deliver massive core counts and memory capacity. However, more resources do not guarantee better speed. On a Dedicated Server with multiple CPUs, the Non-Uniform Memory Access (NUMA) architecture can degrade performance if the operating system and software are not NUMA-aware. This guide shows you how to identify NUMA performance issues and apply the most effective fixes.
If you need high-performance Dedicated Servers for your critical applications, Perlod Hosting delivers the raw power, but understanding NUMA is key to getting every bit of speed out of them.
NUMA Architecture: How Interconnects Affect Latency
In a NUMA system, each CPU socket has its own local memory. Accessing this local memory is fast, but accessing memory attached to a different CPU (remote memory) is slower and passes through an interconnect link like Intel UPI or AMD Infinity Fabric.
NUMA performance issues occur when a process running on CPU 0 is forced to constantly fetch data from the memory of CPU 1, resulting in high latency and saturation.
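As a quick sanity check, lscpu summarizes how many NUMA nodes the kernel sees and which cores belong to each one (assuming the standard util-linux lscpu tool, which ships with virtually every distribution):

lscpu | grep -i numa
# Typical fields: "NUMA node(s): 2" and "NUMA node0 CPU(s): 0-15,32-47"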
Symptoms of NUMA Performance Issues
NUMA performance issues often look like your server is overloaded, but they have unique warning signs.
Inconsistent latency:
The most common sign is inconsistent latency; an application might process some requests instantly (local memory hit) while others lag significantly (remote memory access).
Note: If you see latency spikes but your memory allocation looks balanced, the issue might not be NUMA-related; check our guide on Detecting IRQ Imbalance Latency on Linux to find interrupt issues.
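To reproduce the local-versus-remote gap yourself, one rough approach (assuming the sysbench benchmark tool is installed) is to pin the same memory test first to local memory and then to the other node's memory, and compare the reported throughput:

numactl --cpunodebind=0 --membind=0 sysbench memory --memory-total-size=4G run
numactl --cpunodebind=0 --membind=1 sysbench memory --memory-total-size=4G run

A noticeably lower MiB/sec figure in the second run is the remote-access penalty in action.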
High System CPU usage:
Also, you may see high system CPU usage (%sys) even when the user load is moderate. This happens because the operating system is busy managing memory pages across nodes or waiting on the interconnect. In database environments, you may also see the database freeze as the OS tries to reclaim memory from a local node, even though gigabytes are free on a remote node.
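To see whether that %sys time is concentrated on the cores of one socket, mpstat from the sysstat package (which you may need to install) prints per-core utilization; compare the busy cores against the per-node CPU lists reported by numactl --hardware:

mpstat -P ALL 1 5    # per-CPU usage, 1-second intervals, 5 samples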
Resource imbalance:
Another symptom of NUMA performance issues is resource imbalance. One CPU socket might be at 100% load with its attached memory full, while the second socket sits idle with empty RAM. If your application is not pinned or balanced, the OS scheduler might keep all threads on a single node, overloading that one socket while the rest of your server sits idle.
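A quick way to spot this imbalance is to compare the free memory reported for each node (the same numactl tool used later in this guide also prints per-node totals):

numactl --hardware | grep -i free
# One "free" line per node; a nearly full node next to a mostly empty one is the classic imbalance pattern.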
Now that you understand the symptoms of NUMA performance issues, proceed to the next step to see how to detect them.
Find NUMA Performance Issues
At this point, you can verify your hardware topology and monitor how memory is actually being allocated. This will help you detect the performance issues.
Identify if your server is actually using NUMA by running the command below:
numactl --hardware
You will see the available nodes and the distance between them. A distance of 10 represents local memory, while 20, 21, or higher represents remote memory.
Note: If you see only node 0, your server is UMA (Uniform Memory Access) or has NUMA disabled in BIOS.
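On a typical two-socket machine the output looks roughly like the sketch below (CPU lists, sizes, and free amounts will differ on your hardware):

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 64215 MB
node 0 free: 41250 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 64508 MB
node 1 free: 2113 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

The 21 entries in the distance table are the remote-access cost described earlier.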
To detect the issues, you can use the numastat command with the -m flag to show system-wide memory usage per node in megabytes:
numastat -m
To see if processes are failing to get local memory, use the -z flag to skip zero columns and watch for the numa_miss and numa_foreign rows:
watch -n 1 numastat -z
- numa_hit: Good. Memory was allocated on the intended local node.
- numa_miss: Bad. Memory was allocated on this node even though the process preferred a different node; this node is absorbing another node's overflow.
- numa_foreign: Bad. The process wanted memory on this node, but it was full, so the allocation was pushed to another node.
High growth in miss and foreign counters confirms a locality issue.
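If you prefer a simple before-and-after comparison over the live watch view, a minimal sketch is to snapshot the numastat counters, wait a few seconds under load, and diff the results:

numastat > /tmp/numastat.before
sleep 10
numastat > /tmp/numastat.after
diff /tmp/numastat.before /tmp/numastat.after
# Growing numa_miss / numa_foreign values between the snapshots confirm a locality problem.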
Once you have identified the NUMA performance issues, you can implement some strategies to fix them. To do this, proceed to the next step.
Resolving NUMA Bottlenecks
Fixing NUMA bottlenecks is all about taking control back from the operating system. Instead of letting the server guess where to store data, you can use these strategies to force efficient memory usage.
Note: Optimizing memory locality is the first step in high-performance tuning. To fully maximize your server’s throughput after fixing NUMA, you should also apply Linux Kernel Network Tuning.
Disable Zone Reclaim Mode
By default, some Linux kernels set vm.zone_reclaim_mode to 1, which forces the OS to reclaim memory from the local node rather than using free remote memory. This causes high latency spikes.
For most database and application workloads, this should be disabled so the OS allocates from free remote memory instead of stalling to reclaim local pages.
Check the current mode with the command below:
sysctl vm.zone_reclaim_mode
To disable zone reclaim mode, you must set it to 0:
sysctl -w vm.zone_reclaim_mode=0
To make it permanent, edit the /etc/sysctl.conf file and add the following line:
vm.zone_reclaim_mode = 0
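If you prefer a one-liner, something like the following (run with sudo or as root) appends the setting and reloads it immediately, so no reboot is needed:

echo 'vm.zone_reclaim_mode = 0' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p                  # reload /etc/sysctl.conf and apply the setting
sysctl vm.zone_reclaim_mode     # confirm it now reports 0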
Bind Processes to Specific CPUs
For maximum performance, you can force an application to run only on a specific CPU socket and use only its local memory. This option is great for running multiple instances of a database, for example, a MySQL instance per socket.
For example, run myapp only on CPU node 0 and use only memory from node 0 with the command below:
numactl --cpunodebind=0 --membind=0 myapp
This guarantees 100% local memory hits but limits the application to the resources of a single socket.
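To confirm that the binding actually took effect, you can inspect the running process. Here myapp is just a placeholder name; adjust the pgrep pattern for your own service:

pid=$(pgrep -o myapp)       # oldest matching PID
taskset -cp "$pid"          # the allowed CPU list should contain only node 0's cores
numastat -p "$pid"          # per-node memory breakdown; nearly all pages should sit under Node 0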
Spread RAM Across All CPUs
If an application is large, multi-threaded, and needs more RAM than a single socket provides, like MongoDB or large in-memory caches, restricting it to one node will cause NUMA performance issues. Instead, you can use interleaving, which spreads memory pages across all nodes (Round-Robin).
For example, run myapp spreading memory across all available nodes with the command below:
numactl --interleave=all myapp
While the average access is slower than a pure local-memory hit, interleaving keeps latency predictable: no request gets stuck behind an all-remote allocation, and memory traffic is spread across the memory controllers and channels of every socket, increasing aggregate bandwidth.
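To verify that the interleave policy is really applied, the kernel exposes the per-mapping policy in /proc (again, myapp is only a placeholder process name):

pid=$(pgrep -o myapp)
grep -m1 interleave /proc/"$pid"/numa_maps   # mappings should be tagged "interleave" across all nodes, e.g. interleave:0-1

Running numastat -p on the same PID should also show the allocation split roughly evenly across nodes.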
Note: For architectures that require scaling memory capacity beyond the physical limits of a single server, explore our tutorial on Memory Disaggregation.
FAQs
Which one should I use? CPU Binding or Memory Interleaving?
Use binding for standalone databases like MySQL that can fit entirely inside one CPU’s memory. Use interleaving for large apps like MongoDB or Java that need more RAM than a single CPU socket offers.
Is disabling zone reclaim mode safe?
Yes. It’s almost always better to access slower remote memory than to have your application freeze while the OS tries to clean up local memory.
Can I fix NUMA issues in the BIOS?
Yes. Enable Node Interleaving in the BIOS to make all memory access uniform. This is the easiest solution and avoids the need for complex commands, though it offers lower peak performance than manual tuning.
Conclusion
Modern Dedicated Server hardware offers huge power, but raw CPU power means nothing if your data is stuck in traffic between sockets. NUMA performance issues are silent killers; they don’t crash your server, they just make it sluggish. Unlock your hardware’s true performance by prioritizing memory locality. Whether you pin for performance or interleave for stability, the goal is control. Don’t let the operating system guess; tell it exactly where to handle your data.
We hope you enjoyed this guide. Subscribe to our X and Facebook channels to get the latest updates and articles to help you increase your performance.