
NVMe Optimization for Linux Servers
NVMe SSDs are far faster than traditional hard drives, but a default Linux setup does not always get the most out of them. Software settings such as power management, the I/O scheduler, block-layer parameters, and the filesystem can add latency and hold back throughput. For servers and applications where performance matters, you should measure, tune, and re-measure these settings to avoid slowdowns. In this article, you will learn how to optimize NVMe storage on Linux servers.
You can try PerLod’s high-performance dedicated servers, which come with fast NVMe storage and give you control over every setting. They are a good fit for demanding databases, virtual machines, or any application that needs fast response times.
Prerequisites for NVMe Optimization for Linux Servers
Before we dive into the steps, back up any important data. Keep in mind that some benchmarks (for example, fio write tests against a raw device) overwrite data, so never run them against a device or filesystem that holds data you care about.
You must have root access to your Linux server and be able to reboot it and edit the bootloader (GRUB) configuration if needed.
You also need tools for setting NVMe features, benchmarking, and monitoring. Install these packages on your Linux server with the following commands.
On Ubuntu/Debian:
sudo apt update
sudo apt install nvme-cli fio util-linux iotop sysstat numactl -y
On RHEL/Alma/Rocky/CentOS:
sudo dnf update -y
sudo dnf install nvme-cli fio util-linux iotop sysstat numactl -y
Discover and List NVMe Devices in Linux
First, list your NVMe devices and check their basic health and capability metrics. To do this, run the following commands.
List NVMe devices with:
nvme list
Check controller capabilities such as queues, MDTS, etc.:
nvme id-ctrl -H /dev/nvme0
Check namespace info, including LBA formats and their RP (relative performance) ranking:
nvme id-ns -H /dev/nvme0n1
Run a health check for temperature, media errors, and percentage used:
nvme smart-log /dev/nvme0
Check per-device service times and queue depth:
iostat -x 2 10
Before you start tuning your system, it’s helpful to get a quick baseline of your SSD’s performance.
For 4K random read latency, direct I/O, and short run:
fio --name=randread --filename=/dev/nvme0n1 --rw=randread --bs=4k \
--iodepth=64 --numjobs=1 --ioengine=io_uring --direct=1 --time_based=1 --runtime=20
For 128K sequential read throughput:
fio --name=seqread --filename=/dev/nvme0n1 --rw=read --bs=128k \
--iodepth=64 --numjobs=1 --ioengine=io_uring --direct=1 --time_based=1 --runtime=20
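If you want to compare results later, you can optionally have fio save its output to a file; the file name below is just an example:
fio … --output-format=json --output=baseline-randread.json #same job options as above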
NVMe Power Management for Low Latency (APST)
APST (Autonomous Power State Transition) is the NVMe power-saving feature. For the lowest and most consistent latency, you can disable it, which prevents the drive from dropping into deep power states that add small wake-up delays.
You can do this by adding a single parameter to your system’s boot loader. Here’s how to do it for GRUB, which is the most common boot loader for Linux:
sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="nvme_core.default_ps_max_latency_us=0 /' /etc/default/grub
sudo update-grub
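On RHEL-based systems (Alma, Rocky, CentOS), regenerate the GRUB configuration instead of running update-grub; the output path below is the usual one, but it can differ on some UEFI setups:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg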
Once you are done, remember to reboot the system to apply the changes.
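After the reboot, you can verify that the parameter is active and, if your drive exposes it, inspect the APST feature (feature ID 0x0c) directly:
cat /proc/cmdline
nvme get-feature /dev/nvme0 -f 0x0c -H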
Choose the Right I/O Scheduler for NVMe
Another key setting that affects performance is the I/O Scheduler. This controls how the operating system organizes and sends data requests to your drive.
On modern distributions such as Red Hat, the default scheduler for NVMe drives is usually none, which is ideal for most workloads. If your system uses a different scheduler, such as kyber or bfq, switching to none can reduce overhead and lower latency. You can check and change the scheduler with the following commands.
See available schedulers and the current one:
cat /sys/block/nvme0n1/queue/scheduler
To change to none, you can run:
echo none | sudo tee /sys/block/nvme0n1/queue/scheduler
Or, for mixed and contended write workloads:
echo mq-deadline | sudo tee /sys/block/nvme0n1/queue/scheduler
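If the server has more than one NVMe namespace, a small loop (assuming the standard nvmeXnY device naming) applies the same scheduler to all of them:
for q in /sys/block/nvme*n*/queue/scheduler; do echo none | sudo tee "$q"; done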
NVMe CPU Affinity Tuning (rq_affinity)
When an I/O operation completes, the system sends a notification. By default, this “completion” might be handled by any available CPU core, which can cause cache inefficiencies as data is passed between cores.
You can set rq_affinity to 2, which forces completion to be processed on the same CPU core that submitted the request. This improves cache locality and can reduce latency for high-performance NVMe drives. To do this, you can run:
echo 2 | sudo tee /sys/block/nvme0n1/queue/rq_affinity
This command only applies the setting temporarily. To make it permanent, you can create a udev rule that also applies your scheduler setting automatically every time the system boots:
sudo nano /etc/udev/rules.d/60-nvme-tuning.rules
Add:
ACTION=="add|change", KERNEL=="nvmen", ATTR{queue/rq_affinity}="2", ATTR{queue/scheduler}="none"
Reload to apply the changes:
sudo udevadm control --reload && sudo udevadm trigger
Optimizing NVMe for NUMA Systems
On servers with multiple CPU sockets (NUMA systems), your NVMe drive is physically connected to one specific CPU. To achieve the lowest latency and highest performance, both the drive’s interrupt and your application should run on the CPU cores that are local to that drive. This avoids slow communication across the motherboard.
Here we will show you how to pin the NVMe’s interrupts to the correct CPUs and launch your application on the same NUMA node.
First, find the NVMe drive’s interrupt requests (IRQs):
grep -i nvme /proc/interrupts
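To see which NUMA node and CPUs are local to the drive (the paths below assume the controller is nvme0; a value of -1 means the platform reports no specific node), you can check the PCI device attributes and the overall node layout:
cat /sys/class/nvme/nvme0/device/numa_node
cat /sys/class/nvme/nvme0/device/local_cpulist
numactl --hardware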
Then, for example, pin all nvme0 IRQs to CPUs 0-3 (choose CPUs that sit on the drive’s local node):
for i in $(grep -i nvme0 /proc/interrupts | awk -F: '{print $1}'); do
echo 0-3 | sudo tee /proc/irq/$i/smp_affinity_list
done
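Note that recent kernels manage NVMe interrupt affinity automatically, so these writes may be rejected on some systems. You can check which affinity actually took effect with a quick loop:
for i in $(grep -i nvme0 /proc/interrupts | awk -F: '{print $1}'); do
echo "IRQ $i -> $(cat /proc/irq/$i/effective_affinity_list 2>/dev/null)"
done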
Next, run your DB or app threads on the same NUMA node:
numactl --cpunodebind=0 --membind=0 your-binary …
Fine-Tuning NVMe Block Layer Settings
The Linux block layer has a few more settings that can impact performance. The two main ones are “read-ahead,” which preloads data, and the “maximum request size.”
- Read-ahead helps read large files sequentially, but can waste resources on random workloads.
- Maximum request size should generally align with your drive’s preferred I/O size for efficiency.
Here we will show you how to check and adjust the NVMe block layer settings. Remember to use these carefully; the defaults are often good.
Read-ahead (read_ahead_kb): This setting tells the kernel how much extra data to read in advance. A small value like 128 KB is ideal for random I/O like databases, while a large value, such as 4096 KB, benefits big sequential reads.
echo 128 | sudo tee /sys/block/nvme0n1/queue/read_ahead_kb #Random-IO
echo 4096 | sudo tee /sys/block/nvme0n1/queue/read_ahead_kb #Large sequential reads
Maximum Request Size (max_sectors_kb): This caps the size of a single I/O request. It should be a multiple of your drive’s optimal I/O size and must not exceed the hardware’s absolute maximum.
Check and set if needed:
cat /sys/block/nvme0n1/queue/optimal_io_size
cat /sys/block/nvme0n1/queue/max_hw_sectors_kb
sudo sh -c 'echo 1024 > /sys/block/nvme0n1/queue/max_sectors_kb' #example: 1024 KB
Manage NVMe TRIM for Long-Term Performance
To keep your NVMe drive performing well over time, you should periodically run a TRIM command. This helps the drive manage its storage space efficiently. It is recommended to schedule TRIM as a weekly batch job. Running TRIM on every file deletion can cause performance overhead.
To check for TRIM support and enable Linux’s built-in weekly fstrim service, you can run the commands below:
lsblk --discard #a non-zero DISC-GRAN means TRIM is supported
systemctl enable --now fstrim.timer
systemctl status fstrim.timer
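You can also trigger a one-off TRIM right away and see how much space was trimmed on each mounted filesystem:
sudo fstrim -av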
Choosing and Mounting Filesystems for NVMe
The filesystem is the final layer between your application and the NVMe drive. Selecting an appropriate filesystem and mount options ensures you don’t add unnecessary overhead.
- XFS is a robust, high-performance choice and the default filesystem on RHEL-based servers. Its defaults work well on NVMe drives. The main optional tuning is adding noatime to stop recording file access times, which reduces write operations.
- ext4 is also a solid choice. The same principle applies: use the defaults and consider the noatime option. Avoid using the discard mount option for continuous TRIM, as the scheduled fstrim.timer is more efficient.
Sticking with default options is generally safe and effective for both filesystems.
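As a minimal example (the device and mount point below are placeholders; adjust them to your layout), creating and mounting an XFS filesystem with noatime looks like this:
sudo mkfs.xfs /dev/nvme0n1p1
sudo mkdir -p /data
sudo mount -o noatime /dev/nvme0n1p1 /data
# Persist it in /etc/fstab, for example:
# /dev/nvme0n1p1  /data  xfs  defaults,noatime  0 0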
Advanced NVMe Setup: Partition Alignment and Sector Size
For optimal performance, it’s important to set up your NVMe drive correctly from the start. This involves two key steps:
- Partition Alignment: Ensure partitions start on the right boundaries to match the drive’s internal structure.
- Sector Size: Some drives support larger, more efficient data block sizes (like 4KB instead of 512B).
Using the correct settings here can provide a noticeable performance boost.
Warning: Changing the sector size requires a full format and erases all data.
Create GPT partitions on 1 MiB boundaries:
sudo parted -s /dev/nvme0n1 mklabel gpt
sudo parted -s /dev/nvme0n1 mkpart primary 1MiB 100%
sudo parted -s /dev/nvme0n1 align-check optimal 1
This ensures optimal alignment.
LBA size: Check RP (relative performance) and current format:
nvme id-ns -H /dev/nvme0n1 | grep -E 'LBA Format|Relative'
To switch, for example, select LBAF index 1:
sudo nvme format /dev/nvme0n1 --lbaf=1
Note: LBA formats and NVMe format semantics are defined in the nvme-cli docs or the NVMe spec.
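After a format, you can confirm that the namespace is using the new LBA size (the exact output wording may vary between nvme-cli versions):
cat /sys/block/nvme0n1/queue/logical_block_size
nvme id-ns -H /dev/nvme0n1 | grep 'in use'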
Advanced NVMe Controller Tuning
Beyond OS and filesystem settings, the NVMe controller itself has programmable features that influence its operation. Key features include:
- Queue Counts: Viewing how many command queues are allocated.
- Interrupt Coalescing: Balancing latency against CPU overhead by controlling how long the drive waits to notify the CPU.
- Volatile Write Cache: a potential performance boost, but one that risks data loss on power failure.
Warning: Changing these, especially the write cache, can lead to system instability or data loss.
Queue counts:
nvme get-feature /dev/nvme0 -f 7 -H
Interrupt coalescing:
# Example: aggregation time 0x32 and threshold 0x0a packed into one value (check the NVMe spec for the exact field encoding and units)
nvme set-feature /dev/nvme0 -f 0x08 -v 0x320a
nvme get-feature /dev/nvme0 -f 0x08 -H
Volatile Write Cache:
nvme get-feature /dev/nvme0 -f 0x06 -H # if supported by your SSD
nvme set-feature /dev/nvme0 -f 0x06 -v 1 # enable (risk: data loss on power loss)
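If you need to revert, you can switch the volatile write cache back off, assuming your SSD supports toggling it:
nvme set-feature /dev/nvme0 -f 0x06 -v 0 # disable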
Make NVMe Tuning Changes Permanent
Most of the sysfs changes above are lost after a reboot. To make them permanent, you can create a udev rule; udev then reapplies your chosen settings automatically every time an NVMe drive is detected, such as at boot.
The following rule sets the scheduler to none, rq_affinity to 2, and read-ahead to 128 KB for all NVMe namespaces.
sudo nano /etc/udev/rules.d/60-nvme-tuning.rules
Add:
ACTION=="add|change", KERNEL=="nvmen", \
ATTR{queue/rq_affinity}="2", \
ATTR{queue/scheduler}="none", \
ATTR{queue/read_ahead_kb}="128"
Then, reload to apply the changes:
sudo udevadm control --reload
sudo udevadm trigger
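You can then confirm that the rule was applied to your drive:
cat /sys/block/nvme0n1/queue/scheduler
cat /sys/block/nvme0n1/queue/rq_affinity
cat /sys/block/nvme0n1/queue/read_ahead_kb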
Validate NVMe Changes | Re-measure and Compare
The only way to know if your tuning was successful is to measure again. After applying your persistent settings, you must re-run your original benchmarks to compare the results.
You can rerun your fio baselines and operational metrics:
iostat -x 2 10
fio … #same jobs as baseline
nvme smart-log /dev/nvme0
Look for improvements in average latency, 99th percentile (tail) latency, and IOPS.
Applying a Low-Latency NVMe Profile with a Script
You can also apply the recommended scheduler, CPU affinity, read-ahead, and TRIM settings in one go with the script below, which combines the most impactful and safest tuning steps.
#!/usr/bin/env bash
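# Example device; adjust dev to match your NVMe namespace.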
dev=/dev/nvme0n1
sys=/sys/block/$(basename $dev)/queue
# Lean path & CPU-local completions
echo none | sudo tee $sys/scheduler
echo 2 | sudo tee $sys/rq_affinity
# Random-read friendly read-ahead
echo 128 | sudo tee $sys/read_ahead_kb
# Enable weekly TRIM, not continuous discard
sudo systemctl enable --now fstrim.timer
# Show summary
echo "Scheduler: $(cat $sys/scheduler)"
echo "rq_affinity: $(cat $sys/rq_affinity)"
echo "read_ahead_kb: $(cat $sys/read_ahead_kb)"
Important Considerations and Warnings for NVMe Tuning
Before you conclude your tuning process, keep these critical points in mind:
- Firmware/BIOS: Outdated SSD firmware is a common source of latency issues. Keep your drive firmware up to date.
- APST Trade-off: Disabling power management (APST) increases performance but also generates more heat. Ensure your system has suitable cooling.
- Data Loss Warning: Changing the LBA format (sector size) is a destructive operation that erases all data on the drive. Always back up first.
- Workload Specificity: Always validate tuning against your actual application.
FAQs
Should I always disable APST for NVMe?
Not necessarily. Disabling APST improves latency but increases power draw and temperature. For servers prioritizing responsiveness, it’s worth it; for laptops, it’s not.
How often should I run NVMe TRIM (fstrim)?
Once per week is ideal for most use cases.
Does formatting the NVMe drive to 4K sectors improve performance?
Only if your workload benefits from fewer, larger I/O operations. Always benchmark before reformatting.
Conclusion
NVMe Optimization for Linux Servers is about understanding how your hardware and kernel interact. By aligning CPU cores with NVMe interrupts, using the correct I/O scheduler, disabling NVMe power saving, and scheduling smart TRIM operations, you can unlock the true potential of your NVMe drives.
If you’re deploying on a managed or cloud environment, choose hosting like PerLod Hosting that offers native NVMe storage and kernel-level control.
We hope you found this guide useful. Subscribe to our X and Facebook channels to get the latest articles on optimizing your Linux servers.