vLLM Optimization Guide: GPU Memory, Quantization, and Throughput Tuning
Running large language models in production requires both speed and stability. If your setup suffers from out-of-memory errors or low tokens per second, a handful of vLLM settings usually explains why. This guide walks through the ones that matter most.
If you have already followed our basic setup to host an OpenAI-compatible API server, you are ready to push your hardware further. Let’s look at how to tune your parameters for the best throughput.
The Ultimate vLLM Optimization Guide
Before changing any settings, it is important to know which parameters actually affect performance. This section covers the settings that help you get the most from your hardware, increase your tokens per second, and keep your server stable under heavy traffic.
1. GPU Memory Utilization
Memory management is the first thing to tune. By default, vLLM sets --gpu-memory-utilization to 0.9, so the server claims up to 90% of your GPU VRAM for model weights, activations, and the KV cache. The remaining 10% is left as headroom for other processes and temporary allocations.
To make room for longer contexts and more concurrent requests, raise the --gpu-memory-utilization value. Many users push it to 0.95 on a GPU that is dedicated to inference. The extra memory goes to the KV cache, which lets vLLM keep more sequences in flight and usually improves tokens per second. If you see out-of-memory errors at startup or under load, lower the value again.
vllm serve "meta-llama/Meta-Llama-3-8B" --gpu-memory-utilization 0.95
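To see why a few percentage points matter, here is a rough back-of-the-envelope sketch in Python. The 24 GiB card and the roughly 15 GiB of FP16 weights are assumed figures for illustration, and the estimate ignores activation memory and framework overhead, so the real KV cache size reported by vLLM will be somewhat smaller.

```python
def kv_cache_budget_gib(total_vram_gib: float,
                        gpu_memory_utilization: float,
                        weights_gib: float) -> float:
    """Rough KV cache budget: the slice vLLM may claim, minus model weights.

    Simplification: activation memory and framework overhead are ignored.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    usable = total_vram_gib * gpu_memory_utilization
    return max(usable - weights_gib, 0.0)


# Assumed numbers: a 24 GiB card and ~15 GiB of FP16 weights.
for util in (0.90, 0.95):
    budget = kv_cache_budget_gib(24.0, util, 15.0)
    print(f"utilization={util:.2f} -> ~{budget:.1f} GiB for KV cache")
```

Under these assumptions, moving from 0.90 to 0.95 grows the KV cache budget from about 6.6 GiB to about 7.8 GiB, which is a much larger relative gain than the five-point change suggests.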
2. Context Length Limiting
Another useful step is limiting the maximum context length when your workload does not need the model’s full window. Setting --max-model-len lowers the amount of KV cache a single request can demand, which reduces memory pressure and lets more requests share the available GPU memory, especially on smaller cards.
For example, if your application only needs 4096 tokens, there is no reason to keep an 8192-token limit.
vllm serve "meta-llama/Meta-Llama-3-8B" \
--gpu-memory-utilization 0.95 \
--max-model-len 4096
This setting is helpful for chatbots, internal tools, and API workloads where prompts are usually short. It is one of the simplest ways to improve stability before moving to bigger hardware.
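A quick estimate shows how context length translates into KV cache memory. The sketch below uses the published Llama 3 8B architecture (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache. The 6 GiB cache budget is an assumed figure, and the helper names are made up for this example.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies (keys + values, all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def full_length_sequences(cache_budget_bytes: int, max_model_len: int,
                          bytes_per_token: int) -> int:
    """How many requests at the maximum length fit in the cache budget."""
    return cache_budget_bytes // (max_model_len * bytes_per_token)


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
budget = 6 * 1024**3  # assumed 6 GiB KV cache budget

for max_len in (8192, 4096):
    fits = full_length_sequences(budget, max_len, per_token)
    print(f"max_model_len={max_len}: {fits} full-length sequences fit")
```

With these numbers each token costs 128 KiB of cache, so halving the limit from 8192 to 4096 doubles the number of worst-case requests that fit. vLLM allocates cache in pages as tokens arrive, so typical short requests use far less than this worst case.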
3. Batching Behavior
To get high throughput, you need to configure batching correctly. Two flags control it: --max-num-seqs limits how many requests are processed concurrently, while --max-num-batched-tokens limits the total number of tokens processed in a single scheduler step.
A practical tip is to set --max-num-batched-tokens to 8192 or 16384. This lets the scheduler pack larger batches into each step, which usually gives you higher tokens per second at the cost of somewhat higher latency per request:
vllm serve "meta-llama/Meta-Llama-3-8B" --max-num-seqs 512 --max-num-batched-tokens 8192
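If you generate launch commands from a script, it helps to sanity-check the batching limits first. The sketch below is a hypothetical helper, not part of vLLM. It assumes that without chunked prefill a whole prompt must fit into one step, so the token budget should be at least as large as the maximum model length.

```python
import shlex


def build_serve_command(model: str, max_num_seqs: int,
                        max_num_batched_tokens: int, max_model_len: int,
                        chunked_prefill: bool = False) -> str:
    """Assemble a `vllm serve` command after basic sanity checks."""
    if max_num_seqs < 1 or max_num_batched_tokens < 1:
        raise ValueError("batching limits must be positive")
    # Assumption: without chunked prefill, a whole prompt is scheduled in
    # one step, so the token budget must cover the longest allowed prompt.
    if not chunked_prefill and max_num_batched_tokens < max_model_len:
        raise ValueError("max_num_batched_tokens is smaller than max_model_len")
    args = ["vllm", "serve", model,
            "--max-num-seqs", str(max_num_seqs),
            "--max-num-batched-tokens", str(max_num_batched_tokens),
            "--max-model-len", str(max_model_len)]
    if chunked_prefill:
        args.append("--enable-chunked-prefill")
    return shlex.join(args)


print(build_serve_command("meta-llama/Meta-Llama-3-8B", 512, 8192, 8192))
```

Catching a mismatched pair of limits in a script like this is cheaper than discovering it when the server refuses to start.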
4. Chunked Prefill
After batching, you should also look at how vLLM handles very large prompts. Enabling --enable-chunked-prefill lets the server process long inputs in smaller pieces instead of letting one huge prompt take over the queue, which makes request handling smoother under mixed workloads.
vllm serve "meta-llama/Meta-Llama-3-8B" \
--max-num-seqs 512 \
--max-num-batched-tokens 8192 \
--enable-chunked-prefill
This is especially useful when your traffic includes both short chat requests and long document prompts. In that case, chunked prefill can help protect short requests from latency spikes and keep throughput more stable across users.
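The idea behind chunking can be sketched in a few lines of Python. This is a simplified illustration, not vLLM's scheduler: it assumes the whole per-step token budget goes to a single prompt, whereas the real scheduler shares that budget with other requests in the batch.

```python
def prefill_chunks(prompt_len: int, token_budget: int) -> list[int]:
    """Split a prompt of `prompt_len` tokens into per-step chunk sizes."""
    if prompt_len < 0 or token_budget < 1:
        raise ValueError("prompt_len must be >= 0 and token_budget >= 1")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, token_budget)
        chunks.append(step)
        remaining -= step
    return chunks


# An assumed 20,000-token document with an 8192-token step budget.
print(prefill_chunks(20_000, 8192))
```

Here the long prompt is spread over three steps (8192, 8192, and 3616 tokens), and other requests can make progress between those steps instead of waiting for the entire prompt to finish.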
5. Quantization Options
If your model is too big for your VRAM, quantization shrinks it. Here we compare the FP8 and AWQ formats.
FP8 is usually the fastest and most efficient choice if you have a modern GPU with hardware FP8 support, such as the H100 or RTX 4090. It roughly halves weight memory compared with FP16 and can boost throughput by up to 1.6x. If you cannot use FP8, 4-bit AWQ is a common alternative and is generally faster than standard GPTQ.
Always test your quantized models to ensure the output quality remains high.
vllm serve "neuralmagic/Meta-Llama-3-8B-Instruct-FP8" --quantization fp8
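To get a feel for the savings, you can estimate weight memory from the parameter count and the bytes stored per parameter. The sketch below uses a round 8 billion parameters as an assumption and ignores quantization scales and any layers kept in higher precision, so real checkpoints are somewhat larger.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Approximate weight size in GB (10^9 bytes) for a storage format.

    Ignores quantization scales/zero-points and layers left unquantized.
    """
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


# Assumed model size: a round 8 billion parameters.
for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: ~{weight_memory_gb(8e9, fmt):.0f} GB")
```

For an 8B model this works out to roughly 16 GB in FP16, 8 GB in FP8, and 4 GB in a 4-bit format such as AWQ, and every gigabyte saved on weights becomes available for the KV cache.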
6. Model Sizing and Tensor Parallelism
When one GPU is not enough, you can split the model using tensor parallelism. A common rule is to use the smallest number of GPUs that fits your model.
For example, splitting a model across four GPUs when two are enough can actually lower your tokens per second, because the GPUs must synchronize at every layer and that inter-GPU communication overhead grows with the GPU count. Add more GPUs only when the model strictly requires the extra memory.
vllm serve "meta-llama/Meta-Llama-3-70B" --tensor-parallel-size 2
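The "smallest number of GPUs that fits" rule can be written down as a small search. The helper below is a hypothetical sketch: the per-GPU KV cache reserve, the 48 GB cards, and the 70 GB of FP8 weights are all assumed values, and the head-count check reflects the requirement that attention heads divide evenly across the GPUs.

```python
from typing import Optional


def smallest_tp_size(weights_gb: float, gpu_gb: float,
                     gpu_memory_utilization: float = 0.9,
                     kv_reserve_gb: float = 6.0,
                     num_attention_heads: int = 64,
                     candidates: tuple[int, ...] = (1, 2, 4, 8)) -> Optional[int]:
    """Return the smallest GPU count whose per-GPU share fits, else None."""
    usable = gpu_gb * gpu_memory_utilization
    for tp in candidates:
        # The attention heads must divide evenly across the GPUs.
        if num_attention_heads % tp != 0:
            continue
        if weights_gb / tp + kv_reserve_gb <= usable:
            return tp
    return None


# Assumed case: ~70 GB of FP8 weights on 48 GB cards.
print(smallest_tp_size(weights_gb=70.0, gpu_gb=48.0))
```

In this assumed case one card is too small, two cards fit with room for the KV cache, and going to four would only add communication overhead.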
When to Upgrade Your Hardware
If your GPU utilization is maxed out and requests are queuing, it is time to upgrade your hardware. For businesses scaling up, moving to a Dedicated Server with GPU provides the power and unshared VRAM needed for heavy traffic. With dedicated enterprise cards, you get stable tokens per second without the shared-resource limits of smaller setups.
Conclusion
Tuning your inference server is an ongoing process. By adjusting memory limits, expanding batch sizes, and using FP8 quantization, you can maximize your hardware. When you are ready to deploy models in production with guaranteed uptime, explore reliable AI Hosting solutions to keep your applications running smoothly.
We hope you found this guide useful. For more information, check the official vLLM documentation.
FAQs
What is the best GPU memory setting for vLLM?
On a GPU dedicated to inference, setting --gpu-memory-utilization to 0.95 is a common choice. It gives you a larger KV cache while keeping a 5% buffer to reduce the risk of out-of-memory crashes.
How do I increase tokens per second for vLLM?
Increase --max-num-batched-tokens to 8192 or higher. Larger batches keep the GPU busy and improve overall throughput, as long as you have enough memory for the KV cache.
Which quantization method is fastest for vLLM?
FP8 is currently the fastest and most memory-efficient option on newer GPUs. If you are using INT4 models, AWQ usually performs slightly better than GPTQ.