How to Size a GPU Server for RAG: Embeddings, Vector DB, and Inference in One Stack
Many engineering teams assume that buying the biggest graphics card on the market will solve all their AI infrastructure problems. However, once you work out how to size a GPU server for RAG, you quickly realize it is a full-stack problem. A Retrieval-Augmented Generation (RAG) pipeline is not a single application; it is a workflow made up of multiple moving parts, including document embeddings, vector search, reranking, and language model inference.
If your vector database does not have enough RAM, retrieval gets slow, and your GPU ends up waiting for data. To size a GPU server for RAG, you need to plan the full stack, including compute, memory, and storage. This guide explains what each layer needs and shows practical sizing examples for different use cases.
Why the GPU Alone Is Not Enough for RAG
A common mistake when learning how to size a GPU server for RAG is focusing only on the language model. While the LLM requires massive compute power to generate text, the quality and speed of a RAG system depend heavily on what happens before the LLM even sees the prompt.
The RAG process follows a strict sequence of events:
- First, the user submits a query, which an embedding model converts into a mathematical vector.
- Next, the system searches a vector database to find similar documents. Often, a reranking model reorders these results to improve accuracy.
- Finally, the retrieved text and the user’s prompt are sent to the LLM for inference.
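The three stages above can be sketched as a toy pipeline. The bag-of-words "embedding" and every function name here are illustrative stand-ins, not a real embedding model or vector database API:

```python
# Toy sketch of the three RAG stages: embed the query, retrieve the most
# similar document, then assemble the final LLM prompt. The "embedding"
# is a simple bag-of-words vector standing in for a real embedding model.
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Stage 1: turn text into a (toy) vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: Counter, corpus: list, k: int = 1) -> list:
    """Stage 2: vector search over the indexed documents."""
    ranked = sorted(corpus, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, corpus: list) -> str:
    """Stage 3: send the retrieved context plus the user's question to the LLM."""
    context = "\n".join(retrieve(embed(query), corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = ["GPUs accelerate inference", "NVMe storage speeds up indexing"]
print(build_prompt("which storage speeds up indexing?", docs))
```

In production, each stage maps to different hardware, which is exactly why sizing is a full-stack exercise.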
Every step in this process requires different hardware components:
1. System RAM: Vector databases like Qdrant, Milvus, and Weaviate store index data in system memory for fast retrieval. A database with millions of vectors can easily consume hundreds of gigabytes of RAM.
2. CPU Cores: High-speed CPUs are required to handle incoming API requests, manage database queries, and route data between the storage layer and the graphics cards.
3. Fast Storage: NVMe SSDs are required. Vector databases write massive amounts of data to disk, and slow SATA or HDD storage will create bottlenecks during document ingestion and search.
4. Network Latency: If you split your database and your inference models across different data centers, network latency will ruin your time-to-first-token (TTFT). Keeping the stack on a single server or local network minimizes this delay.
Hardware Requirements for Every RAG Layer
This is the most essential part of how to size a GPU server for RAG. You must calculate the requirements for embeddings, the vector database, and inference separately.
The Embedding and Reranking Models
Embedding models convert text into numbers, and compared to LLMs, they are very small. For example, a model like E5-mistral-7b running at 16-bit precision requires about 14 GB of VRAM, while smaller models like stella_en_1.5B require only around 3 GB.
Reranking models are similarly sized. You can run these models on older or smaller hardware, or you can allocate a dedicated portion of your main server’s VRAM specifically for them.
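The VRAM figures above follow directly from parameter count times bytes per parameter. This minimal sketch reproduces the numbers from the text; real deployments add some overhead for activations and batching:

```python
# Rough VRAM needed just to hold model weights:
# parameters x (bits / 8) bytes per parameter.
def weight_vram_gb(params_billion: float, bits: int) -> float:
    """Return approximate weight memory in GB (decimal)."""
    return params_billion * 1e9 * (bits / 8) / 1e9

print(weight_vram_gb(7, 16))    # E5-mistral-7b at 16-bit -> 14.0 GB
print(weight_vram_gb(1.5, 16))  # stella_en_1.5B at 16-bit -> 3.0 GB
```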
The Vector Database
Fast vector search requires plenty of system RAM. Keeping your data in memory reduces search times to milliseconds. To figure out how much RAM you need, you must look at your total document count and vector size.
Indexing 1 million vectors at 768 dimensions requires about 3 GB of memory, but scaling up to 10 million vectors at 1536 dimensions requires 61 GB of dedicated memory.
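Those figures come from a simple formula: vectors x dimensions x 4 bytes for float32 storage. Note that graph indexes like HNSW and metadata add overhead on top, so treat this as a lower bound:

```python
# Back-of-envelope RAM for a flat float32 vector index.
# HNSW graphs and payload metadata add overhead on top (often 1.5x or more).
def index_ram_gb(num_vectors: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage in GB (decimal), excluding index overhead."""
    return num_vectors * dims * bytes_per_dim / 1e9

print(index_ram_gb(1_000_000, 768))    # ~3.1 GB
print(index_ram_gb(10_000_000, 1536))  # ~61.4 GB
```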
For enterprise setups, high-memory servers with 256 GB to over 1 TB of RAM and dual-socket processors like AMD EPYC are recommended.
If you decide to split your architecture and host your database separately from your inference models, you don’t necessarily need a high-end GPU for the database alone. You can easily run the Milvus vector database on a VPS to handle vector storage and retrieval efficiently.
LLM Inference
Inference is where you need raw VRAM. A standard 8-billion parameter model like Llama 3 8B running at 8-bit precision needs about 8 GB of VRAM just to load the weights. However, you also need extra VRAM for the KV cache, which stores the context window data.
RAG prompts are incredibly long because they include retrieved document text. If you want to process large context windows concurrently, you need graphics cards with 24 GB to 80 GB of VRAM.
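You can estimate KV-cache growth with the standard formula: 2 (keys and values) x layers x KV heads x head dimension x bytes per value x context length. The numbers below use the published Llama 3 8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```python
# KV-cache memory per sequence:
# 2 (K and V) x layers x kv_heads x head_dim x bytes_per_value x context_len.
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_val: int = 2) -> float:
    """Return KV-cache size in GiB for one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_val * context_len / 2**30

# Llama 3 8B at fp16 with an 8k-token RAG prompt -> 1.0 GiB per sequence.
print(kv_cache_gb(32, 8, 128, 8192))
```

Multiply that per-sequence cost by your concurrency target and add it to the weight footprint, and the 24 GB to 80 GB recommendation follows quickly.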
When sizing for enterprise workloads, you might wonder if you can fit massive language models onto a single machine. To explore the exact VRAM math and hardware setups required for this, read this guide on whether you can run a 70B model on one server.
If you are planning this inference layer, deploying a GPU dedicated server ensures you have unshared, bare-metal access to the compute power required for fast token generation.
How to Size a GPU Server for RAG with Practical Scenarios
Hardware sizing changes depend on your user base and the amount of data you need to index. Below are three deployment scenarios for how to size a GPU server for RAG:
1. Small Internal RAG Setup
For an internal setup, sizing a GPU server for RAG is mostly about controlling hardware costs while maintaining an accurate knowledge base. This scenario applies to internal company wikis, HR policy bots, or technical documentation search tools.
At this scale, you are generally processing around 1 million documents and handling fewer than 1,000 queries per day. Because user traffic is low, you can colocate all components on a single machine to keep the architecture simple and remove network latency.
For this setup, you need:
- Compute: A modern server-grade CPU like Intel Xeon or AMD EPYC with 64 GB of system RAM.
- Storage: 1 TB of high-speed NVMe storage for the vector database and OS.
- Graphics Card: A single NVIDIA RTX 4090 with 24 GB VRAM or an NVIDIA L4.
2. Customer-Facing Chatbot
In a production environment, sizing a GPU server for RAG shifts toward reducing response times and handling concurrent users. This applies to e-commerce support bots, SaaS application features, or public knowledge bases.
At this scale, you might handle up to 10 million documents and 10,000 queries a day, and customers expect instant answers. By keeping your embedding model, vector database, and LLM on the same powerful server, you remove network delays and drop processing times to just milliseconds.
For these situations, you need:
- Compute: Dual-socket CPU configuration with 128 GB to 256 GB of system RAM to ensure the vector database never has to page memory to disk.
- Storage: 2 TB of NVMe storage per node to handle frequent read and write operations from the database.
- Graphics Card: Two NVIDIA RTX 4090s or two NVIDIA L4 GPUs. You can dedicate one card entirely to the embedding model and vector indexing, and the second card to LLM inference to ensure high throughput.
3. Multi-Tenant RAG Stack
For enterprise environments, sizing a GPU server for RAG requires careful multi-tenant planning. This applies to AI platforms serving thousands of different organizations, legal document analysis software, or massive regulatory compliance systems.
At this scale, you handle more than 10 million documents and massive daily traffic. Since you cannot fit everything on one basic GPU, you need a multi-server setup to maintain quick response times.
For this case, you need:
- Compute: Four high-density CPU servers with 256 GB to 1 TB of RAM each, powered by high-end processors like the dual EPYC 9634.
- Storage: 4 TB or more of NVMe storage per server node.
- Graphics Card: Four NVIDIA A100 or H100 cards, or 4 to 8 NVIDIA L4 GPUs.
To keep these expensive GPUs busy, large teams use tools like NVIDIA Run:ai to group their hardware together and automatically share the work.
Choose the Right Hardware for Your RAG Stack
Choosing the right setup can be tricky. Use this quick comparison table to see exactly what kind of hardware you need based on your document count and user traffic.
| Deployment Scenario | Document Count | System RAM | Storage Requirement | Recommended GPUs |
|---|---|---|---|---|
| Small Internal Setup | ~1 Million | 64 GB | 1 TB NVMe | 1x RTX 4090 or L4 |
| Customer-Facing Bot | ~10 Million | 128 GB to 256 GB | 2 TB NVMe | 2x RTX 4090 or L4 |
| Multi-Tenant Stack | 10M+ | 256 GB to 1 TB+ | 4 TB+ NVMe | 4x A100/H100 or 4-8x L4 |
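If you want this lookup in code, a hypothetical helper can map document count to the tiers in the table. The thresholds here are illustrative; tune them against your actual query volume and latency targets:

```python
# Hypothetical tier picker based on the sizing table above.
# Thresholds are illustrative, not hard limits.
def recommend_tier(doc_count: int) -> str:
    if doc_count <= 1_000_000:
        return "Small Internal Setup: 64 GB RAM, 1 TB NVMe, 1x RTX 4090/L4"
    if doc_count <= 10_000_000:
        return "Customer-Facing Bot: 128-256 GB RAM, 2 TB NVMe, 2x RTX 4090/L4"
    return "Multi-Tenant Stack: 256 GB-1 TB+ RAM, 4 TB+ NVMe, 4x A100/H100 or 4-8x L4"

print(recommend_tier(500_000))
```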
Conclusion
Ultimately, how to size a GPU server for RAG depends on your document count and user traffic. You need enough RAM for your vector database, fast NVMe storage to prevent delays, and enough VRAM for your language models. Treating your embeddings, database, and inference as one connected stack is the best way to guarantee fast responses for your users.
If you are ready to deploy your architecture on hardware built specifically for these demands, you can build your RAG stack on PerLod AI infrastructure.
We hope you enjoy this guide.
Sizing your hardware is only half the job; once you understand your memory and compute needs, you still have to deploy the software. For a step-by-step software architecture, check out this complete tutorial on building RAG pipelines using GPU servers.
FAQs
What is the most common bottleneck when deciding how to size a GPU server for RAG?
The most common bottleneck is system RAM and storage speed. Many teams buy powerful graphics cards but use slow storage or insufficient RAM for their vector database, which causes the GPU to sit idle while waiting for the database to retrieve context.
Do I need a separate server for the vector database?
For small to medium workloads, you can run the vector database and the inference models on the same server to reduce network latency. For enterprise-scale applications with tens of millions of documents, it is better to use dedicated high-memory CPU servers for the database.
Can I use HDD storage for the RAG system?
No. Vector databases perform heavy read and write operations during indexing and retrieval. HDDs are too slow and will degrade your system’s response time. Always use high-speed NVMe SSDs.