GPU-Optimized RAG Pipelines for Fast and Scalable AI
Building smart AI systems often starts with Retrieval-Augmented Generation (RAG). But as your data grows and your users expect faster answers, a basic RAG setup is no longer enough. This is where GPU-powered embeddings and model serving make a big difference. In this guide, you will learn how to build RAG pipelines using GPU servers.
By moving heavy tasks like generating embeddings and running large language models onto GPUs, you can speed up response times, handle larger workloads, and get much better accuracy.
For developers and companies looking for reliable hardware, PerLod Hosting offers GPU servers optimized for AI workloads.
Building RAG Pipelines Using GPU Servers
In this guide, we assume you are running Ubuntu 22.04 or 24.04 on a server that has an NVIDIA GPU.
Our goal is to build a self-hosted RAG system that takes full advantage of a dedicated GPU server: running embeddings on the GPU, running the LLM on the GPU, and using a vector database (Qdrant) to store and retrieve documents.
Here are the three main parts we want to use:
- Use Hugging Face’s Text Embeddings Inference (TEI) to create embeddings on the GPU.
- Serve the large language model using vLLM, which gives us an OpenAI-compatible API and very fast GPU inference.
- Store the vectors in Qdrant and access them through its Python client.
This setup gives you a fast, flexible, and fully self-hosted RAG system that takes full advantage of your GPU hardware.
Prerequisites for Setting Up a Fully Self-hosted RAG System
In this step, we configure the Ubuntu server to run GPU-accelerated applications inside Docker containers, so that containerized software can access and use the GPU.
Install the NVIDIA drivers and CUDA by following the official documentation or by using a cloud provider image that already includes them. The driver version must be compatible with CUDA 12.2+, because the TEI images are built against it.
To verify your GPU model and driver, you can run the command below:
nvidia-smi
Install Docker on your server by using the following commands:
sudo apt update
sudo apt install ca-certificates curl gnupg -y
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo \
"deb [arch=$(dpkg --print-architecture) \
signed-by=/etc/apt/keyrings/docker.gpg] \
https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" \
| sudo tee /etc/apt/sources.list.d/docker.list >/dev/null
sudo apt update
sudo apt install docker-ce docker-ce-cli containerd.io -y
sudo usermod -aG docker $USER
Log out and log back in so that your user's new Docker group membership takes effect.
To set up the NVIDIA container toolkit, you can run the commands below:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey \
| sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list \
| sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
| sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install nvidia-container-toolkit -y
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
To confirm everything is working correctly, run nvidia-smi from inside a container:
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Deploy a GPU-Powered Embedding Service with Text Embedding Inference (TEI)
Once your server is configured with the necessary GPU and Docker tools, you can deploy a high-performance service for generating text embeddings.
In this step, we want to launch Hugging Face’s Text Embeddings Inference (TEI) server, which is a specialized tool designed for fast and efficient batch embedding generation on GPU hardware.
The first step is to choose the model and create the model directory. We use the popular BAAI/bge-large-en-v1.5 model.
Create a project directory and navigate to it with the commands below:
mkdir -p ~/tei_data
cd ~/tei_data
Choose the model and create the directory with the commands below:
export TEI_MODEL="BAAI/bge-large-en-v1.5"
export TEI_VOLUME="$PWD/data"
mkdir -p "$TEI_VOLUME"
Then run the TEI container with Docker and expose it as a web service on port 8080 with the command below:
docker run --gpus all \
-p 8080:80 \
-v "$TEI_VOLUME":/data \
--pull always \
ghcr.io/huggingface/text-embeddings-inference:1.8 \
--model-id "$TEI_MODEL"
On the first run, the container downloads the model weights into /data, so subsequent runs can reuse them instead of downloading them again.
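If you prefer to keep TEI running in the background, a detached variant of the same command (reusing the TEI_MODEL and TEI_VOLUME variables set above) looks like this:
docker run -d --name tei \
--restart unless-stopped \
--gpus all \
-p 8080:80 \
-v "$TEI_VOLUME":/data \
ghcr.io/huggingface/text-embeddings-inference:1.8 \
--model-id "$TEI_MODEL"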
Once it is running, you can send text to its API endpoint and receive back embeddings, which are the fundamental building blocks for RAG. To do this, you can run the command below:
curl 127.0.0.1:8080/embed \
-X POST \
-H "Content-Type: application/json" \
-d '{"inputs":"What is Retrieval Augmented Generation?"}'
You should receive a JSON response containing the embedding vector, which is an array of floating-point numbers.
Deploy a High-Performance GPU LLM Server with vLLM
The next essential component of a RAG backend is a high-performance Large Language Model (LLM) server. We want to deploy vLLM, which is a powerful inference engine known for its efficiency and OpenAI-compatible API.
You can easily create an isolated Python environment, install vLLM, and launch a server that provides standard OpenAI endpoints like /v1/chat/completions.
Set up the Python environment with the commands below:
mkdir -p ~/rag_llm && cd ~/rag_llm
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install vllm
Choose a model that fits your GPU: a 24 GB GPU can often run a 7B or 8B instruct model, while smaller VRAM calls for a smaller model.
For example, we deploy vLLM to serve the Qwen/Qwen3-8B model on port 8001 with the command below:
vllm serve Qwen/Qwen3-8B \
--port 8001 \
--host 0.0.0.0
This downloads the model from Hugging Face on the first run and starts the server at http://0.0.0.0:8001/v1.
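If the model does not fit comfortably in VRAM, vLLM provides flags to cap memory use and context length. A sketch with assumed values (tune them to your GPU and model):
vllm serve Qwen/Qwen3-8B \
--port 8001 \
--host 0.0.0.0 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192
Here --gpu-memory-utilization limits the fraction of VRAM vLLM reserves, and --max-model-len caps the maximum context length.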
Now, verify the LLM server with the following curl command:
curl http://127.0.0.1:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role":"user","content":"Say hello from a PerLod GPU optimized RAG backend."}
],
"max_tokens": 128
}'
You should see a JSON response with a choices array containing the model output.
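If you prefer Python, the same check works with the OpenAI client pointed at vLLM. This is a minimal sketch and assumes the openai package is installed (it is added to the venv later in this guide):
# verify_vllm.py - minimal check of the vLLM OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://127.0.0.1:8001/v1")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Reply with a one-line greeting."}],
    max_tokens=32,
)
print(response.choices[0].message.content)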
Start Qdrant Vector Database
At this point, you need a dedicated database to store and query the vector embeddings generated by the TEI server. We want to use Qdrant, which is a high-performance and production-ready vector database.
Qdrant acts as the long-term memory for your application, efficiently storing vector embeddings and enabling lightning-fast similarity searches. When a user asks a question, the RAG pipeline will query Qdrant to find the most relevant information before the LLM generates a final answer.
Run Qdrant vector database in Docker with a persistent volume to ensure your data survives when the container restarts:
mkdir -p ~/qdrant_storage
docker run -p 6333:6333 -p 6334:6334 \
-v "$HOME/qdrant_storage:/qdrant/storage:z" \
qdrant/qdrant
- REST API: http://localhost:6333
- Web Dashboard: http://localhost:6333/dashboard
- gRPC: localhost:6334
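To confirm Qdrant is reachable, you can query its REST API; on a fresh instance the collections list is empty:
curl http://localhost:6333/collections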
Build the RAG System with Python: Ingestion and Query
In this step, you wire the services together by creating a small Python layer that covers the entire workflow: ingesting documents and answering queries.
Create the RAG Ingestion Service
This Python service will ingest your documents by chunking and embedding them with the TEI server, store the results in Qdrant, and then answer questions by retrieving relevant context and generating responses with the vLLM LLM server.
Install the libraries inside the ~/rag_llm venv you created earlier, or create a separate environment if you prefer:
cd ~/rag_llm
source .venv/bin/activate
pip install qdrant-client httpx fastapi uvicorn tiktoken
pip install openai # for OpenAI compatible client
Here we use the ~/rag_llm directory as the project directory. Create a basic configuration file with the command below:
nano ~/rag_llm/config.py
Add the following configuration to the file:
# config.py
TEI_URL = "http://127.0.0.1:8080" # TEI endpoint
QDRANT_URL = "http://127.0.0.1:6333" # Qdrant REST
QDRANT_COLLECTION = "rag_docs"
VLLM_BASE_URL = "http://127.0.0.1:8001/v1" # vLLM OpenAI compatible
VLLM_MODEL = "Qwen/Qwen3-8B"
EMBEDDING_DIM = 1024 # match your TEI model dimension (check HF model card)
TOP_K = 5
CHUNK_SIZE = 512
CHUNK_OVERLAP = 64
Note: Check the embedding size on the model’s page. For example, BGE Large often uses 1024, but you should confirm this on Hugging Face.
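If you want to confirm the dimension programmatically, a quick check against the running TEI server looks like this (a minimal sketch, assuming TEI is on port 8080 and httpx is installed):
# check_dim.py - print the embedding dimension reported by the running TEI server
import httpx

resp = httpx.post(
    "http://127.0.0.1:8080/embed",
    json={"inputs": "dimension check"},
    timeout=30.0,
)
resp.raise_for_status()
data = resp.json()
# TEI usually returns a bare JSON array of vectors; some setups wrap it as {"embeddings": [...]}
vectors = data["embeddings"] if isinstance(data, dict) else data
print(len(vectors[0]))  # should match EMBEDDING_DIM in config.py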
Then, create a chunking file with the command below:
nano ~/rag_llm/chunking.py
Add the following chunking configuration to the file:
# chunking.py
from typing import List


def simple_chunk(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    """Simple word based chunking."""
    words = text.split()
    if not words:
        return []

    chunks = []
    start = 0
    while start < len(words):
        end = min(len(words), start + chunk_size)
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        if end == len(words):
            break
        start = end - overlap
    return chunks
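A quick way to sanity-check the chunker is to run it on a short string with a tiny chunk size and inspect the overlap:
# Example usage of simple_chunk with a small chunk size
from chunking import simple_chunk

text = "one two three four five six seven eight nine ten"
print(simple_chunk(text, chunk_size=4, overlap=1))
# ['one two three four', 'four five six seven', 'seven eight nine ten']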
Now create the client wrappers for TEI, Qdrant, and the LLM:
nano ~/rag_llm/clients.py
Add the following configuration to the file:
# clients.py
import httpx
from typing import List, Dict, Any
from qdrant_client import QdrantClient, models
from openai import OpenAI

from config import TEI_URL, QDRANT_URL, QDRANT_COLLECTION, VLLM_BASE_URL


class EmbeddingClient:
    def __init__(self, base_url: str = TEI_URL):
        self.base_url = base_url.rstrip("/")

    async def embed(self, texts: List[str]) -> List[List[float]]:
        async with httpx.AsyncClient(timeout=60.0) as client:
            response = await client.post(
                f"{self.base_url}/embed",
                json={"inputs": texts},
            )
            response.raise_for_status()
            data = response.json()
            # TEI usually returns a bare JSON array of vectors ([[...], ...]);
            # handle a wrapped {"embeddings": [[...], ...]} response as well.
            if isinstance(data, dict) and "embeddings" in data:
                return data["embeddings"]
            return data


class QdrantVectorStore:
    def __init__(self, url: str = QDRANT_URL, collection: str = QDRANT_COLLECTION):
        self.collection = collection
        self.client = QdrantClient(url=url)

    def ensure_collection(self, dim: int):
        if not self.client.collection_exists(self.collection):
            self.client.create_collection(
                collection_name=self.collection,
                vectors_config=models.VectorParams(
                    size=dim,
                    distance=models.Distance.COSINE,
                ),
            )

    def upsert_points(self, vectors: List[List[float]], payloads: List[Dict[str, Any]]):
        assert len(vectors) == len(payloads)
        points = [
            models.PointStruct(
                id=i,
                vector=v,
                payload=payloads[i],
            )
            for i, v in enumerate(vectors)
        ]
        self.client.upsert(
            collection_name=self.collection,
            wait=True,
            points=points,
        )

    def search(self, query_vector: List[float], top_k: int = 5):
        result = self.client.query_points(
            collection_name=self.collection,
            query=query_vector,
            limit=top_k,
        )
        # query_points returns a QueryResponse; the scored hits live in .points
        return result.points


class LLMClient:
    def __init__(self, base_url: str = VLLM_BASE_URL, api_key: str = "EMPTY"):
        # OpenAI client, but pointed to vLLM
        self.client = OpenAI(
            api_key=api_key,
            base_url=base_url,
        )

    def chat(self, model: str, messages: List[Dict[str, str]], max_tokens: int = 512):
        response = self.client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
        )
        return response.choices[0].message.content
Note: The Qdrant client code follows the official examples, using VectorParams for setting up the collection and query_points for searching.
Finally, create the ingestion script file with the following command:
nano ~/rag_llm/ingest.py
Add the following script to the file:
# ingest.py
import asyncio
from pathlib import Path
from typing import List

from config import EMBEDDING_DIM, QDRANT_COLLECTION, CHUNK_SIZE, CHUNK_OVERLAP
from chunking import simple_chunk
from clients import EmbeddingClient, QdrantVectorStore

DOCS_DIR = Path("./documents")


async def ingest_directory():
    emb_client = EmbeddingClient()
    store = QdrantVectorStore()
    store.ensure_collection(EMBEDDING_DIM)

    all_chunks: List[str] = []
    payloads = []

    for path in DOCS_DIR.glob("*.txt"):
        text = path.read_text(encoding="utf8", errors="ignore")
        chunks = simple_chunk(text, CHUNK_SIZE, CHUNK_OVERLAP)
        for idx, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            payloads.append(
                {
                    "source": str(path.name),
                    "chunk_index": idx,
                    "text": chunk,
                }
            )

    # Embed in batches
    batch_size = 64
    vectors: List[List[float]] = []
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i : i + batch_size]
        batch_vectors = await emb_client.embed(batch)
        vectors.extend(batch_vectors)

    store.upsert_points(vectors, payloads)
    print(f"Ingested {len(all_chunks)} chunks into collection '{QDRANT_COLLECTION}'.")


if __name__ == "__main__":
    asyncio.run(ingest_directory())
Place your .txt files inside the ./documents folder:
mkdir -p ~/rag_llm/documents
Now you can run the ingestion with the command below:
cd ~/rag_llm
python ingest.py
This script will:
- Make sure the Qdrant collection exists with the correct vector size.
- Split your documents into chunks and create embeddings using TEI.
- Save the vectors and metadata into Qdrant.
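To verify the ingestion, you can ask Qdrant for the collection details; the points_count field in the response should match the number of chunks the script reported:
curl http://localhost:6333/collections/rag_docs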
Create the RAG Query Service
To make the RAG system accessible to frontend applications and other services, you need a standardized interface. You can create a FastAPI service that wraps the entire RAG pipeline into a clean HTTP endpoint.
Create the RAG query API file with the command below:
nano ~/rag_llm/rag_api.py
Add the following configuration to create a /rag/query endpoint that takes a question and returns an answer with supporting context:
# rag_api.py
import asyncio
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

from config import TOP_K, VLLM_MODEL
from clients import EmbeddingClient, QdrantVectorStore, LLMClient

app = FastAPI()

emb_client = EmbeddingClient()
store = QdrantVectorStore()
llm_client = LLMClient()


class QueryRequest(BaseModel):
    question: str
    top_k: int | None = None


class QueryResponse(BaseModel):
    answer: str
    context: List[str]


@app.post("/rag/query", response_model=QueryResponse)
async def rag_query(req: QueryRequest):
    top_k = req.top_k or TOP_K

    # 1. Embed question
    vectors = await emb_client.embed([req.question])
    query_vec = vectors[0]

    # 2. Search in Qdrant
    result = store.search(query_vec, top_k=top_k)
    context_chunks = []
    for hit in result:
        payload = hit.payload or {}
        context_chunks.append(payload.get("text", ""))

    context_text = "\n\n".join(
        f"- {chunk}" for chunk in context_chunks if chunk.strip()
    )

    # 3. Build prompt
    system_prompt = (
        "You are a helpful assistant that answers using the provided context.\n"
        "If the context does not contain the answer, say you are not sure.\n"
    )
    user_prompt = (
        f"Context:\n{context_text}\n\n"
        f"Question: {req.question}\n\n"
        "Answer in a concise and accurate way based only on the context."
    )

    # 4. Call LLM via vLLM OpenAI compatible API
    answer = llm_client.chat(
        model=VLLM_MODEL,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=512,
    )

    return QueryResponse(answer=answer, context=context_chunks)
Run the API with the following command:
uvicorn rag_api:app --host 0.0.0.0 --port 9000
You can test it with the following curl command:
curl http://127.0.0.1:9000/rag/query \
-H "Content-Type: application/json" \
-d '{"question": "What is this project about?"}'
You should get JSON with the answer and the context array.
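From application code, you can call the same endpoint with any HTTP client. A minimal Python sketch using httpx (already installed in the venv):
# query_example.py - call the RAG endpoint from Python
import httpx

resp = httpx.post(
    "http://127.0.0.1:9000/rag/query",
    json={"question": "What is this project about?", "top_k": 3},
    timeout=120.0,
)
resp.raise_for_status()
body = resp.json()
print("Answer:", body["answer"])
print("Context chunks used:", len(body["context"]))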
FAQs
What is a RAG architecture?
Retrieval-Augmented Generation (RAG) is an AI architecture that combines a vector database, retriever, and generative model. When a user asks a question, the system finds the most relevant documents and sends them to the LLM to generate accurate and well-informed answers.
Can I host embeddings and LLM models on separate GPUs?
Yes. This is recommended for performance:
GPU 0: Embeddings (TEI)
GPU 1: LLM (vLLM)
This prevents resource contention and increases throughput.
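A hedged sketch of how to split the services across GPUs (device indices depend on your machine): Docker can pin the TEI container to one GPU, and CUDA_VISIBLE_DEVICES can pin vLLM to another.
# Pin TEI to GPU 0 via Docker device selection (reusing the variables from the TEI step)
docker run --gpus '"device=0"' \
-p 8080:80 \
-v "$TEI_VOLUME":/data \
ghcr.io/huggingface/text-embeddings-inference:1.8 \
--model-id "$TEI_MODEL"

# Pin vLLM to GPU 1 via CUDA device masking
CUDA_VISIBLE_DEVICES=1 vllm serve Qwen/Qwen3-8B --port 8001 --host 0.0.0.0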
What models are recommended for embeddings?
Popular high-performance embedding models include BAAI/bge-large-en-v1.5, gte-large, and nomic-embed-text. All of them perform well when served through Hugging Face TEI.
Conclusion
Building a Retrieval-Augmented Generation (RAG) system with GPU-accelerated embeddings and model serving gives you the best combination of speed, accuracy, and scalability.
By using Hugging Face TEI for embeddings, vLLM for high-speed LLM serving, and Qdrant for vector search, you can create a modern, production-ready system that handles real workloads efficiently.
This GPU-based RAG pipeline setup allows you to:
- Quickly ingest large knowledge bases.
- Get fast, low-latency LLM responses.
- Make full use of your GPU resources.
- Build a flexible and extendable retrieval pipeline.
We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest updates and articles on GPU hosting.