Build High-Performance OpenAI Compatible Hosting
Running large language models (LLMs) on your own GPU gives you full control and better privacy, and it can save money in the long run. This guide shows you how to set up an OpenAI Compatible API server on your own hardware or on a hosting provider such as PerLod Hosting, so that any application that works with OpenAI's API can connect to your local models instead.
When we say a model server is “OpenAI-compatible,” it means the server uses the same format and endpoints as OpenAI’s official API. This allows you to use the same code, tools, and applications you would use with OpenAI, but pointed at your own server instead.
Applications like LangChain, LibreChat, and Open WebUI can connect to your local server without modification.
Hardware and Software Requirements for OpenAI Compatible Hosting
The most important component is your GPU. If you don't want to manage the hardware yourself, you can also use AI hosting from PerLod, which provides ready-to-use GPU servers already configured for OpenAI-compatible model serving.
Here is what you need for OpenAI Compatible Hosting:
1. Operating system: Ubuntu 22.04 or Ubuntu 24.04 LTS.
2. GPU: NVIDIA GPU with compute capability 7.0 or higher.
3. GPU Memory (VRAM): Depends on model size (a rough sizing sketch follows this list).
- Small models (3B-7B parameters): 8-12GB VRAM
- Medium models (13B-20B parameters): 16-24GB VRAM
- Large models (70B+ parameters): 48GB+ VRAM or multiple GPUs
4. System RAM: At least 16GB.
5. NVIDIA Drivers: Version 535 or newer.
6. CUDA: Version 11.8 or 12.1+.
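As a rough rule of thumb behind the VRAM figures above, model weights take about 2 bytes per parameter in FP16/BF16 (roughly 0.5-1 byte when 4-bit or 8-bit quantized), plus headroom for the KV cache and activations. The short Python sketch below illustrates the arithmetic; the function name and the 1.2x overhead factor are illustrative assumptions, and real usage also depends on context length and batch size:
# Back-of-the-envelope VRAM estimate; names and the 1.2x overhead are illustrative.
def estimate_vram_gb(params_billions, bytes_per_param=2.0, overhead=1.2):
    # Weights: roughly 2 GB per billion parameters at FP16/BF16.
    weights_gb = params_billions * bytes_per_param
    # Add headroom for the KV cache, activations, and CUDA buffers.
    return weights_gb * overhead

for size in (3, 7, 13, 70):
    fp16 = estimate_vram_gb(size)
    int4 = estimate_vram_gb(size, bytes_per_param=0.5)
    print(f"{size}B params: ~{fp16:.0f} GB at FP16, ~{int4:.0f} GB at 4-bit")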
Popular GPU Choices include:
- Budget: NVIDIA RTX 3060 (12GB), RTX 3080 (10-12GB)
- Mid-range: RTX 3090/4090 (24GB), RTX 4080 (16GB)
- Professional: A100 (40-80GB), H100 (80GB), L40S (48GB)
Software Requirements:
- Python 3.9 or newer
- pip (Python package manager)
- NVIDIA CUDA toolkit
- Git (for downloading models)
Prepare System for Setting Up OpenAI Compatible API Server
First, make sure your system is up to date. Run the update and upgrade commands below:
sudo apt update
sudo apt upgrade -y
Then, install Python and essential packages with the following command:
sudo apt install python3-pip python3-venv git -y
Verify your NVIDIA driver with:
nvidia-smi
The output should show your GPU model and driver version. If the command fails, you need to install the NVIDIA drivers first.
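If you need to install the drivers, one common approach on Ubuntu is the ubuntu-drivers tool (available driver versions vary by release, so treat this as a starting point), followed by a reboot:
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers autoinstall
sudo reboot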
Check CUDA version with:
nvcc --version
If CUDA is not installed, download it from NVIDIA’s website and follow the installation instructions provided.
Next, set up a Python virtual environment to keep your installation clean and prevent conflicts with other software.
Create a new virtual environment with the following commands:
mkdir ~/llm-server
cd ~/llm-server
python3 -m venv venv
Activate the environment with the command below:
source venv/bin/activate
You should see (venv) appear at the start of your command prompt.
Once you are done, proceed to the following steps to install your LLM model and set up the OpenAI Compatible API Server.
Install vLLM for OpenAI Compatible Serving
vLLM is a fast and efficient library for serving large language models, and it provides a built-in OpenAI-compatible API server.
Install vLLM inside your activated virtual environment with the command below:
pip install vllm
It will take a few minutes to download and install all necessary components.
Verify the vLLM installation by checking its version:
vllm --version
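If the vllm command is not found on your PATH for some reason, you can also check the installed version directly from Python:
python -c "import vllm; print(vllm.__version__)"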
Choose and Download a Model for OpenAI Compatible Hosting
Models are typically downloaded from Hugging Face, which is a platform that hosts thousands of open-source language models.
Popular open-source models for 2025, grouped by hardware tier:
For 8-12GB VRAM:
- meta-llama/Llama-3.2-3B-Instruct (3 billion parameters)
- google/gemma-2-2b-it (2 billion parameters)
- mistralai/Mistral-7B-Instruct-v0.3 (7 billion parameters)
For 16-24GB VRAM:
- meta-llama/Llama-3.1-8B-Instruct (8 billion parameters)
- mistralai/Mistral-Small-3.2-24B-Instruct (24 billion parameters)
- Qwen/Qwen2.5-14B-Instruct (14 billion parameters)
For 40GB+ VRAM or Multi-GPU:
- meta-llama/Llama-3.1-70B-Instruct (70 billion parameters)
- openai/gpt-oss-120b (117 billion parameters)
- Qwen/Qwen2.5-72B-Instruct (72 billion parameters)
Note: Models will automatically download when you first use them, or you can download them manually using the Hugging Face CLI.
To install Hugging Face CLI, you can run:
pip install huggingface-hub
Then, download a model manually with the example below:
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct
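Note that some models, including the Meta Llama family, are gated on Hugging Face: you must accept the license on the model page and log in with an access token before the download will work:
huggingface-cli login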
Start the OpenAI Compatible Server
At this point, you can easily start your server. vLLM provides a simple command to launch an OpenAI-compatible API server.
For a basic start, you can run:
vllm serve meta-llama/Llama-3.2-3B-Instruct --dtype auto
In this command:
- The vllm serve command launches the OpenAI-compatible server.
- meta-llama/Llama-3.2-3B-Instruct is the model to load. You can change this to your chosen model.
- --dtype auto automatically selects the best data type for your GPU.
The server will start on http://localhost:8000 by default.
You can also customize your server with additional options.
For security, you can add an API key to your command:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--dtype auto \
--api-key your-secret-key-here
Change the host and port with the command below:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--dtype auto \
--host 0.0.0.0 \
--port 8000
Using --host 0.0.0.0 allows connections from other devices on your network, not just localhost.
Adjust GPU memory usage with the following option:
vllm serve meta-llama/Llama-3.2-3B-Instruct \
--dtype auto \
--gpu-memory-utilization 0.9
This tells vLLM to use up to 90% of your GPU memory.
For multiple GPUs, you can use tensor parallelism:
vllm serve meta-llama/Llama-3.1-70B-Instruct \
--dtype auto \
--tensor-parallel-size 2
The --tensor-parallel-size should match the number of GPUs you want to use. For example, if you have 4 GPUs, set it to 4.
Here is an example with all options included:
vllm serve mistralai/Mistral-7B-Instruct-v0.3 \
--dtype auto \
--host 0.0.0.0 \
--port 8000 \
--api-key sk-your-secret-key \
--gpu-memory-utilization 0.9 \
--max-model-len 4096
The --max-model-len option sets the maximum context length, that is, how much text the model can process at once.
Test OpenAI Compatible Serving
Once your server is up and running, you can test it to make sure everything works correctly.
Open a new terminal so the server keeps running in the first one, then use the following curl command:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-your-secret-key" \
-d '{
"model": "meta-llama/Llama-3.2-3B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello! Can you explain what you are?"}
],
"max_tokens": 100,
"temperature": 0.7
}'
Note: If you didn’t set an API key, remove the Authorization header line. If you set a different API key, replace sk-your-secret-key with your actual key.
You should receive a JSON response with the model’s answer.
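The exact values will differ, but the response follows the standard OpenAI chat completion shape, roughly like this (abbreviated, with illustrative values):
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! I'm a language model..."},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 25, "completion_tokens": 38, "total_tokens": 63}
}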
You can also test the server from Python using the OpenAI library. Install it with the command below:
pip install openai
Create a test script file with the command below:
nano test_api.py
Add the following script to the file:
from openai import OpenAI
# Point the client to your local server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key"  # Use your actual key, or any string if no key is set
)

# Make a chat completion request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",  # Must match the model you're serving
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    max_tokens=100,
    temperature=0.7
)
print(response.choices[0].message.content)
Then, run the script:
python test_api.py
If everything is working, you’ll see the model’s response printed to your terminal.
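The server also supports streaming through the same client. Here is a minimal sketch, reusing the same placeholder key and model as above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-your-secret-key")

# Ask for a streamed response and print tokens as they arrive
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=100,
    stream=True
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()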
Use OpenAI-compatible API Server with Existing Applications
One of the main benefits of an OpenAI-compatible API is that you can use it with any tool that supports OpenAI.
For example, you can use it with LangChain, which is a popular framework for building LLM applications. Here’s how to connect it to your local server:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="sk-your-secret-key",
    model="meta-llama/Llama-3.2-3B-Instruct"
)
response = llm.invoke("Tell me a joke about programming")
print(response.content)
Another example is to use it with Open WebUI, which is a user-friendly interface for chatting with LLMs. To connect it to your local server:
1. Install Docker and Docker Compose.
2. Set up Open WebUI following their documentation.
3. In Open WebUI settings, go to Admin Settings, Connections, and OpenAI.
4. Add a new connection:
- API URL: http://localhost:8000/v1 (if Open WebUI runs in Docker, replace localhost with your host machine's IP address or host.docker.internal so the container can reach the server)
- API Key: Your key (or leave empty if you didn’t set one)
5. Save and start chatting.
You can also use environment variables. Many applications read OpenAI credentials from environment variables, and you can set these to point to your local server:
export OPENAI_API_BASE=http://localhost:8000/v1
export OPENAI_API_KEY=sk-your-secret-key
Now any application that uses these environment variables will automatically connect to your local server instead of OpenAI.
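Note that the variable name for the base URL differs between tools: many integrations (including LangChain) read OPENAI_API_BASE, while the current OpenAI Python SDK reads OPENAI_BASE_URL, so it is safest to export both:
export OPENAI_BASE_URL=http://localhost:8000/v1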
Run vLLM OpenAI-Compatible API Server as a System Service
To keep your server running even after you close your terminal or reboot, you can set it up as a system service.
Create a service file with the command below:
sudo nano /etc/systemd/system/vllm-server.service
Add the following content to the file, adjusting the username, paths, and settings to match your setup:
[Unit]
Description=vLLM OpenAI-Compatible API Server
After=network.target
[Service]
Type=simple
User=your-username
WorkingDirectory=/home/your-username/llm-server
Environment="PATH=/home/your-username/llm-server/venv/bin"
ExecStart=/home/your-username/llm-server/venv/bin/vllm serve meta-llama/Llama-3.2-3B-Instruct --dtype auto --host 0.0.0.0 --port 8000
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Save the file and enable the service with the commands below:
sudo systemctl daemon-reload
sudo systemctl enable vllm-server
sudo systemctl start vllm-server
Check the service status with the following command:
sudo systemctl status vllm-server
View logs with:
sudo journalctl -u vllm-server -f
Now your server will start automatically when your system boots.
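If you later change the model or flags in the unit file, reload systemd and restart the service so the changes take effect:
sudo systemctl daemon-reload
sudo systemctl restart vllm-server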
Alternative Option: Use Ollama for OpenAI Compatible Hosting
If you prefer a simpler setup, you can consider using Ollama, which is easier to get running but offers less customization than vLLM.
Install Ollama with the command below:
curl -fsSL https://ollama.com/install.sh | sh
Run a model with Ollama:
ollama run llama3.2:3b
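If you only want to download the model without opening an interactive chat, you can pull it instead:
ollama pull llama3.2:3b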
Ollama automatically exposes an OpenAI-compatible API on port 11434. To use it from Python, run:
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Ollama doesn't require an API key
)

response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Ollama is great for getting started quickly, but vLLM offers better performance and more configuration options for production use.
Fix Common Issues in OpenAI Compatible Serving
While setting up the OpenAI-compatible API server, you may encounter some errors. Here are the most common issues and their solutions:
1. CUDA out of memory error: The model is too large for your GPU. Try:
- Using a smaller model.
- Reducing "--max-model-len" to decrease context size.
- Lowering "--gpu-memory-utilization" to 0.8 or 0.7.
- Using quantized models such as 4-bit or 8-bit versions.
2. Server is slow or unresponsive:
- Check if you have enough VRAM.
- Ensure no other programs are using the GPU.
- Try reducing batch size or concurrent requests.
- Make sure you're using --dtype auto for optimal performance.
3. Cannot connect from other devices:
- Use "--host 0.0.0.0" instead of localhost.
- Check your firewall settings: sudo ufw allow 8000
- Verify the server is running: curl http://localhost:8000/v1/models
4. Model downloading is very slow:
- Models can be 10-50GB in size, so downloads take time.
- Consider downloading manually first using huggingface-cli download.
- Check your internet connection.
- Use --download-dir to specify a download location with enough space.
Performance Optimization Tips for OpenAI Compatible Serving
Good performance makes your local AI server feel fast and stable for every request. Here are some simple ways to speed up your OpenAI‑compatible server, like using more than one GPU, turning on flash attention, changing batch size, and choosing light (quantized) models.
These tips help you get more speed from your GPU and avoid common memory errors while serving many users.
1. Use tensor parallelism for large models: If you have multiple GPUs, use --tensor-parallel-size to split the model across them.
2. Enable FlashAttention for faster inference: recent vLLM versions use the FlashAttention backend automatically when your GPU supports it, and you can request it explicitly through the attention-backend environment variable:
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve model-name --dtype auto
3. Adjust batch size for high throughput:
vllm serve model-name --dtype auto --max-num-seqs 256
4. Use quantized models: 4-bit or 8-bit quantized models use less memory and run faster with minimal quality loss. Look for models with "GPTQ," "AWQ," or "GGUF" in their name (see the serving example after these tips).
5. Monitor GPU usage: Keep nvidia-smi running in a separate terminal to watch GPU utilization and memory usage.
Tip: To get a complete setup for enabling Flash Attention, you can check this guide on Hosting Models with FlashAttention.
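For tip 4, here is a hedged example of serving a pre-quantized model. The model ID is illustrative (use any AWQ or GPTQ checkpoint that fits your GPU), and in recent vLLM versions the quantization method is usually detected automatically from the model config, so the explicit flag is often optional:
vllm serve TheBloke/Mistral-7B-Instruct-v0.2-AWQ \
--quantization awq \
--dtype auto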
Cost Comparison for OpenAI Compatible Serving
Running your own models can be much cheaper than using cloud APIs when you send a lot of requests every month. With cloud providers, you pay for every million tokens you send and receive, so your bill grows as your traffic grows.
With a self‑hosted setup, you mainly pay once for the GPU hardware and then for power and cooling, while there are no per‑token fees. This means that after the initial hardware cost, busy workloads can become far more cost‑effective on your own server.
In addition to running your own hardware, you can also choose GPU dedicated servers, which offer high-performance GPUs. These servers give you the benefits of self-hosting without the upfront hardware purchase.
OpenAI GPT-4o API pricing (at the time of writing):
- Input: $2.50 per million tokens
- Output: $10.00 per million tokens
Self-Hosted Costs:
- One-time GPU purchase: $500–$5,000, depending on the GPU.
- Or GPU dedicated servers: monthly or hourly rental (no per-token fees), ideal if you want full control without buying hardware.
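As a rough, hypothetical example: a workload of 100 million input tokens and 20 million output tokens per month would cost about 100 × $2.50 + 20 × $10.00 = $450 per month at the API rates above, so a $2,000 GPU could pay for itself in roughly five months of comparable traffic, before accounting for power and cooling.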
FAQs
Do I need a GPU to run my own OpenAI-compatible server?
Yes. Although small models can run on modest GPUs (8–12GB VRAM), larger models require 24GB, 48GB, or even multiple GPUs. If you don’t own a GPU, you can use GPU dedicated servers or AI hosting from PerLod, which provides preconfigured environments ready for OpenAI-compatible serving.
Can I use environment variables like with OpenAI?
Yes. Any tool that uses OpenAI’s environment variables will automatically connect to your local or hosted server.
What is an OpenAI-compatible API server?
An OpenAI-compatible API server uses the same API format, endpoints, and request structure as OpenAI’s official API. This allows any application to connect to your own self-hosted model instead of OpenAI with no code changes.
Conclusion
Hosting your own OpenAI compatible API server gives you full control over your AI infrastructure, reduces costs, and improves privacy. By using tools like vLLM or Ollama, you can run powerful language models directly on your GPU, integrate them into existing applications, and customize performance in ways that cloud providers do not allow.
We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest updates and articles on GPU and AI Hosting.