Everything You Need to Know Before Running a 70B Model on One Server
Can you run a 70B model on one server? The answer is yes, but only with the right conditions. The hardware you choose, the precision format you use, and whether you’re serving one user or many will all determine whether that single server runs smoothly or not.
This guide explores everything you need to plan your setup correctly, from VRAM budgets to real configuration examples.
Understanding 70B Model Size Before You Buy Hardware
A 70B model has 70 billion parameters. Each parameter takes up memory, and how much depends on the numeric format (precision) you use to store it.
Here are the main precision and quantization formats, i.e., the different ways of storing a model’s parameters (weights) in memory:
- FP16 (16-bit): 2 bytes per parameter: ~140 GB VRAM needed.
- INT8 (8-bit): 1 byte per parameter: ~70 GB VRAM needed.
- INT4 or 4-bit (Q4): ~0.5 bytes per parameter: ~35 to 42 GB VRAM needed.
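You can reproduce these figures with simple arithmetic. The sketch below (plain Python, no libraries) estimates weights-only memory and deliberately ignores KV cache, activations, and framework overhead, which is why real deployments need more headroom:

```python
# Rough weights-only VRAM estimate for a 70B model.
# Ignores KV cache, activations, and framework overhead.
PARAMS = 70e9

bytes_per_param = {
    "FP16": 2.0,
    "INT8": 1.0,
    "Q4":   0.5,  # real 4-bit quants land around 0.5-0.6 bytes/param
}

for fmt, bpp in bytes_per_param.items():
    gb = PARAMS * bpp / 1e9
    print(f"{fmt}: ~{gb:.0f} GB for weights alone")
# FP16: ~140 GB, INT8: ~70 GB, Q4: ~35 GB
# (4-bit formats end up at ~35-42 GB once group metadata and
#  mixed-precision layers are included)
```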
This is the foundation of your planning. Before anything else, you must decide which format fits your quality and budget goals.
These numbers give you a starting point, but VRAM needs also change based on your context length, batch size, and whether you’re running inference or fine-tuning. For a full breakdown, check out the VRAM guide for LLMs.
Can You Run a 70B Model on One Server With FP16?
Running a 70B model at full FP16 precision on one server is technically possible, but it needs the right hardware. You need at least 140 GB of VRAM, which means a single A100 80GB card won’t be enough; you need at least two A100 80GB or two H100 80GB GPUs connected via NVLink.
This setup is best for research labs or production environments that need maximum output quality.
Who should use FP16:
- Production APIs serving thousands of users.
- Research requiring the highest accuracy.
- Fine-tuning pipelines, not just inference.
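If you do take the 16-bit route on a two-GPU box, one simple starting point is letting Hugging Face transformers shard the weights across both cards. This is a minimal sketch, not a production serving stack; the model ID is an example, and it assumes roughly 160 GB of total VRAM is available:

```python
# Sketch: shard a 70B model in BF16 across two 80 GB GPUs.
# Model ID is illustrative; swap in the checkpoint you actually deploy.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 2 bytes per parameter, ~140 GB of weights
    device_map="auto",           # splits layers across the available GPUs
)

inputs = tokenizer("Summarize NVLink in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For real multi-user FP16 serving you would normally move to a dedicated engine such as vLLM with tensor parallelism (shown later in Setup 3), but this is enough to confirm the model fits.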
Running a 70B Model With INT8: What to Expect
With INT8 quantization, each parameter drops to 1 byte, which brings the total model weight to ~70 GB. This means a single NVIDIA A100 80GB or H100 80GB can technically hold the model, though with very little space left for context (the KV cache).
INT8 on a single 80GB GPU works for light single-user use, but throughput is limited. For multi-user deployments, you still need at least two GPUs to leave space for the KV cache at longer context windows.
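A common way to get INT8 weights onto a single 80 GB card is 8-bit loading through bitsandbytes in transformers. A minimal sketch, assuming the bitsandbytes package is installed and using a placeholder model ID:

```python
# Sketch: load a 70B model with 8-bit weights on a single 80 GB GPU.
# Weights take ~70 GB, leaving little headroom for the KV cache.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-70B-Instruct"  # illustrative

quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Explain the KV cache in two sentences.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=80)[0], skip_special_tokens=True))
```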
Why Most Teams Choose 4-Bit for 70B Deployments
Most teams that want to run a 70B model on one server choose 4-bit quantization such as GPTQ, AWQ, or GGUF Q4_K_M. At this level, the model weights shrink to about 35 to 42 GB, which fits comfortably on a single NVIDIA L40S (48 GB) or a dual RTX 4090 (24 GB × 2) setup.
In most tests, 4-bit models perform nearly as well as FP16 for everyday chat and reasoning tasks.
Best hardware choices for 4-bit single-server inference:
| GPU | VRAM | Can Hold 70B Q4? | Notes |
|---|---|---|---|
| NVIDIA H100 80GB | 80 GB | ✅ Yes, with room | Best single-card option |
| NVIDIA A100 80GB | 80 GB | ✅ Yes, tight | Limited KV cache space |
| NVIDIA L40S | 48 GB | ✅ Yes | Great price/performance |
| 2× RTX 4090 | 48 GB total | ✅ Yes | Consumer-grade, slower |
| Single RTX 4090 | 24 GB | ❌ No (partial offload only) | CPU offload needed |
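For the GGUF route, llama-cpp-python is the usual way to run a Q4_K_M build from a single file. A minimal sketch; the model path and context size are placeholders you’d adjust to your hardware:

```python
# Sketch: run a 70B Q4_K_M GGUF on a 48 GB GPU with llama-cpp-python.
# Model path and context length are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="/models/llama-70b-q4_k_m.gguf",  # hypothetical local path
    n_gpu_layers=-1,  # offload all layers to the GPU if they fit
    n_ctx=4096,       # keep context modest on a 48 GB card
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Give me three uses for a 70B model."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```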
How Much Does Context Length Affect VRAM Usage?
The VRAM estimates above cover model weights only. In real use, the KV cache, which stores conversation context, adds memory on top of that, and a longer context window means more VRAM used per request.
- At 2K context, a Q4 70B model on a 48GB GPU works fine.
- At 8K context, that same card starts getting tight.
- At 32K+ context, you need 80GB+ VRAM even at 4-bit.
If your use case involves long documents, multi-turn chat, or RAG pipelines with large chunks, plan for at least 80 GB VRAM even if you’re using 4-bit quantization.
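You can estimate the KV cache yourself. The sketch below assumes Llama-2-70B-style dimensions (80 layers, 8 KV heads with grouped-query attention, head size 128) and an FP16 cache; adjust the constants for your exact model:

```python
# Rough KV cache size for one sequence, assuming Llama-2-70B-style dims:
# 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache (2 bytes/element).
def kv_cache_gb(context_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x for keys and values
    return context_len * per_token / 1e9

for ctx in (2_048, 8_192, 32_768):
    print(f"{ctx:>6} tokens: ~{kv_cache_gb(ctx):.1f} GB per concurrent sequence")
# ~0.7 GB at 2K, ~2.7 GB at 8K, ~10.7 GB at 32K -- and that is per user,
# so concurrency multiplies it.
```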
The Difference Between Single-User and Multi-User Inference
This is where many teams make mistakes. A server that handles one user at a time has very different needs from one serving a team of 10+ concurrent users.
Single-user inference, Q4, H100 80GB:
- Throughput: ~24–33 tokens per second.
- Response feels near real-time.
- One H100 or one A100 80GB is usually enough.
Multi-user, FP16 or INT8, multi-GPU:
- A 4× H100 setup peaks at ~7,000 tokens per second across 500 concurrent users.
- A 4× A100 setup saturates at ~570 tokens per second under load.
- H100 offers about 12 to 14× higher throughput than A100 under concurrent load.
If you’re building an internal API or a product that multiple users query at the same time, plan for at least 2× the VRAM you think you need to maintain acceptable latency.
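To see how your own server behaves under concurrency, a small async load test against an OpenAI-compatible endpoint (both vLLM and TGI expose one) is usually enough. The URL, model name, and concurrency level below are assumptions, not values from this guide:

```python
# Sketch: fire N concurrent requests at a local OpenAI-compatible endpoint
# (vLLM / TGI style). URL, model name, and concurrency are placeholders.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="llama-70b",  # whatever name your server registers
        messages=[{"role": "user", "content": f"Write one sentence about request {i}."}],
        max_tokens=64,
    )
    return resp.usage.completion_tokens

async def main(concurrency: int = 16) -> None:
    start = time.time()
    tokens = await asyncio.gather(*(one_request(i) for i in range(concurrency)))
    elapsed = time.time() - start
    print(f"{sum(tokens)} tokens in {elapsed:.1f}s -> ~{sum(tokens) / elapsed:.0f} total TPS")

asyncio.run(main())
```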
Server Configuration Examples for Running a 70B Model
Planning a server can feel overwhelming with so many options. To make it easier, here are three ready-to-use configurations based on real workloads: one for testing, one for small production, and one for high-traffic deployments.
Setup 1. Budget Single-User Server, 4-bit, low traffic:
- GPU: 1× NVIDIA A100 80GB
- RAM: 128 GB system RAM
- Storage: NVMe SSD 1 TB+
- Format: Q4_K_M, GGUF via Ollama or llama.cpp
- Expected output: ~18 to 25 tokens per second
- Best for: Internal tools, personal assistants, dev or testing
Setup 2. Production Single Server, 8-bit, moderate traffic:
- GPU: 1 to 2× NVIDIA H100 80GB
- RAM: 256 GB system RAM
- Storage: NVMe RAID
- Format: INT8 via vLLM or TGI
- Expected output: ~30 to 50 tokens per second per user, ~100 concurrent users
- Best for: Small API deployments, SaaS products, and startup MVPs
Setup 3. High-Throughput Single Server, FP16, team-scale:
- GPU: 2 to 4× NVIDIA H100 80GB NVLink
- RAM: 512 GB system RAM
- Storage: NVMe SSD RAID
- Format: BF16 via vLLM with tensor parallelism
- Expected output: 1,000 to 7,000 total TPS across concurrent users
- Best for: Enterprise APIs, research, and high-traffic products
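For a setup like this, the vLLM Python API with tensor parallelism is the usual approach. A minimal sketch; the model ID and GPU count are assumptions matching Setup 3 rather than a tested configuration:

```python
# Sketch: BF16 70B served with vLLM tensor parallelism across 4 GPUs.
# Model ID and parallelism degree are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # example checkpoint
    dtype="bfloat16",
    tensor_parallel_size=4,       # shard weights across 4x H100
    gpu_memory_utilization=0.90,  # leave a little headroom for the KV cache
)

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["What limits throughput in multi-user LLM serving?"], params)
print(outputs[0].outputs[0].text)
```

In production you would typically expose the same engine through vLLM’s OpenAI-compatible server instead of the offline API, which is what the load-test sketch earlier assumes.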
Tip: You can see PerLod GPU configurations for large-model inference to find the right server for your 70B deployment. Also, PerLod’s AI hosting plans cover managed inference setups, so you don’t have to handle everything manually.
Conclusion
Can you run a 70B model on one server? Yes; the real question is which precision you run and how many users you need to support. For a single user or small team, a 4-bit model on a single H100 or A100 80GB card does the job well. For larger workloads, you’ll need either higher precision or more GPU memory.
Plan for context size and user load upfront; that’s what makes or breaks a single-server setup.
We hope you found this guide helpful.
FAQs
Can you run a 70B model on one server without a GPU?
Yes, using CPU-only inference with llama.cpp, but you’ll need 64 to 256 GB of RAM, and speeds will be very slow.
What’s the best framework to run a 70B model on one server?
vLLM is best for multi-user APIs and production workloads, while Ollama is easier to set up for single-user use. You can read this full Ollama vs vLLM comparison.
Can a single RTX 4090 run a 70B model?
Not fully. You can use CPU and GPU hybrid offloading, but throughput drops. Two RTX 4090s work much better.