Fine-Tuning Llama 3 Models on GPU VPS

Llama 3 Fine-Tuning on GPU VPS is one of the most practical ways to build a custom LLM without giving up control of your data, costs, or deployment environment. Instead of relying on fully managed platforms, you can run the entire pipeline directly on your own GPU instance, with predictable performance and repeatable results.

In this guide, you will learn how to set up a complete workflow for fine-tuning Llama 3 using modern, resource-efficient methods designed for GPU VPS and dedicated servers.

Prerequisites and Hardware Requirements for Llama 3 Fine-Tuning on GPU VPS

Before you begin fine-tuning Llama 3 models, make sure your GPU VPS meets these minimum specifications:

Minimum Requirements for Llama 3 8B include:

  • GPU: NVIDIA GPU with at least 16-24GB VRAM, such as RTX 4090 or higher.
  • RAM: 32GB DDR5, 64GB recommended.
  • Storage: 100GB+ free space for model, datasets, and dependencies.
  • OS: Ubuntu 22.04 LTS or newer.
  • CUDA: Version 12.1 or higher.

For Llama 3 70B with QLoRA:

  • GPU: 48GB+ VRAM, such as RTX 6000 Ada, A100, or H100.
  • RAM: 128GB+.
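
Before installing anything, you can quickly confirm how much VRAM your VPS actually exposes. This assumes the NVIDIA driver (and therefore nvidia-smi) is already present on your provider's base image:

# Show GPU name, total VRAM, and driver version
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv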

If you need a high-performance GPU VPS, you can check PerLod Hosting, which offers the best plans for your Llama 3 Fine-Tuning on GPU VPS.

The key to efficient fine-tuning is using Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA and QLoRA, which reduce memory requirements while maintaining model quality.

Prepare GPU VPS for Fine-Tuning Llama 3

Before you can fine-tune Llama 3, you need to prepare your GPU VPS with the right software stack. You must install CUDA drivers for GPU acceleration, set up Python with all necessary deep learning libraries, and configure your environment for optimal performance.

Run the system update and upgrade with the command below:

sudo apt update && sudo apt upgrade -y

Install the NVIDIA CUDA Toolkit with the following commands:

# Add NVIDIA package repositories
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update

# Install CUDA Toolkit
sudo apt install cuda-toolkit-12-4 -y

Once the installation is complete, verify that the GPU and NVIDIA driver are detected with the command below:

nvidia-smi

Then, create a Python Virtual Environment to isolate your project dependencies:

# Install Python and venv
sudo apt install python3.11 python3.11-venv python3-pip -y

# Create virtual environment
python3.11 -m venv llama3_finetune_env

# Activate environment
source llama3_finetune_env/bin/activate

From your Python environment shell, install PyTorch with CUDA support and the essential libraries using the following commands (the cu121 wheels bundle their own CUDA runtime, so they work alongside the CUDA 12.4 toolkit as long as your NVIDIA driver is recent enough):

# Upgrade pip
pip install --upgrade pip

# Install PyTorch with CUDA 12.1 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Verify PyTorch CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA version: {torch.version.cuda}')"

Also, you can install Unsloth, which provides 2-5x faster training with reduced memory usage:

pip install unsloth

Alternatively, you can install the latest development version of Unsloth directly from GitHub:

# Install the PyPI release first so all dependencies get pulled in
pip install unsloth

# Then swap in the latest development builds without touching dependencies
pip uninstall unsloth unsloth_zoo -y
pip install --no-deps git+https://github.com/unslothai/unsloth_zoo.git
pip install --no-deps git+https://github.com/unslothai/unsloth.git

Finally, install the required libraries and training tools with the commands below:

# Install Hugging Face libraries and training tools
pip install transformers datasets accelerate peft bitsandbytes trl
pip install scipy sentencepiece protobuf

# Install monitoring tools
pip install tensorboard wandb

Note: For a more comprehensive GPU environment setup guide, you can check this guide on LoRA Model Training on Budget GPUs.

Access and Prepare Llama 3 Model

Llama 3 is a gated model, so Meta requires you to request access before you can download it. Once your request is accepted, you authenticate with Hugging Face and download the model to your VPS.

Visit the Meta Llama website, request access, and accept the terms and conditions.

Then, visit the Hugging Face Llama 3 model page, request access, and wait for approval.

After approval, you can authenticate with Hugging Face:

# Install Hugging Face CLI
pip install huggingface-hub

# Login with your token
huggingface-cli login
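
For automated or non-interactive setups, you can also log in from Python using a token stored in an environment variable; a minimal sketch, where HF_TOKEN is an assumed variable name:

import os
from huggingface_hub import login

# Read the Hugging Face access token from an environment variable
# (HF_TOKEN is an assumed name) instead of pasting it interactively
login(token=os.environ["HF_TOKEN"])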

Create a Python test script to verify model access, and run it from inside your virtual environment:

from unsloth import FastLanguageModel
import torch

# Test loading Llama 3 8B
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=2048,
    dtype=None,  # Auto-detect optimal dtype
    load_in_4bit=True,  # Use 4-bit quantization
)

print("Model loaded successfully!")
print(f"Model device: {model.device}")
print(f"Model dtype: {model.dtype}")

Prepare Your Custom Training Dataset for Llama 3

Fine-tuning Llama 3 depends entirely on the quality and format of your training data. Llama 3 expects a very specific chat template format with special tokens that mark different parts of the conversation.

At this point, you will learn how to structure your dataset correctly, whether you’re creating a small custom dataset or loading a larger dataset from a file.

Your dataset should follow this structure:

from datasets import Dataset

# Llama 3 chat template
TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_message}<|eot_id|>"""

# Example dataset
data = [
    {
        "text": TEMPLATE.format(
            system_message="You are a helpful AI assistant specialized in technical documentation.",
            user_message="How do I install Docker on Ubuntu?",
            assistant_message="Here's how to install Docker on Ubuntu:\n\n1. Update package index: sudo apt update\n2. Install prerequisites: sudo apt install apt-transport-https ca-certificates curl software-properties-common\n3. Add Docker's GPG key: curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -\n4. Add Docker repository: sudo add-apt-repository \"deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable\"\n5. Install Docker: sudo apt update && sudo apt install docker-ce\n6. Verify installation: sudo docker run hello-world"
        )
    },
    {
        "text": TEMPLATE.format(
            system_message="You are a helpful AI assistant specialized in technical documentation.",
            user_message="What is Kubernetes?",
            assistant_message="Kubernetes is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications. It groups containers into logical units for easy management and discovery."
        )
    },
    # Add more examples...
]

# Create dataset
dataset = Dataset.from_list(data)
dataset = dataset.train_test_split(test_size=0.1)  # 90% train, 10% validation

print(f"Training samples: {len(dataset['train'])}")
print(f"Validation samples: {len(dataset['test'])}")

Explanation:

  • The template includes special tokens required by Llama 3’s instruction format.
  • <|begin_of_text|>: Marks the start of conversation.
  • <|start_header_id|>…<|end_header_id|>: Wraps role identifiers.
  • <|eot_id|>: End-of-turn token separating messages.
  • train_test_split: Automatically creates training and validation sets.
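
As an alternative to writing the template by hand, the Llama 3 Instruct tokenizer ships with a built-in chat template. A minimal sketch using tokenizer.apply_chat_template with the tokenizer loaded earlier:

# Let the tokenizer apply Llama 3's chat template for you
messages = [
    {"role": "system", "content": "You are a helpful AI assistant specialized in technical documentation."},
    {"role": "user", "content": "How do I install Docker on Ubuntu?"},
    {"role": "assistant", "content": "Here's how to install Docker on Ubuntu: ..."},
]

# tokenize=False returns the fully formatted string including the special tokens
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
print(formatted)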

For larger datasets stored in JSON format, you can use:

from datasets import load_dataset

# Load from local JSON file
dataset = load_dataset('json', data_files={
    'train': 'path/to/train.jsonl',
    'validation': 'path/to/val.jsonl'
})

# Or load from Hugging Face Hub
dataset = load_dataset("mlabonne/guanaco-llama2-1k")  # Example dataset

JSON format example (train.jsonl):

{"text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are helpful.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nHi there!<|eot_id|>"}
{"text": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are helpful.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nHow are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI'm doing well!<|eot_id|>"}

Each line must contain a text field with the formatted conversation.
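
If your examples already live in a Python list like the data list above, a small sketch of writing them out as a JSONL file (the file name is illustrative):

import json

# Write one JSON object per line, each with a "text" field
with open("train.jsonl", "w", encoding="utf-8") as f:
    for example in data:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")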

Configure LoRA Parameters for Efficient Llama 3 Fine-Tuning

Instead of training all 8 billion parameters, LoRA adds small adapter matrices to specific layers, which reduces trainable parameters from 8B to just 40-50 million, while maintaining quality.

In this step, you will learn what each LoRA hyperparameter does and how to configure them for your specific needs.

LoRA Rank (r):

  • Controls the dimensionality of adapter matrices.
  • Higher rank = more expressiveness but more memory usage.
  • Recommended values: r=8, 16, or 32 for most tasks.
  • Research shows even r=4 can perform competitively.

LoRA Alpha (α):

  • Scaling factor that controls adaptation strength.
  • Recommended: α = 2 × r or α = r.
  • Formula: scaling = α / r, so α=16 with r=8 gives scaling of 2.

Target Modules: Which layers to apply LoRA adapters to.

Options for Llama 3:

  • q_proj, v_proj: Query and Value projections (minimum, memory-efficient).
  • q_proj, k_proj, v_proj, o_proj: All attention layers (recommended).
  • All linear layers: Maximum performance but higher memory usage.

LoRA Dropout:

  • Regularization to prevent overfitting.
  • Set to 0 for short fine-tuning runs; this is also Unsloth's fastest configuration.
  • Set to 0.05-0.1 if you train for many epochs or observe overfitting.

Configure LoRA Adapters with:

from unsloth import FastLanguageModel
from peft import LoraConfig, get_peft_model

# Load base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Prepare model for PEFT
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=[
        "q_proj",  # Query projection
        "k_proj",  # Key projection
        "v_proj",  # Value projection
        "o_proj",  # Output projection
        "gate_proj",  # Gate projection (FFN)
        "up_proj",  # Up projection (FFN)
        "down_proj",  # Down projection (FFN)
    ],
    lora_alpha=32,  # 2 × rank
    lora_dropout=0,  # No dropout for small datasets
    bias="none",  # Don't train bias parameters
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized checkpointing
    random_state=3407,  # For reproducibility
    use_rslora=True,  # Rank-Stabilized LoRA for better scaling
    loftq_config=None,  # No LoftQ quantization
)

# Print trainable parameters
trainable, total = model.get_nb_trainable_parameters()
print(f"Trainable parameters: {trainable:,} ({100 * trainable / total:.2f}%)")
print(f"Total parameters: {total:,}")

Command explanation:

  • r=16, α=32: Balanced configuration providing good performance.
  • target_modules: Includes attention and FFN layers for comprehensive adaptation.
  • use_gradient_checkpointing: Reduces memory by recomputing activations during the backward pass.
  • use_rslora=True: Implements Rank-Stabilized LoRA, which scales adapters by α/√r instead of α/r for more stable behavior at higher ranks.
  • Expected trainable parameters: ~0.5-1% of total (for example, roughly 40M out of 8B); the quick estimate below shows where that number comes from.
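
As a rough sanity check, each LoRA adapter adds r × (d_in + d_out) trainable parameters per targeted projection. A back-of-the-envelope sketch using Llama 3 8B's published dimensions (hidden size 4096, grouped-query key/value width 1024, FFN width 14336, 32 decoder layers):

# Rough estimate of LoRA trainable parameters for Llama 3 8B with r=16
r = 16
hidden, kv, ffn, layers = 4096, 1024, 14336, 32

# (d_in, d_out) for each targeted projection in one decoder layer
projections = [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, ffn),     # gate_proj
    (hidden, ffn),     # up_proj
    (ffn, hidden),     # down_proj
]

per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections)
total = per_layer * layers
print(f"Estimated LoRA parameters: {total:,}")  # ~41.9M, roughly 0.5% of 8B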

Configure LoRA Training Arguments and Initialize the Trainer

Training arguments control everything, including how many samples to process at once (batch size), how much to adjust weights each step (learning rate), how often to save progress, and dozens of other parameters that affect both training speed and final model quality.

1. Understanding Training Hyperparameters

The effective batch size is calculated as:

Effective Batch Size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus

This simulates larger batch sizes without requiring more VRAM.

  • per_device_train_batch_size: Samples processed per step per GPU (typically 1-8).
  • gradient_accumulation_steps: Number of steps to accumulate gradients before updating the weights.

For example:

per_device_train_batch_size=4 with gradient_accumulation_steps=8 gives an effective batch size of 4 × 8 = 32 on a single GPU.

This uses the memory of batch size 4 but has the training stability of batch size 32.

Learning Rate:

  • Controls how much weights are updated each step.
  • Recommended for LoRA: roughly 5e-5 to 2e-4; the configuration below uses 2e-4, a common choice for LoRA and QLoRA fine-tuning.

Learning Rate Scheduler:

  • Cosine: Gradually decreases learning rate following a cosine curve (recommended).
  • Linear: Linear decrease from peak to 0.
  • Constant: No change (not recommended).

Warmup:

  • Gradually increases the learning rate from 0 to the target over the first N steps.
  • Prevents initial instability.
  • Recommended: 5-10% of total training steps or 100-500 warmup steps.
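
A small sketch of how these numbers fit together, assuming a hypothetical dataset of 5,000 training samples and the batch settings used in the configuration below:

# Back-of-the-envelope: steps per epoch, total steps, and warmup steps
num_samples = 5000          # assumed dataset size (illustrative)
per_device_batch = 4
grad_accum = 8
num_gpus = 1
epochs = 3
warmup_ratio = 0.05

effective_batch = per_device_batch * grad_accum * num_gpus   # 32
steps_per_epoch = num_samples // effective_batch             # 156
total_steps = steps_per_epoch * epochs                       # 468
warmup_steps = int(total_steps * warmup_ratio)               # 23

print(f"Effective batch size: {effective_batch}")
print(f"Total steps: {total_steps}, warmup steps: {warmup_steps}")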

2. Define Training Configuration:

from transformers import TrainingArguments
from trl import SFTTrainer

# Define training arguments
training_args = TrainingArguments(
    # Output and logging
    output_dir="./llama3_finetuned",
    run_name="llama3-8b-custom",
    
    # Training parameters
    per_device_train_batch_size=4,  # Batch size per GPU
    gradient_accumulation_steps=8,  # Effective batch = 4 × 8 = 32
    num_train_epochs=3,  # Number of complete passes through dataset
    max_steps=-1,  # -1 means train for full epochs
    
    # Learning rate
    learning_rate=2e-4,  # Higher than typical fine-tuning due to LoRA
    lr_scheduler_type="cosine",  # Cosine annealing
    warmup_ratio=0.05,  # Warmup for 5% of training
    
    # Optimization
    optim="adamw_8bit",  # 8-bit AdamW optimizer for memory efficiency
    weight_decay=0.01,  # L2 regularization
    max_grad_norm=1.0,  # Gradient clipping
    
    # Precision
    fp16=False,  # Don't use FP16 (can be unstable)
    bf16=True,  # Use BF16 mixed precision (recommended for Ampere+ GPUs)
    
    # Evaluation and saving
    eval_strategy="steps",  # Evaluate every N steps
    eval_steps=50,  # Evaluate every 50 steps
    save_strategy="steps",
    save_steps=100,  # Save checkpoint every 100 steps
    save_total_limit=3,  # Keep only last 3 checkpoints
    
    # Logging
    logging_steps=10,  # Log metrics every 10 steps
    logging_dir="./logs",
    report_to="tensorboard",  # Use tensorboard for visualization
    
    # Performance
    dataloader_num_workers=4,  # Parallel data loading
    group_by_length=True,  # Group similar length samples for efficiency
    
    # Reproducibility
    seed=3407,
)

3. Initialize SFTTrainer. The keyword arguments below (tokenizer, dataset_text_field, max_seq_length, packing) match the TRL versions this guide was written against; in newer TRL releases these options have moved into SFTConfig (and tokenizer is passed as processing_class), so check your installed version if you hit unexpected keyword errors:

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
    dataset_text_field="text",  # Field containing formatted text
    max_seq_length=2048,  # Maximum sequence length
    packing=False,  # Don't pack multiple samples into one sequence
)

Start LoRA Training and Monitor Progress

At this point, you can launch the training process and watch your model learn.

Launch Training with:

# Start training
print("Starting training...")
trainer.train()

# Save final model
trainer.save_model("./llama3_finetuned_final")
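
After training completes, it can be useful to print how much VRAM the run actually needed; a small sketch using PyTorch's memory statistics (numbers will vary with your GPU and settings):

import torch

# Report peak GPU memory reserved during the training run
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Peak reserved VRAM: {peak_gb:.2f} GB of {total_gb:.2f} GB")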

Open a second terminal and monitor GPU utilization with the command below:

# Monitor GPU every 1 second
watch -n 1 nvidia-smi

GPU Utilization should be 80-100% during training.

Alternative monitoring tools include:

# Detailed utilization logging
nvidia-smi -l 1  # Loop every 1 second

# CSV logging for later analysis
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used,memory.total,temperature.gpu --format=csv -l 1 > gpu_log.csv

Install nvtop for better visualization:

sudo apt install nvtop
nvtop

You can also monitor training metrics with TensorBoard. In a third terminal, run the commands below:

# Navigate to project directory
cd /path/to/your/project

# Launch TensorBoard
tensorboard --logdir=./logs --port=6006

# Access in browser at: http://localhost:6006
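
Because training runs on a remote VPS, TensorBoard only listens on the server's localhost. A simple way to view it from your own machine is an SSH tunnel, where user and your-vps-ip are placeholders for your own credentials:

# On your local machine: forward local port 6006 to the VPS
ssh -L 6006:localhost:6006 user@your-vps-ip

# Then open http://localhost:6006 in your local browser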

Test Llama 3 Fine-Tuned Model

After training, the next step is to confirm that your fine-tuned model actually behaves better for your target task.

At this point, you can load the saved model for inference, run a few test prompts with controlled generation settings, and then compare outputs against the original base model to clearly see what changed.

Load Fine-Tuned Model for Inference:

from unsloth import FastLanguageModel
import torch

# Load fine-tuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./llama3_finetuned_final",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Prepare for inference (disables gradient computation)
FastLanguageModel.for_inference(model)

# Test prompt
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant specialized in technical documentation.<|eot_id|><|start_header_id|>user<|end_header_id|>

How do I monitor GPU usage during training?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response
outputs = model.generate(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_new_tokens=512,  # Maximum tokens to generate
    temperature=0.7,  # Randomness (0.0 = deterministic, 1.0 = creative)
    top_p=0.9,  # Nucleus sampling
    top_k=50,  # Top-k sampling
    repetition_penalty=1.1,  # Penalize repetition
    do_sample=True,  # Enable sampling
)

# Decode and print
response = tokenizer.decode(outputs[0], skip_special_tokens=False)
print(response)

Compare Base vs Fine-Tuned Model:

def compare_models(prompt, base_model_path, finetuned_model_path):
    """Compare outputs from base and fine-tuned models"""
    
    results = {}
    
    for model_name, model_path in [("Base", base_model_path), ("Fine-tuned", finetuned_model_path)]:
        # Load model
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_path,
            max_seq_length=2048,
            dtype=None,
            load_in_4bit=True,
        )
        FastLanguageModel.for_inference(model)
        
        # Generate
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = model.generate(
            **inputs,
            max_new_tokens=256,
            temperature=0.7,
            do_sample=True,
        )
        
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        results[model_name] = response
        
        # Free memory
        del model
        torch.cuda.empty_cache()
    
    return results

# Test
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Explain Docker in simple terms.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

results = compare_models(
    prompt=prompt,
    base_model_path="meta-llama/Meta-Llama-3-8B-Instruct",
    finetuned_model_path="./llama3_finetuned_final"
)

print("=== BASE MODEL ===")
print(results["Base"])
print("\n=== FINE-TUNED MODEL ===")
print(results["Fine-tuned"])

Deploy Llama 3 Fine-Tuned Model

In this step, you can merge LoRA weights into the base model and save a single Hugging Face model folder, export and quantize to GGUF for efficient CPU/GPU inference, and run the final model locally using Ollama for quick testing and serving.

Merge Adapters into Base Model:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model (full precision)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# Load and merge LoRA adapters
model = PeftModel.from_pretrained(base_model, "./llama3_finetuned_final")
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama3_merged")
tokenizer.save_pretrained("./llama3_merged")

print("Model merged and saved successfully!")

Note: Merging requires loading the full model in memory.
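
If you trained with Unsloth, it also provides convenience helpers that merge and export in one call; a minimal sketch, assuming the model and tokenizer objects from the training session are still loaded:

# Merge LoRA adapters and save a 16-bit Hugging Face folder (Unsloth helper)
model.save_pretrained_merged("llama3_merged", tokenizer, save_method="merged_16bit")

# Or export directly to a quantized GGUF file for llama.cpp / Ollama
model.save_pretrained_gguf("llama3_gguf", tokenizer, quantization_method="q4_k_m")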

Quantize to GGUF Format for Efficient Deployment:

# Install and build llama.cpp (current releases build with CMake)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Install the Python dependencies for the conversion script
pip install -r requirements.txt

# Convert to GGUF format (convert.py has been renamed to convert_hf_to_gguf.py)
python convert_hf_to_gguf.py ../llama3_merged --outtype f16 --outfile ../llama3_merged_f16.gguf

# Quantize to 4-bit (Q4_K_M is recommended)
./build/bin/llama-quantize ../llama3_merged_f16.gguf ../llama3_merged_Q4_K_M.gguf Q4_K_M

Deploy with Ollama:

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Create Modelfile
cat > Modelfile << EOF
FROM ./llama3_merged_Q4_K_M.gguf

TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER stop "<|eot_id|>"
EOF

# Import model
ollama create llama3-custom -f Modelfile

# Test model
ollama run llama3-custom "How do I monitor GPU usage?"
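
Once the model is imported, Ollama also exposes a local HTTP API (port 11434 by default) that you can call from scripts or applications:

# Query the fine-tuned model through Ollama's REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3-custom",
  "prompt": "How do I monitor GPU usage?",
  "stream": false
}'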

Conclusion

At this point, you have a complete pipeline for fine-tuning Llama 3 models on GPU VPS infrastructure. With the PEFT techniques, you can fine-tune Llama 3 8B on GPUs (16-24GB VRAM) and even scale to 70B models using QLoRA on high-end hardware.

We hope you enjoy this guide. Subscribe to our X and Facebook channels to get the latest articles on AI hosting.

For further reading:

Enterprise LLM Hosting Guide

Benchmark GPU for AI Training and Inference
