
Fine-Tuning LLMs with LoRA: A Practical Guide

A hands-on guide to parameter-efficient fine-tuning — from dataset preparation and LoRA configuration to training loops, evaluation, and production deployment.

2025-08-20 · 10 min read
LLM · Fine-Tuning · PyTorch · LoRA

Full fine-tuning of a 7B+ parameter model requires multiple high-end GPUs and days of training. LoRA (Low-Rank Adaptation) changes the equation entirely — you can fine-tune on a single GPU in hours, producing a lightweight adapter that slots into the frozen base model. Here's how I approach it in practice.

When to Fine-Tune vs. When to Prompt

Not every task needs fine-tuning. I use a simple decision framework: if you can describe the desired behavior in a prompt and the model consistently follows it, prompt engineering is sufficient. Fine-tune when you need consistent adherence to a specific format, style, or domain language that prompting alone can't reliably produce — or when you need to reduce latency by removing complex system prompts.

Dataset Preparation

The quality of your training data determines the ceiling of your fine-tune. I follow a three-stage pipeline: generate, score, and augment.

Stage 1: Generate Teacher Data

Use a stronger model (GPT-4, Claude) to generate (input, output) pairs for your task. Be specific in your prompts about format, length, and quality criteria. I typically generate 500-1000 initial examples.

```python
# Generate training pairs with quality criteria
prompt = f"""Generate a training example for {task_description}.

Input: {sample_input}
Output requirements:
- Must follow {format_spec}
- Must include {quality_criteria}
- Must NOT include {negative_criteria}

Output:"""

response = teacher_model.generate(prompt, temperature=0.7)
```

Stage 2: Quality Scoring

Every generated example gets scored on a 0-5 rubric. I automate this with a scoring model that checks for format compliance, factual consistency, and task-specific criteria. Only examples scoring 4+ make it into the final training set.
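The scoring stage boils down to a filter over (example, score) pairs. A minimal sketch, where `toy_score` is a hypothetical stand-in for the LLM-based scoring model (here it just checks format compliance and length against a JSON output spec; neither function is from the original post):

```python
def filter_by_score(examples, score_fn, threshold=4):
    """Keep only examples whose 0-5 rubric score meets the threshold."""
    return [ex for ex in examples if score_fn(ex) >= threshold]

def toy_score(example):
    """Stand-in judge: award rubric points per criterion (a real judge would be a model)."""
    score = 0
    if example["output"].strip():
        score += 2  # non-empty output
    if example["output"].startswith("{"):
        score += 2  # format compliance (JSON expected in this toy rubric)
    if len(example["output"]) < 2000:
        score += 1  # within length bound
    return score
```

In practice the judge is itself an LLM call returning a structured score, but the keep/drop logic around it stays this simple.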

Stage 3: Augmentation

Paraphrase augmentation doubles the effective dataset size. I also add edge cases — short inputs, ambiguous scenarios, adversarial prompts. The final split is 90% train, 10% eval.
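The augment-then-split step can be sketched as follows; `paraphrase_fn` is a hypothetical hook (a call to a paraphrasing model in practice), and the 90/10 split matches the ratio above:

```python
import random

def augment_and_split(examples, paraphrase_fn, eval_frac=0.10, seed=42):
    """Double the dataset via input paraphrases, then split 90/10 train/eval."""
    augmented = examples + [
        {"input": paraphrase_fn(ex["input"]), "output": ex["output"]}
        for ex in examples
    ]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(augmented)
    n_eval = max(1, int(len(augmented) * eval_frac))
    return augmented[n_eval:], augmented[:n_eval]  # (train, eval)
```

Shuffling before the split matters: originals and their paraphrases should land on both sides so the eval set isn't systematically easier.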

LoRA Configuration

The key hyperparameters are rank (r), alpha, target modules, and dropout. Here's what I've found works across multiple projects.

```python
from peft import LoraConfig

# Recommended starting config
config = LoraConfig(
    r=16,              # Rank: 8-128, higher = more capacity
    lora_alpha=32,     # Scaling: typically 2x rank
    target_modules=[   # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# For domain-specific tasks needing high fidelity:
# r=64-128, use_rslora=True, alpha=2*r
```

Start with r=16 and scale up only if evaluation metrics plateau. Higher rank means more trainable parameters and longer training — but doesn't always mean better results. I've seen r=16 outperform r=128 on well-curated datasets.
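To see what "more trainable parameters" means concretely: each adapted linear layer gains two factors, A (r × d_in) and B (d_out × r), i.e. r·(d_in + d_out) parameters. A quick back-of-the-envelope count, assuming the standard Llama-2-7B geometry (hidden size 4096, MLP intermediate 11008, 32 layers) and the target modules above:

```python
def lora_params(d_in, d_out, r):
    """Trainable params LoRA adds to one linear: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

r = 16
hidden, intermediate, n_layers = 4096, 11008, 32  # Llama-2-7B geometry

attn = 4 * lora_params(hidden, hidden, r)         # q_proj, k_proj, v_proj, o_proj
mlp = (2 * lora_params(hidden, intermediate, r)   # gate_proj, up_proj
       + lora_params(intermediate, hidden, r))    # down_proj
total = n_layers * (attn + mlp)
print(f"{total:,} trainable params")  # ~40M, well under 1% of the 7B base
```

After wrapping the model with peft you can confirm the count with `model.print_trainable_parameters()`; quadrupling r to 64 quadruples this number, which is why the sweep upward should be metric-driven.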

Training Loop

I use the Hugging Face Trainer with a few specific settings. Learning rate around 2e-4 works well for LoRA. Three epochs is usually sufficient — beyond that you risk overfitting, especially with smaller datasets.

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```

Evaluation Strategy

Automated metrics (loss, perplexity) only tell part of the story. I always run an A/B comparison: generate outputs from both the fine-tuned and base models on the same test set, then score them side-by-side on task-specific rubrics. This catches cases where loss decreases but output quality doesn't actually improve.
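The A/B comparison reduces to a win-rate tally over paired outputs. A minimal sketch, where `judge_fn` is a hypothetical stand-in for the rubric-based judge (in practice another LLM call, blinded to which output came from which model):

```python
def win_rate(pairs, judge_fn):
    """pairs: list of (fine_tuned_output, base_output).
    judge_fn returns 'ft', 'base', or 'tie' for each pair."""
    verdicts = [judge_fn(ft, base) for ft, base in pairs]
    n = len(verdicts)
    return {
        "ft_wins": verdicts.count("ft") / n,
        "base_wins": verdicts.count("base") / n,
        "ties": verdicts.count("tie") / n,
    }
```

Randomizing which output is presented first to the judge avoids position bias; a fine-tune that can't beat the base model head-to-head isn't worth shipping, whatever the loss curve says.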

Production Deployment

The beauty of LoRA is deployment simplicity. Your adapter is typically 50-500MB regardless of the base model size. Load the base model once, merge the adapter, and serve. For multi-tenant scenarios, you can hot-swap adapters without reloading the base model.
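The merge itself is just the LoRA update folded into the frozen base weight, W' = W + (alpha/r)·B·A, which is what peft's `merge_and_unload()` performs per adapted layer. A minimal numpy sketch with toy shapes (not real model weights):

```python
import numpy as np

def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA adapter into a base weight matrix: W' = W + (alpha/r) * B @ A."""
    return W + (alpha / r) * (B @ A)

# Toy dimensions: a (d_out x d_in) base weight and rank-r factors.
d_out, d_in, r, alpha = 8, 8, 2, 4
W = np.zeros((d_out, d_in))
A = np.ones((r, d_in))   # down-projection, initialized non-trivially here for illustration
B = np.ones((d_out, r))  # up-projection
W_merged = merge_lora(W, A, B, alpha, r)
```

After merging, inference cost is identical to the base model, which is why merge-and-serve is the default; keep the unmerged adapter around for the hot-swap scenario, since merging is destructive per served copy.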

Always version your adapters alongside your training data and config. A reproducible fine-tune pipeline — from raw data to deployed adapter — is worth more than any single model checkpoint.