Fine-Tuning LLMs with LoRA: A Practical Guide
A hands-on guide to parameter-efficient fine-tuning — from dataset preparation and LoRA configuration to training loops, evaluation, and production deployment.
Full fine-tuning of a 7B+ parameter model requires multiple high-end GPUs and days of training. LoRA (Low-Rank Adaptation) changes the equation entirely — you can fine-tune on a single GPU in hours, producing a lightweight adapter that slots into the frozen base model. Here's how I approach it in practice.
When to Fine-Tune vs. When to Prompt
Not every task needs fine-tuning. I use a simple decision framework: if you can describe the desired behavior in a prompt and the model consistently follows it, prompt engineering is sufficient. Fine-tune when you need consistent adherence to a specific format, style, or domain language that prompting alone can't reliably produce — or when you need to reduce latency by removing complex system prompts.
Dataset Preparation
The quality of your training data determines the ceiling of your fine-tune. I follow a three-stage pipeline: generate, score, and augment.
Stage 1: Generate Teacher Data
Use a stronger model (GPT-4, Claude) to generate (input, output) pairs for your task. Be specific in your prompts about format, length, and quality criteria. I typically generate 500-1000 initial examples.
# Generate training pairs with quality criteria
prompt = f"""Generate a training example for {task_description}.
Input: {sample_input}
Output requirements:
- Must follow {format_spec}
- Must include {quality_criteria}
- Must NOT include {negative_criteria}
Output:"""
response = teacher_model.generate(prompt, temperature=0.7)

Stage 2: Quality Scoring
Every generated example gets scored on a 0-5 rubric. I automate this with a scoring model that checks for format compliance, factual consistency, and task-specific criteria. Only examples scoring 4+ make it into the final training set.
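A minimal sketch of the filtering step. The `score_example` function below is a stub standing in for whatever scoring model you use; here it only deducts points for a couple of illustrative format checks, but the keep-at-4-or-above threshold is the part that matters.

```python
def score_example(example: dict) -> int:
    """Stub rubric scorer (0-5). In practice this would call a scoring
    model that checks format compliance, factual consistency, and
    task-specific criteria."""
    output = example["output"]
    if not output.strip():
        return 0
    score = 5
    # Illustrative format checks standing in for the real rubric:
    if not output.endswith("."):
        score -= 1          # e.g. incomplete sentence
    if len(output.split()) < 5:
        score -= 2          # e.g. too short to be useful
    return score

def filter_dataset(examples: list[dict], threshold: int = 4) -> list[dict]:
    """Keep only examples scoring at or above the threshold."""
    return [ex for ex in examples if score_example(ex) >= threshold]

examples = [
    {"input": "q1", "output": "A complete, well-formed answer sentence."},
    {"input": "q2", "output": "too short"},
]
kept = filter_dataset(examples)  # only the first example survives
```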
Stage 3: Augmentation
Paraphrase augmentation doubles the effective dataset size. I also add edge cases — short inputs, ambiguous scenarios, adversarial prompts. The final split is 90% train, 10% eval.
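The augment-then-split step can be sketched as follows; `paraphrase_fn` is a placeholder for whatever paraphrasing model or rule you use, and the seeded shuffle keeps the split reproducible.

```python
import random

def make_split(examples, paraphrase_fn, eval_frac=0.1, seed=42):
    """Paraphrase-augment (doubling the dataset), shuffle, and split
    into train/eval. paraphrase_fn is supplied by the caller."""
    augmented = examples + [
        {**ex, "input": paraphrase_fn(ex["input"])} for ex in examples
    ]
    rng = random.Random(seed)       # seeded for a reproducible split
    rng.shuffle(augmented)
    n_eval = max(1, int(len(augmented) * eval_frac))
    return augmented[n_eval:], augmented[:n_eval]  # (train, eval)

# Toy usage with a trivial stand-in paraphraser:
examples = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(50)]
train, eval_set = make_split(examples, paraphrase_fn=lambda s: "Rephrased: " + s)
# 50 originals + 50 paraphrases -> 90 train / 10 eval
```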
LoRA Configuration
The key hyperparameters are rank (r), alpha, target modules, and dropout. Here's what I've found works across multiple projects.
from peft import LoraConfig
# Recommended starting config
config = LoraConfig(
    r=16,                 # Rank: 8-128, higher = more capacity
    lora_alpha=32,        # Scaling: typically 2x rank
    target_modules=[      # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# For domain-specific tasks needing high fidelity:
# r=64-128, use_rslora=True, alpha=2*r

Start with r=16 and scale up only if evaluation metrics plateau. Higher rank means more trainable parameters and longer training, but it doesn't always mean better results. I've seen r=16 outperform r=128 on well-curated datasets.
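To see concretely how rank drives trainable-parameter count: each adapted weight matrix of shape (d_out, d_in) gains two low-rank factors A (r x d_in) and B (d_out x r), adding r * (d_in + d_out) parameters. A back-of-envelope count, assuming Llama-7B-style dimensions (hidden size 4096, MLP intermediate 11008, 32 layers; these dimensions are assumptions, not something the config above pins down):

```python
# Assumed Llama-7B-style projection shapes (d_out, d_in):
HIDDEN, INTER = 4096, 11008
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (INTER, HIDDEN),
    "up_proj": (INTER, HIDDEN),
    "down_proj": (HIDDEN, INTER),
}

def lora_params(r: int, n_layers: int = 32) -> int:
    """Total trainable LoRA parameters: r * (d_in + d_out) per matrix."""
    per_layer = sum(r * (d_in + d_out) for d_out, d_in in SHAPES.values())
    return per_layer * n_layers

for r in (16, 64, 128):
    print(f"r={r}: {lora_params(r) / 1e6:.1f}M trainable params")
# r=16 works out to ~40M parameters, and the count scales linearly
# with r, so r=128 is ~8x the trainable parameters of r=16.
```

This is still well under 1% of a 7B base model even at r=128, which is why LoRA training fits on a single GPU.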
Training Loop
I use the Hugging Face Trainer with a few specific settings. Learning rate around 2e-4 works well for LoRA. Three epochs is usually sufficient — beyond that you risk overfitting, especially with smaller datasets.
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,
    bf16=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()

Evaluation Strategy
Automated metrics (loss, perplexity) only tell part of the story. I always run an A/B comparison: generate outputs from both the fine-tuned and base models on the same test set, then score them side-by-side on task-specific rubrics. This catches cases where loss decreases but output quality doesn't actually improve.
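The A/B comparison can be sketched as a small harness. The `generate` and `score_fn` callables here are placeholders for your actual generation calls and task-specific rubric; the toy usage at the bottom just exercises the plumbing.

```python
def ab_compare(test_inputs, base_generate, tuned_generate, score_fn):
    """Generate from both models on the same inputs and score side by
    side. Returns per-example scores and the tuned model's win rate."""
    results = []
    for x in test_inputs:
        base_out = base_generate(x)
        tuned_out = tuned_generate(x)
        results.append({
            "input": x,
            "base_score": score_fn(x, base_out),
            "tuned_score": score_fn(x, tuned_out),
        })
    wins = sum(r["tuned_score"] > r["base_score"] for r in results)
    return results, wins / len(results)

# Toy usage with stub models and a length-based stand-in rubric:
results, win_rate = ab_compare(
    ["a", "bb", "ccc"],
    base_generate=lambda x: x,
    tuned_generate=lambda x: x * 2,
    score_fn=lambda x, out: len(out),
)
```

Tracking the win rate over the eval set, rather than loss alone, is what surfaces the "loss went down but outputs didn't improve" failure mode.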
Production Deployment
The beauty of LoRA is deployment simplicity. Your adapter is typically 50-500MB regardless of the base model size. Load the base model once, merge the adapter, and serve. For multi-tenant scenarios, you can hot-swap adapters without reloading the base model.
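Both serving patterns use the peft API directly; a sketch, with model names and adapter paths as placeholder assumptions (this requires the actual weights to run):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("base-model-name")  # placeholder

# Option 1: merge the adapter into the base weights for single-tenant
# serving -- zero adapter overhead at inference time.
merged = PeftModel.from_pretrained(base, "path/to/adapter").merge_and_unload()

# Option 2: keep adapters separate and hot-swap per tenant without
# reloading the base model.
model = PeftModel.from_pretrained(base, "adapters/tenant_a", adapter_name="tenant_a")
model.load_adapter("adapters/tenant_b", adapter_name="tenant_b")
model.set_adapter("tenant_b")  # route subsequent requests to tenant_b
```

Note that merging is one-way for a given loaded model, so keep the unmerged adapter checkpoint around for the multi-tenant path.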
Always version your adapters alongside your training data and config. A reproducible fine-tune pipeline — from raw data to deployed adapter — is worth more than any single model checkpoint.