Fine-Tuning LLMs: When to Fine-Tune, LoRA, QLoRA, and Production Workflows
Fine-tuning a large language model is one of the most powerful — and most misused — techniques in AI engineering. Most teams jump to fine-tuning when prompt engineering or RAG would solve the problem faster and cheaper. This guide walks through when fine-tuning actually makes sense, how to do it well, and the tools that make it practical.
When to Fine-Tune vs RAG vs Prompt Engineering#
Before committing to fine-tuning, understand the three main approaches to customizing LLM behavior:
Prompt engineering is the first thing to try. It requires no training, no data collection, and no GPUs. You write instructions, examples, and constraints directly in the prompt. If you can solve the problem with a well-crafted system prompt and a few-shot examples, stop there.
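To make this concrete, here is a minimal sketch of a few-shot prompt in the chat-message format most LLM APIs accept. The triage task, labels, and examples are hypothetical illustrations, not from any real system:

```python
# A minimal few-shot prompt: instructions in the system message, two worked
# examples as prior turns, and the real input last. Task and labels are
# hypothetical.
def build_messages(ticket: str) -> list[dict]:
    """Classify a support ticket via instructions plus two few-shot examples."""
    return [
        {"role": "system", "content": (
            "You are a support triage assistant. "
            "Reply with exactly one label: billing, bug, or feature-request."
        )},
        # Few-shot examples shown as prior conversation turns
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
        {"role": "user", "content": "The export button crashes the app."},
        {"role": "assistant", "content": "bug"},
        # The real query goes last
        {"role": "user", "content": ticket},
    ]

messages = build_messages("Please add dark mode.")
```

If a structure like this reliably solves the task, there is no need to go further.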
Retrieval-Augmented Generation (RAG) is the right choice when the model needs access to specific, frequently changing knowledge. Instead of baking facts into the model, you retrieve relevant documents at inference time and include them in the context. RAG is ideal for knowledge bases, documentation assistants, and domain-specific Q&A.
Fine-tuning is appropriate when you need to change the model's behavior, tone, or output format in ways that prompting cannot reliably achieve. Common use cases include:
- Consistent structured output (JSON schemas, code in a specific style)
- Domain-specific language patterns (legal, medical, financial terminology)
- Reducing latency by eliminating long system prompts
- Teaching the model a task that requires many examples to get right
- Aligning the model with specific safety or brand guidelines
Decision Framework#
Ask these questions in order:
1. Can I solve this with a better prompt? If yes, do that.
2. Does the model need knowledge it does not have? Use RAG.
3. Do I need the model to behave differently at a fundamental level? Fine-tune.
4. Do I need both new knowledge and new behavior? Combine RAG with fine-tuning.
Training Data Preparation#
The quality of your fine-tuning data determines the quality of your model. Bad data produces bad models regardless of technique.
Data Format#
Most fine-tuning workflows use instruction-response pairs in JSONL format. Each record is a single line in the file; the example below is wrapped for readability:

```json
{"messages": [
  {"role": "system", "content": "You are a medical coding assistant."},
  {"role": "user", "content": "Patient presents with acute bronchitis."},
  {"role": "assistant", "content": "ICD-10: J20.9 — Acute bronchitis, unspecified"}
]}
```
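Before uploading a training file, it is worth validating every line programmatically. A minimal sketch, assuming the message format above (the function and its checks are illustrative, not from any specific tool):

```python
import json

# Sketch: check that a JSONL line is a well-formed chat training example.
# Field names follow the {"messages": [...]} format shown above.
REQUIRED_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> bool:
    """Return True if the line parses and each message has a valid role and content."""
    record = json.loads(line)
    messages = record.get("messages", [])
    if not messages or messages[-1]["role"] != "assistant":
        return False  # every example must end with the target completion
    return all(
        m.get("role") in REQUIRED_ROLES and isinstance(m.get("content"), str)
        for m in messages
    )

line = json.dumps({"messages": [
    {"role": "user", "content": "Patient presents with acute bronchitis."},
    {"role": "assistant", "content": "ICD-10: J20.9"},
]})
```

Running a check like this over the whole file catches malformed records before they silently degrade a training run.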
Data Quality Checklist#
- Volume: 100-1000 high-quality examples is a practical starting point. More is not always better if the data is noisy.
- Diversity: Cover the full range of inputs the model will encounter in production.
- Consistency: All examples should follow the same format, tone, and quality standard.
- Deduplication: Remove near-duplicates that would bias the model toward specific patterns.
- Validation: Have domain experts review a random sample before training.
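The deduplication item can be sketched with word-level Jaccard similarity. The threshold and tokenization here are illustrative; production pipelines often use MinHash or embedding similarity instead:

```python
# Sketch of near-duplicate filtering using word-level Jaccard similarity.
# Threshold and tokenization are illustrative choices.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dedupe(examples: list[str], threshold: float = 0.9) -> list[str]:
    """Keep an example only if it is not too similar to any already-kept example."""
    kept: list[str] = []
    for ex in examples:
        if all(jaccard(ex, k) < threshold for k in kept):
            kept.append(ex)
    return kept

data = [
    "Patient presents with acute bronchitis.",
    "patient presents with Acute bronchitis.",  # near-duplicate, filtered out
    "Patient reports chronic lower back pain.",
]
```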
Common Data Mistakes#
- Using synthetic data exclusively without human verification
- Including examples that contradict each other
- Overrepresenting easy cases and underrepresenting edge cases
- Forgetting to include negative examples (what the model should refuse)
LoRA and QLoRA: Parameter-Efficient Fine-Tuning#
Full fine-tuning updates every parameter in the model, which requires enormous GPU memory. A 7B parameter model needs roughly 28 GB just for the weights in fp32, plus optimizer states.
LoRA (Low-Rank Adaptation)#
LoRA freezes the original model weights and injects small trainable matrices into each transformer layer. Instead of updating a weight matrix W directly, LoRA learns the update as a product of two small matrices, replacing W with W + BA, where B and A share an inner dimension r (the rank) that is far smaller than the dimensions of W.
Key benefits:
- Trains only 0.1-1% of total parameters
- Produces small adapter files (tens of MB instead of tens of GB)
- Multiple LoRA adapters can be swapped at inference time
- Original model weights remain unchanged
Typical LoRA hyperparameters:
- r (rank): 8-64. Higher rank captures more complex adaptations but uses more memory.
- alpha: Usually 2x the rank. Controls the scaling of the LoRA update.
- target_modules: Which layers to adapt. Start with the attention projections (q_proj, v_proj), expand to MLP layers if needed.
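The low-rank update itself is simple enough to sketch in NumPy. Dimensions here are toy-sized for clarity; the point is that only A and B would receive gradients:

```python
import numpy as np

# Sketch of the LoRA forward pass: the frozen weight W is augmented with a
# low-rank update scaled by alpha / r. Toy dimensions for illustration.
d, k, r, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init: no change at start

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha/r) * x (BA)^T; only A and B are trained."""
    return x @ W.T + (alpha / r) * (x @ (B @ A).T)

x = rng.standard_normal((1, k))
# Because B starts at zero, the adapted model initially matches the base model.
# At realistic dimensions (e.g. 4096x4096 with r=8), A and B together hold
# well under 1% of W's parameters.
```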
QLoRA (Quantized LoRA)#
QLoRA combines LoRA with 4-bit quantization of the base model. The frozen weights are stored in 4-bit NormalFloat format, while the LoRA adapters train in bf16. This dramatically reduces memory requirements:
- Fine-tune a 7B model on a single 24 GB GPU
- Fine-tune a 70B model on a single 80 GB A100
- Minimal quality loss compared to full-precision LoRA
QLoRA uses double quantization and paged optimizers to further reduce memory footprint.
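The headline memory numbers follow from simple arithmetic on the weight storage. A back-of-the-envelope sketch (real usage also depends on activations, sequence length, batch size, and optimizer state, so treat these as lower bounds):

```python
# Rough memory estimate for the frozen base weights alone, in GB.
# Illustrative approximation; does not count activations or optimizer state.
def weight_memory_gb(n_params_billions: float, bits: int) -> float:
    return n_params_billions * 1e9 * bits / 8 / 1e9

base_fp32 = weight_memory_gb(7, bits=32)   # full precision: ~28 GB
base_4bit = weight_memory_gb(7, bits=4)    # QLoRA 4-bit: ~3.5 GB
```

This is why a 7B model that needs ~28 GB for fp32 weights fits comfortably on a 24 GB GPU once quantized to 4 bits.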
RLHF and DPO: Alignment Techniques#
After supervised fine-tuning (SFT), you may want to further align the model with human preferences.
RLHF (Reinforcement Learning from Human Feedback)#
RLHF is a three-stage process:
1. SFT: Fine-tune the base model on high-quality demonstrations
2. Reward model training: Train a separate model to predict human preferences between pairs of responses
3. PPO optimization: Use the reward model to guide reinforcement learning, optimizing the SFT model to produce responses that score highly
RLHF is powerful but complex. It requires training two models, managing reward hacking, and careful hyperparameter tuning.
DPO (Direct Preference Optimization)#
DPO simplifies alignment by eliminating the reward model entirely. Instead, it directly optimizes the language model using preference pairs:
```json
{
  "prompt": "Explain quantum computing",
  "chosen": "Quantum computing uses qubits that can exist in superposition...",
  "rejected": "Quantum computing is basically just faster computers..."
}
```
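The DPO objective itself is compact enough to sketch for a single preference pair. The log-probabilities below are made-up scalars; in practice they are sequence log-likelihoods from the policy being trained and a frozen reference copy:

```python
import math

# Sketch of the DPO loss for one preference pair. Inputs are illustrative
# scalars standing in for sequence log-likelihoods.
def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss falls below log(2).
loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-12.0)
```

Intuitively, the loss pushes the policy to widen the gap between chosen and rejected responses relative to the reference model, with beta controlling how far it may drift.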
DPO is simpler to implement, more stable to train, and often produces comparable results to RLHF. It has become the preferred alignment technique for most teams.
Evaluation#
Training a fine-tuned model without proper evaluation is flying blind. Set up evaluation before you start training.
Automated Metrics#
- Loss curves: Monitor training and validation loss. Divergence signals overfitting.
- Perplexity: Lower is better, but only meaningful when compared against a baseline.
- Task-specific metrics: Accuracy, F1, BLEU, ROUGE — whatever matches your use case.
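Perplexity, for instance, is just the exponential of the mean per-token cross-entropy loss (in nats), which is why it only carries meaning relative to a baseline. The loss values below are illustrative:

```python
import math

# Perplexity is exp of the mean cross-entropy loss per token (in nats).
def perplexity(mean_token_loss: float) -> float:
    return math.exp(mean_token_loss)

# Illustrative: a fine-tune that drops validation loss from 2.1 to 1.8
# lowers perplexity from roughly 8.2 to roughly 6.0.
```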
LLM-as-Judge#
Use a stronger model (like GPT-4 or Claude) to evaluate your fine-tuned model's outputs. Define rubrics with specific criteria and score on a scale. This is faster than human evaluation and correlates well with human judgments when the rubric is clear.
Human Evaluation#
For high-stakes applications, there is no substitute for human evaluation. Use blind A/B comparisons between the base model, the fine-tuned model, and previous versions. Track inter-annotator agreement to ensure your evaluation is reliable.
Evaluation Pitfalls#
- Evaluating only on examples similar to training data (test on out-of-distribution inputs)
- Ignoring regression on general capabilities
- Not testing for hallucination rates before and after fine-tuning
Tools and Platforms#
Axolotl#
An open-source fine-tuning framework that supports LoRA, QLoRA, full fine-tuning, DPO, and RLHF. Configuration-driven with YAML files. Handles multi-GPU training, dataset mixing, and flash attention automatically. Best for teams that want control without writing training loops from scratch.
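To give a feel for the configuration-driven style, here is an illustrative QLoRA config in Axolotl's YAML format. Key names follow common Axolotl examples but may differ across versions, and the model name, dataset path, and hyperparameters are placeholder assumptions:

```yaml
# Illustrative Axolotl-style QLoRA config; keys and values are examples,
# not a verified recipe. Check the Axolotl docs for your installed version.
base_model: meta-llama/Llama-3.1-8B
load_in_4bit: true
adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj

datasets:
  - path: data/train.jsonl
    type: chat_template

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0002
output_dir: ./outputs/qlora-run
flash_attention: true
```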
Unsloth#
Optimized for speed. Unsloth patches the training loop to achieve 2-5x faster fine-tuning with 60% less memory. Supports LoRA and QLoRA on Llama, Mistral, and other popular architectures. Great for rapid iteration on consumer GPUs.
OpenAI Fine-Tuning API#
The simplest path if you are already in the OpenAI ecosystem. Upload a JSONL file, configure epochs, and the API handles everything. Limited to OpenAI models (GPT-4o-mini, GPT-4o). No infrastructure management, but less control over hyperparameters and no access to model weights.
Other Notable Tools#
- Hugging Face TRL: The reference library for SFT, DPO, and RLHF training
- MLflow / Weights & Biases: Experiment tracking and model versioning
- vLLM / TGI: Efficient inference serving for your fine-tuned models
Production Workflow#
A practical fine-tuning pipeline looks like this:
1. Baseline: Establish performance with prompt engineering
2. Data collection: Gather and curate training examples
3. SFT training: Fine-tune with LoRA/QLoRA, monitoring loss curves
4. Evaluation: Run automated and human evaluations
5. Alignment (optional): Apply DPO if preference data is available
6. Deployment: Serve with vLLM or merge LoRA weights into the base model
7. Monitoring: Track production metrics and collect feedback for the next iteration
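The merge option in the deployment step can be sketched in NumPy: folding the adapter into the base weight once means inference needs no extra matmul, at the cost of losing hot-swappable adapters. Dimensions and values are toy placeholders:

```python
import numpy as np

# Sketch of merging a trained LoRA adapter into the base weight for
# deployment. Toy dimensions; real merges operate per target module.
rng = np.random.default_rng(1)
d, k, r, alpha = 32, 32, 4, 8
W = rng.standard_normal((d, k))       # frozen base weight
A = rng.standard_normal((r, k))       # trained adapter factors
B = rng.standard_normal((d, r))
scaling = alpha / r

W_merged = W + scaling * (B @ A)      # one-time merge

x = rng.standard_normal((1, k))
adapter_out = x @ W.T + scaling * (x @ (B @ A).T)   # adapter path: two matmuls
merged_out = x @ W_merged.T                          # merged path: one matmul
```

The two paths produce identical outputs, which is the check any merge script should make before shipping the merged weights.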
Summary#
Fine-tuning is a precision tool, not a default strategy. Start with prompt engineering, add RAG for knowledge, and fine-tune only when you need to fundamentally change model behavior. When you do fine-tune, use LoRA/QLoRA for efficiency, prepare high-quality data, evaluate rigorously, and consider DPO for alignment. The tooling has matured — Axolotl, Unsloth, and the OpenAI API make the process accessible to any engineering team.