Fine-Tuning LLMs: When to Fine-Tune, LoRA, QLoRA, and Production Workflows
Fine-tuning a large language model is one of the most powerful — and most misused — techniques in AI engineering. Most teams jump to fine-tuning when prompt engineering or RAG would solve the problem faster and cheaper. This guide walks through when fine-tuning actually makes sense, how to do it well, and the tools that make it practical.
When to Fine-Tune vs RAG vs Prompt Engineering#
Before committing to fine-tuning, understand the three main approaches to customizing LLM behavior:
Prompt engineering is the first thing to try. It requires no training, no data collection, and no GPUs. You write instructions, examples, and constraints directly in the prompt. If you can solve the problem with a well-crafted system prompt and a few-shot examples, stop there.
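To make this concrete, here is a minimal sketch of a few-shot prompt in the chat-message format most LLM APIs accept. The triage task, labels, and examples are hypothetical illustrations, not from any real system:

```python
# A minimal few-shot prompt: instructions in the system message, two worked
# examples as prior turns, and the real input last. Task and labels are
# hypothetical.
def build_messages(ticket: str) -> list[dict]:
    """Classify a support ticket via instructions plus two few-shot examples."""
    return [
        {"role": "system", "content": (
            "You are a support triage assistant. "
            "Reply with exactly one label: billing, bug, or feature-request."
        )},
        # Few-shot examples shown as prior conversation turns
        {"role": "user", "content": "I was charged twice this month."},
        {"role": "assistant", "content": "billing"},
        {"role": "user", "content": "The export button crashes the app."},
        {"role": "assistant", "content": "bug"},
        # The real query goes last
        {"role": "user", "content": ticket},
    ]

messages = build_messages("Please add dark mode.")
```

If a structure like this reliably solves the task, there is no need to go further.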
Retrieval-Augmented Generation (RAG) is the right choice when the model needs access to specific, frequently changing knowledge. Instead of baking facts into the model, you retrieve relevant documents at inference time and include them in the context. RAG is ideal for knowledge bases, documentation assistants, and domain-specific Q&A.
Fine-tuning is appropriate when you need to change the model's behavior, tone, or output format in ways that prompting cannot reliably achieve. Common use cases include:
- Consistent structured output (JSON schemas, code in a specific style)
- Domain-specific language patterns (legal, medical, financial terminology)
- Reducing latency by eliminating long system prompts
- Teaching the model a task that requires many examples to get right
- Aligning the model with specific safety or brand guidelines
Decision Framework#
Ask these questions in order:
1. Can I solve this with a better prompt? If yes, do that.
2. Does the model need knowledge it does not have? Use RAG.
3. Do I need the model to behave differently at a fundamental level? Fine-tune.
4. Do I need both new knowledge and new behavior? Combine RAG with fine-tuning.
Training Data Preparation#
The quality of your fine-tuning data determines the quality of your model. Bad data produces bad models regardless of technique.
Data Format#
Most fine-tuning workflows use instruction-response pairs in JSONL format. Each record is a single line in the file; the example below is wrapped for readability:

```json
{"messages": [
  {"role": "system", "content": "You are a medical coding assistant."},
  {"role": "user", "content": "Patient presents with acute bronchitis."},
  {"role": "assistant", "content": "ICD-10: J20.9 — Acute bronchitis, unspecified"}
]}
```
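Before uploading a training file, it is worth validating every line programmatically. A minimal sketch, assuming the message format above (the function and its checks are illustrative, not from any specific tool):

```python
import json

# Sketch: check that a JSONL line is a well-formed chat training example.
# Field names follow the {"messages": [...]} format shown above.
REQUIRED_ROLES = {"system", "user", "assistant"}

def validate_example(line: str) -> bool:
    """Return True if the line parses and each message has a valid role and content."""
    record = json.loads(line)
    messages = record.get("messages", [])
    if not messages or messages[-1]["role"] != "assistant":
        return False  # every example must end with the target completion
    return all(
        m.get("role") in REQUIRED_ROLES and isinstance(m.get("content"), str)
        for m in messages
    )

line = json.dumps({"messages": [
    {"role": "user", "content": "Patient presents with acute bronchitis."},
    {"role": "assistant", "content": "ICD-10: J20.9"},
]})
```

Running a check like this over the whole file catches malformed records before they silently degrade a training run.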
Data Quality Checklist#
- Volume: 100-1000 high-quality examples is a practical starting point. More is not always better if the data is noisy.
- Diversity: Cover the full range of inputs the model will encounter in production.
- Consistency: All examples should follow the same format, tone, and quality standard.
- Deduplication: Remove near-duplicates that would bias the model toward specific patterns.
- Validation: Have domain experts review a random sample before training.
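The deduplication item can be sketched with word-level Jaccard similarity. The threshold and tokenization here are illustrative; production pipelines often use MinHash or embedding similarity instead:

```python
# Sketch of near-duplicate filtering using word-level Jaccard similarity.
# Threshold and tokenization are illustrative choices.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def dedupe(examples: list[str], threshold: float = 0.9) -> list[str]:
    """Keep an example only if it is not too similar to any already-kept example."""
    kept: list[str] = []
    for ex in examples:
        if all(jaccard(ex, k) < threshold for k in kept):
            kept.append(ex)
    return kept

data = [
    "Patient presents with acute bronchitis.",
    "patient presents with Acute bronchitis.",  # near-duplicate, filtered out
    "Patient reports chronic lower back pain.",
]
```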
Common Data Mistakes#
- Using synthetic data exclusively without human verification
- Including examples that contradict each other
- Overrepresenting easy cases and underrepresenting edge cases
- Forgetting to include negative examples (what the model should refuse)
LoRA and QLoRA: Parameter-Efficient Fine-Tuning#
Full fine-tuning updates every parameter in the model, which requires enormous GPU memory. A 7B parameter model needs roughly 28 GB just for the weights in fp32, plus optimizer states.
LoRA (Low-Rank Adaptation)#
LoRA freezes the original model weights and injects small trainable matrices into each transformer layer. Instead of updating a weight matrix W directly, LoRA learns the update as a product of two small matrices, replacing W with W + BA, where B and A share an inner dimension r (the rank) that is far smaller than the dimensions of W.
Key benefits:
- Trains only 0.1-1% of total parameters
- Produces small adapter files (tens of MB instead of tens of GB)
- Multiple LoRA adapters can be swapped at inference time
- Original model weights remain unchanged
Typical LoRA hyperparameters:
- r (rank): 8-64. Higher rank captures more complex adaptations but uses more memory.
- alpha: Usually 2x the rank. Controls the scaling of the LoRA update.
- target_modules: Which layers to adapt. Start with the attention projections (q_proj, v_proj), expand to MLP layers if needed.
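The low-rank update itself is simple enough to sketch in NumPy. Dimensions here are toy-sized for clarity; the point is that only A and B would receive gradients:

```python
import numpy as np

# Sketch of the LoRA forward pass: the frozen weight W is augmented with a
# low-rank update scaled by alpha / r. Toy dimensions for illustration.
d, k, r, alpha = 64, 64, 8, 16

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init: no change at start

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x W^T + (alpha/r) * x (BA)^T; only A and B are trained."""
    return x @ W.T + (alpha / r) * (x @ (B @ A).T)

x = rng.standard_normal((1, k))
# Because B starts at zero, the adapted model initially matches the base model.
# At realistic dimensions (e.g. 4096x4096 with r=8), A and B together hold
# well under 1% of W's parameters.
```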
QLoRA (Quantized LoRA)#
QLoRA combines LoRA with 4-bit quantization of the base model. The frozen weights are stored in 4-bit NormalFloat format, while the LoRA adapters train in bf16. This dramatically reduces memory requirements:
- Fine-tune a 7B model on a single 24 GB GPU
- Fine-tune a 70B model on a single 80 GB A100
- Minimal quality loss compared to full-precision LoRA
QLoRA uses double quantization and paged optimizers to further reduce memory footprint.
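The headline memory numbers follow from simple arithmetic on the weight storage. A back-of-the-envelope sketch (real usage also depends on activations, sequence length, batch size, and optimizer state, so treat these as lower bounds):

```python
# Rough memory estimate for the frozen base weights alone, in GB.
# Illustrative approximation; does not count activations or optimizer state.
def weight_memory_gb(n_params_billions: float, bits: int) -> float:
    return n_params_billions * 1e9 * bits / 8 / 1e9

base_fp32 = weight_memory_gb(7, bits=32)   # full precision: ~28 GB
base_4bit = weight_memory_gb(7, bits=4)    # QLoRA 4-bit: ~3.5 GB
```

This is why a 7B model that needs ~28 GB for fp32 weights fits comfortably on a 24 GB GPU once quantized to 4 bits.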
RLHF and DPO: Alignment Techniques#
After supervised fine-tuning (SFT), you may want to further align the model with human preferences.
RLHF (Reinforcement Learning from Human Feedback)#
RLHF is a three-stage process:
1. SFT: Fine-tune the base model on high-quality demonstrations
2. Reward model training: Train a separate model to predict human preferences between pairs of responses
3. PPO optimization: Use the reward model to guide reinforcement learning, optimizing the SFT model to produce responses that score highly
RLHF is powerful but complex. It requires training two models, managing reward hacking, and careful hyperparameter tuning.
DPO (Direct Preference Optimization)#
DPO simplifies alignment by eliminating the reward model entirely. Instead, it directly optimizes the language model using preference pairs:
```json
{
  "prompt": "Explain quantum computing",
  "chosen": "Quantum computing uses qubits that can exist in superposition...",
  "rejected": "Quantum computing is basically just faster computers..."
}
```
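The DPO objective itself is compact enough to sketch for a single preference pair. The log-probabilities below are made-up scalars; in practice they are sequence log-likelihoods from the policy being trained and a frozen reference copy:

```python
import math

# Sketch of the DPO loss for one preference pair. Inputs are illustrative
# scalars standing in for sequence log-likelihoods.
def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss falls below log(2).
loss = dpo_loss(policy_chosen=-10.0, policy_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-12.0)
```

Intuitively, the loss pushes the policy to widen the gap between chosen and rejected responses relative to the reference model, with beta controlling how far it may drift.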
DPO is simpler to implement, more stable to train, and often produces comparable results to RLHF. It has become the preferred alignment technique for most teams.
Evaluation#
Training a fine-tuned model without proper evaluation is flying blind. Set up evaluation before you start training.
Automated Metrics#
- Loss curves: Monitor training and validation loss. Divergence signals overfitting.
- Perplexity: Lower is better, but only meaningful when compared against a baseline.
- Task-specific metrics: Accuracy, F1, BLEU, ROUGE — whatever matches your use case.
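Perplexity, for instance, is just the exponential of the mean per-token cross-entropy loss (in nats), which is why it only carries meaning relative to a baseline. The loss values below are illustrative:

```python
import math

# Perplexity is exp of the mean cross-entropy loss per token (in nats).
def perplexity(mean_token_loss: float) -> float:
    return math.exp(mean_token_loss)

# Illustrative: a fine-tune that drops validation loss from 2.1 to 1.8
# lowers perplexity from roughly 8.2 to roughly 6.0.
```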
LLM-as-Judge#
Use a stronger model (like GPT-4 or Claude) to evaluate your fine-tuned model's outputs. Define rubrics with specific criteria and score on a scale. This is faster than human evaluation and correlates well with human judgments when the rubric is clear.
Human Evaluation#
For high-stakes applications, there is no substitute for human evaluation. Use blind A/B comparisons between the base model, the fine-tuned model, and previous versions. Track inter-annotator agreement to ensure your evaluation is reliable.
Evaluation Pitfalls#
- Evaluating only on examples similar to training data (test on out-of-distribution inputs)
- Ignoring regression on general capabilities
- Not testing for hallucination rates before and after fine-tuning
Tools and Platforms#
Axolotl#
An open-source fine-tuning framework that supports LoRA, QLoRA, full fine-tuning, DPO, and RLHF. Configuration-driven with YAML files. Handles multi-GPU training, dataset mixing, and flash attention automatically. Best for teams that want control without writing training loops from scratch.
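To give a feel for the configuration-driven style, here is an illustrative QLoRA config in Axolotl's YAML format. Key names follow common Axolotl examples but may differ across versions, and the model name, dataset path, and hyperparameters are placeholder assumptions:

```yaml
# Illustrative Axolotl-style QLoRA config; keys and values are examples,
# not a verified recipe. Check the Axolotl docs for your installed version.
base_model: meta-llama/Llama-3.1-8B
load_in_4bit: true
adapter: qlora

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj

datasets:
  - path: data/train.jsonl
    type: chat_template

sequence_len: 2048
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 0.0002
output_dir: ./outputs/qlora-run
flash_attention: true
```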
Unsloth#
Optimized for speed. Unsloth patches the training loop to achieve 2-5x faster fine-tuning with 60% less memory. Supports LoRA and QLoRA on Llama, Mistral, and other popular architectures. Great for rapid iteration on consumer GPUs.
OpenAI Fine-Tuning API#
The simplest path if you are already in the OpenAI ecosystem. Upload a JSONL file, configure epochs, and the API handles everything. Limited to OpenAI models (GPT-4o-mini, GPT-4o). No infrastructure management, but less control over hyperparameters and no access to model weights.
Other Notable Tools#
- Hugging Face TRL: The reference library for SFT, DPO, and RLHF training
- MLflow / Weights & Biases: Experiment tracking and model versioning
- vLLM / TGI: Efficient inference serving for your fine-tuned models
Production Workflow#
A practical fine-tuning pipeline looks like this:
1. Baseline: Establish performance with prompt engineering
2. Data collection: Gather and curate training examples
3. SFT training: Fine-tune with LoRA/QLoRA, monitoring loss curves
4. Evaluation: Run automated and human evaluations
5. Alignment (optional): Apply DPO if preference data is available
6. Deployment: Serve with vLLM or merge LoRA weights into the base model
7. Monitoring: Track production metrics and collect feedback for the next iteration
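The merge option in the deployment step can be sketched in NumPy: folding the adapter into the base weight once means inference needs no extra matmul, at the cost of losing hot-swappable adapters. Dimensions and values are toy placeholders:

```python
import numpy as np

# Sketch of merging a trained LoRA adapter into the base weight for
# deployment. Toy dimensions; real merges operate per target module.
rng = np.random.default_rng(1)
d, k, r, alpha = 32, 32, 4, 8
W = rng.standard_normal((d, k))       # frozen base weight
A = rng.standard_normal((r, k))       # trained adapter factors
B = rng.standard_normal((d, r))
scaling = alpha / r

W_merged = W + scaling * (B @ A)      # one-time merge

x = rng.standard_normal((1, k))
adapter_out = x @ W.T + scaling * (x @ (B @ A).T)   # adapter path: two matmuls
merged_out = x @ W_merged.T                          # merged path: one matmul
```

The two paths produce identical outputs, which is the check any merge script should make before shipping the merged weights.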
Summary#
Fine-tuning is a precision tool, not a default strategy. Start with prompt engineering, add RAG for knowledge, and fine-tune only when you need to fundamentally change model behavior. When you do fine-tune, use LoRA/QLoRA for efficiency, prepare high-quality data, evaluate rigorously, and consider DPO for alignment. The tooling has matured — Axolotl, Unsloth, and the OpenAI API make the process accessible to any engineering team.