Retrieval-Augmented Generation (RAG) — Architecture, Chunking, and Evaluation
Large language models know a lot, but they do not know your data. They hallucinate when asked about private documents, recent events, or domain-specific facts.
RAG fixes this by retrieving relevant context from your own data and injecting it into the prompt before generation. The LLM answers based on evidence, not memory.
RAG Architecture Overview#
Every RAG system has two phases:
Indexing (Offline)#
- Load — ingest documents from sources (PDFs, databases, APIs, wikis)
- Chunk — split documents into retrieval-sized pieces
- Embed — convert each chunk into a vector using an embedding model
- Store — write vectors and metadata into a vector database
Query (Online)#
- Embed the query — convert the user question into a vector
- Retrieve — find the top-K most similar chunks via ANN search
- Rerank (optional) — re-score results with a cross-encoder for precision
- Augment — insert retrieved chunks into the LLM prompt
- Generate — the LLM produces an answer grounded in the retrieved context
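The query phase can be sketched in a few lines. This is a toy illustration, not a production pipeline: the vectors here are plain Python lists, and a real system would call an embedding model and an ANN index instead of brute-force cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors (plain lists here)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=3):
    """index: list of (chunk_text, vector) pairs. Returns the top-K chunk texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Augment step: inject retrieved chunks into the generation prompt."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a real system, `query_vec` comes from the same embedding model used at index time, and `retrieve` is replaced by a vector-database query.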
Chunking Strategies#
Chunking is the most underrated part of RAG. Bad chunks produce bad retrieval, which produces bad answers.
Fixed-Size Chunking#
Split text every N tokens (e.g., 512) with optional overlap (e.g., 50 tokens). Simple and predictable.
Pros: Easy to implement, consistent chunk sizes. Cons: Splits mid-sentence, mid-paragraph, or mid-thought.
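A minimal sketch of fixed-size chunking with overlap, operating on a pre-tokenized list (any tokenizer's output works; word splitting is a crude stand-in):

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into fixed-size windows, each sharing `overlap`
    tokens with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

The overlap means a sentence cut at one boundary still appears whole in the neighboring chunk.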
Recursive Character Splitting#
Split by paragraph, then sentence, then character — stopping as soon as chunks are under the size limit. This is the default in LangChain.
Pros: Respects natural text boundaries. Cons: Variable chunk sizes can make retrieval scoring inconsistent.
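The idea can be sketched as follows. Note this is a simplification of what LangChain actually does: the real splitter also merges adjacent small pieces back up toward the size limit, which this sketch omits.

```python
def recursive_split(text, max_len=200, separators=("\n\n", ". ", " ")):
    """Try the coarsest separator first; recurse on pieces still too long."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Last resort: hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```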
Semantic Chunking#
Use an embedding model to detect topic shifts. Start a new chunk when the embedding similarity between consecutive sentences drops below a threshold.
Pros: Each chunk contains a coherent topic. Cons: Slower to compute, requires an extra embedding pass.
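A sketch of the boundary logic. To keep it self-contained, word-set overlap (Jaccard) stands in for embedding cosine similarity; a real implementation would embed each sentence and compare the vectors.

```python
def jaccard(a, b):
    """Toy stand-in for embedding similarity: word-set overlap."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_chunks(sentences, threshold=0.2, similarity=jaccard):
    """Start a new chunk whenever similarity between consecutive
    sentences drops below the threshold (a topic shift)."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if similarity(prev, sent) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Swapping `similarity` for a real embedding-cosine function turns this into the technique described above.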
Document-Aware Chunking#
Use document structure — headings, sections, code blocks, tables — to define chunk boundaries. Libraries like Unstructured and LlamaIndex support this.
Pros: Preserves the author's intended structure. Cons: Requires format-specific parsers.
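For markdown, a minimal version of this is splitting on headings, so each chunk is one section paired with the heading that names it:

```python
def split_by_headings(markdown):
    """One (heading, body) chunk per section of a markdown document."""
    chunks, heading, body = [], None, []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if body:
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    if body:
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```

Storing the heading as metadata alongside the chunk also helps at retrieval time (see the note on metadata filtering below); libraries like Unstructured generalize this to PDFs, tables, and code blocks.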
Chunk Size Guidelines#
- 256-512 tokens — good default for most use cases
- 128 tokens — better for fine-grained retrieval (Q&A over technical docs)
- 1024 tokens — better when chunks need surrounding context (legal, medical)
- 10-20% overlap — prevents losing information at chunk boundaries
Embedding Models#
The embedding model determines retrieval quality. Key options:
| Model | Dimensions | Context | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K tokens | Strong general-purpose, supports dimension reduction |
| OpenAI text-embedding-3-small | 1536 | 8K tokens | Lower cost, good for most use cases |
| Cohere embed-v3 | 1024 | 512 tokens | Supports search_document and search_query input types |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open-source, strong MTEB scores |
| Nomic embed-text-v1.5 | 768 | 8K tokens | Open-source, long context, Matryoshka support |
| GTE-Qwen2 | 1536 | 32K tokens | Open-source, very long context |
Best practice: Always use the same model for indexing and querying. Mixing models produces incompatible vector spaces.
Vector Store Selection#
Your vector store holds the indexed chunks. Options, ordered from simplest to most complex:
- In-memory (FAISS, NumPy) — prototyping, under 100K chunks
- Chroma — lightweight, Python-native, good for local dev
- pgvector — vectors alongside relational data in PostgreSQL
- Pinecone — fully managed, zero-ops, metadata filtering
- Weaviate — open-source, built-in vectorization, hybrid search
- Milvus — open-source, billion-scale, GPU acceleration
Reranking#
Initial retrieval with ANN search optimizes for speed. Reranking optimizes for precision.
How it works:
- Retrieve top-50 candidates via vector similarity (fast, bi-encoder)
- Pass each (query, candidate) pair through a cross-encoder
- The cross-encoder scores relevance more accurately because it sees both texts together
- Return the top-5 reranked results
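The two-stage flow above can be sketched like this. Both scoring functions here are placeholders: `bi_encoder_score` is a toy shared-word count standing in for vector similarity, and `cross_encoder` is whatever reranking model the caller supplies (e.g., a Cohere Rerank or bge-reranker call).

```python
def bi_encoder_score(query, doc):
    """Cheap first-stage score (toy stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve_and_rerank(query, docs, cross_encoder, first_k=50, final_k=5):
    """Stage 1: fast, approximate ranking over everything.
    Stage 2: slower, accurate cross-encoder ranking over the survivors."""
    candidates = sorted(docs, key=lambda d: bi_encoder_score(query, d),
                        reverse=True)[:first_k]
    return sorted(candidates, key=lambda d: cross_encoder(query, d),
                  reverse=True)[:final_k]
```

The asymmetry is the point: the expensive model only ever sees `first_k` documents, not the whole corpus.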
Tools:
- Cohere Rerank API
- bge-reranker-v2 (open-source)
- Jina Reranker
- FlashRank (lightweight, local)
Reranking typically improves answer quality by 10-25% on retrieval benchmarks, though the gain depends on the domain and on how noisy the first-stage retrieval is.
Context Window Management#
LLMs have finite context windows. You cannot dump 100 retrieved chunks into a prompt.
Strategies:
Token Budget#
Allocate a fixed token budget for context (e.g., 4K out of 8K total). Fill it with the highest-ranked chunks until the budget is full.
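A greedy version of this is straightforward. The default `count_tokens` here is a crude word-count stand-in; a real system would use the model's own tokenizer (e.g., tiktoken for OpenAI models).

```python
def fill_token_budget(ranked_chunks, budget=4000,
                      count_tokens=lambda t: len(t.split())):
    """Add chunks in rank order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # or `continue`, to let smaller lower-ranked chunks in
        selected.append(chunk)
        used += cost
    return selected
```

The `break` vs `continue` choice is a real design decision: `break` preserves strict rank order, while `continue` packs the budget more tightly at the cost of occasionally skipping a higher-ranked chunk.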
Map-Reduce#
For questions that span many documents, summarize each chunk independently (map), then combine summaries into a final answer (reduce). LangChain supports this via MapReduceDocumentsChain.
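The control flow is simple enough to show directly. `summarize` and `combine` are placeholders for the two LLM calls; the stubs in the test below just extract and join text so the shape is verifiable without a model.

```python
def map_reduce_answer(question, chunks, summarize, combine):
    """Map: condense each chunk independently (parallelizable LLM calls).
    Reduce: merge the partial summaries into one final answer."""
    partials = [summarize(question, chunk) for chunk in chunks]  # map step
    return combine(question, partials)                           # reduce step
```

Because the map step treats chunks independently, it parallelizes well, and no single LLM call ever has to fit all the retrieved text in context.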
Stuffing with Compression#
Use an LLM or an extractive model to compress each chunk to only the relevant sentences before stuffing them into the prompt. LangChain's ContextualCompressionRetriever does this.
Hierarchical Retrieval#
Index both summaries (coarse) and full chunks (fine). First retrieve relevant summaries, then drill into the full chunks of matched summaries. LlamaIndex calls this "recursive retrieval."
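A coarse-to-fine sketch, again with a toy shared-word score standing in for vector similarity:

```python
def score(query, text):
    """Toy relevance score (stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_retrieve(query, docs, k_docs=1, k_chunks=2):
    """docs: {doc_id: {"summary": str, "chunks": [str, ...]}}.
    Stage 1: rank documents by their summaries.
    Stage 2: rank only the winning documents' chunks."""
    ranked = sorted(docs, key=lambda d: score(query, docs[d]["summary"]),
                    reverse=True)
    candidates = [c for d in ranked[:k_docs] for c in docs[d]["chunks"]]
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:k_chunks]
```

The payoff is that the fine-grained search only runs over chunks from documents the coarse pass already judged relevant.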
Advanced RAG Patterns#
Multi-Query RAG#
Generate multiple reformulations of the user question, retrieve for each, and merge results. This captures different angles of the same question and improves recall.
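The merge step matters: results from the different reformulations must be fused into one ranking. Reciprocal Rank Fusion (RRF) is a common choice, sketched here (the constant `k=60` is the value conventionally used in the RRF literature):

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal Rank Fusion: a doc's score is the sum of 1/(k + rank)
    over every result list it appears in, so docs ranked high in any
    reformulation's results bubble up."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```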
HyDE (Hypothetical Document Embeddings)#
Ask the LLM to generate a hypothetical answer, embed that answer, and use it as the retrieval query. The hypothetical answer is often closer in embedding space to the real documents than the question itself.
Self-RAG#
The LLM decides whether retrieval is needed, retrieves if so, critiques the retrieved passages, and generates a response. This avoids unnecessary retrieval for simple factual questions.
Corrective RAG (CRAG)#
After retrieval, a grader evaluates whether the retrieved documents are relevant. If not, the system falls back to web search or asks for clarification.
Evaluation Metrics#
RAG has two failure modes: bad retrieval and bad generation. Measure both.
Retrieval Metrics#
- Recall@K — what fraction of relevant documents appear in the top-K results?
- MRR (Mean Reciprocal Rank) — how high does the first relevant result rank?
- NDCG — measures ranking quality accounting for position
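Recall@K and MRR are simple enough to compute by hand, given ranked retrieval results and a labeled set of relevant documents per query:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-K results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """queries: list of (retrieved_list, relevant_set) pairs.
    Mean over queries of 1/rank of the first relevant result
    (0 contribution if no relevant result is retrieved)."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```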
Generation Metrics#
- Faithfulness — does the answer use only information from the retrieved context? (no hallucination)
- Answer relevance — does the answer actually address the question?
- Context relevance — are the retrieved chunks relevant to the question?
Evaluation Tools#
- RAGAS — open-source framework that scores faithfulness, relevance, and context quality
- DeepEval — LLM-based evaluation with customizable metrics
- LangSmith — tracing and evaluation platform from LangChain
- Phoenix (Arize) — observability and evaluation for LLM applications
Tools and Frameworks#
LangChain#
The most popular RAG framework. Provides document loaders, text splitters, embedding integrations, vector store connectors, and chain abstractions. Supports Python and JavaScript.
LlamaIndex#
Purpose-built for RAG. Excels at data ingestion, advanced indexing (tree, keyword table, knowledge graph), and query engines. Strong support for hierarchical and recursive retrieval.
Haystack#
Open-source by deepset. Pipeline-based architecture with nodes for retrieval, reranking, and generation. Good for production deployments with REST API support.
Common Mistakes#
- Chunks too large — retrieval returns irrelevant padding around the useful content
- No overlap — critical information at chunk boundaries is lost
- Ignoring metadata — filtering by source, date, or category before vector search dramatically improves relevance
- Skipping reranking — the jump from top-50 to top-5 matters more than you think
- No evaluation — you cannot improve what you do not measure
Start Building#
RAG is the bridge between LLMs and your private data. Get chunking right, pick the right embedding model, add a reranker, and measure everything. The difference between a demo and a production RAG system is in these details.
Article #326 of 327. Explore all articles, templates, and tools at codelit.io.