Retrieval-Augmented Generation (RAG) — Architecture, Chunking, and Evaluation
Large language models know a lot, but they do not know your data. They hallucinate when asked about private documents, recent events, or domain-specific facts.
RAG fixes this by retrieving relevant context from your own data and injecting it into the prompt before generation. The LLM answers based on evidence, not memory.
RAG Architecture Overview#
Every RAG system has two phases:
Indexing (Offline)#
- Load — ingest documents from sources (PDFs, databases, APIs, wikis)
- Chunk — split documents into retrieval-sized pieces
- Embed — convert each chunk into a vector using an embedding model
- Store — write vectors and metadata into a vector database
Query (Online)#
- Embed the query — convert the user question into a vector
- Retrieve — find the top-K most similar chunks via ANN search
- Rerank (optional) — re-score results with a cross-encoder for precision
- Augment — insert retrieved chunks into the LLM prompt
- Generate — the LLM produces an answer grounded in the retrieved context
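The query phase can be sketched in a few lines. This is a toy illustration, not a production pipeline: the vectors here are plain Python lists, and a real system would call an embedding model and an ANN index instead of brute-force cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors (plain lists here)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec, index, k=3):
    """index: list of (chunk_text, vector) pairs. Returns the top-K chunk texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Augment step: inject retrieved chunks into the generation prompt."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

In a real system, `query_vec` comes from the same embedding model used at index time, and `retrieve` is replaced by a vector-database query.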
Chunking Strategies#
Chunking is the most underrated part of RAG. Bad chunks produce bad retrieval, which produces bad answers.
Fixed-Size Chunking#
Split text every N tokens (e.g., 512) with optional overlap (e.g., 50 tokens). Simple and predictable.
Pros: Easy to implement, consistent chunk sizes. Cons: Splits mid-sentence, mid-paragraph, or mid-thought.
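A minimal sketch of fixed-size chunking with overlap, operating on a pre-tokenized list (any tokenizer's output works; word splitting is a crude stand-in):

```python
def fixed_size_chunks(tokens, size=512, overlap=50):
    """Split a token list into fixed-size windows, each sharing `overlap`
    tokens with the previous window."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

The overlap means a sentence cut at one boundary still appears whole in the neighboring chunk.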
Recursive Character Splitting#
Split by paragraph, then sentence, then character — stopping as soon as chunks are under the size limit. This is the default in LangChain.
Pros: Respects natural text boundaries. Cons: Variable chunk sizes can make retrieval scoring inconsistent.
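The idea can be sketched as follows. Note this is a simplification of what LangChain actually does: the real splitter also merges adjacent small pieces back up toward the size limit, which this sketch omits.

```python
def recursive_split(text, max_len=200, separators=("\n\n", ". ", " ")):
    """Try the coarsest separator first; recurse on pieces still too long."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Last resort: hard character split.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```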
Semantic Chunking#
Use an embedding model to detect topic shifts. Start a new chunk when the embedding similarity between consecutive sentences drops below a threshold.
Pros: Each chunk contains a coherent topic. Cons: Slower to compute, requires an extra embedding pass.
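A sketch of the boundary logic. To keep it self-contained, word-set overlap (Jaccard) stands in for embedding cosine similarity; a real implementation would embed each sentence and compare the vectors.

```python
def jaccard(a, b):
    """Toy stand-in for embedding similarity: word-set overlap."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def semantic_chunks(sentences, threshold=0.2, similarity=jaccard):
    """Start a new chunk whenever similarity between consecutive
    sentences drops below the threshold (a topic shift)."""
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if similarity(prev, sent) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Swapping `similarity` for a real embedding-cosine function turns this into the technique described above.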
Document-Aware Chunking#
Use document structure — headings, sections, code blocks, tables — to define chunk boundaries. Libraries like Unstructured and LlamaIndex support this.
Pros: Preserves the author's intended structure. Cons: Requires format-specific parsers.
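For markdown, a minimal version of this is splitting on headings, so each chunk is one section paired with the heading that names it:

```python
def split_by_headings(markdown):
    """One (heading, body) chunk per section of a markdown document."""
    chunks, heading, body = [], None, []
    for line in markdown.splitlines():
        if line.startswith("#"):
            if body:
                chunks.append((heading, "\n".join(body).strip()))
            heading, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    if body:
        chunks.append((heading, "\n".join(body).strip()))
    return chunks
```

Storing the heading as metadata alongside the chunk also helps at retrieval time (see the note on metadata filtering below); libraries like Unstructured generalize this to PDFs, tables, and code blocks.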
Chunk Size Guidelines#
- 256-512 tokens — good default for most use cases
- 128 tokens — better for fine-grained retrieval (Q&A over technical docs)
- 1024 tokens — better when chunks need surrounding context (legal, medical)
- 10-20% overlap — prevents losing information at chunk boundaries
Embedding Models#
The embedding model determines retrieval quality. Key options:
| Model | Dimensions | Context | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8K tokens | Strong general-purpose, supports dimension reduction |
| OpenAI text-embedding-3-small | 1536 | 8K tokens | Lower cost, good for most use cases |
| Cohere embed-v3 | 1024 | 512 tokens | Supports search_document and search_query input types |
| BGE-large-en-v1.5 | 1024 | 512 tokens | Open-source, strong MTEB scores |
| Nomic embed-text-v1.5 | 768 | 8K tokens | Open-source, long context, Matryoshka support |
| GTE-Qwen2 | 1536 | 32K tokens | Open-source, very long context |
Best practice: Always use the same model for indexing and querying. Mixing models produces incompatible vector spaces.
Vector Store Selection#
Your vector store holds the indexed chunks. Options, ordered from simplest to most complex:
- In-memory (FAISS, NumPy) — prototyping, under 100K chunks
- Chroma — lightweight, Python-native, good for local dev
- pgvector — vectors alongside relational data in PostgreSQL
- Pinecone — fully managed, zero-ops, metadata filtering
- Weaviate — open-source, built-in vectorization, hybrid search
- Milvus — open-source, billion-scale, GPU acceleration
Reranking#
Initial retrieval with ANN search optimizes for speed. Reranking optimizes for precision.
How it works:
- Retrieve top-50 candidates via vector similarity (fast, bi-encoder)
- Pass each (query, candidate) pair through a cross-encoder
- The cross-encoder scores relevance more accurately because it sees both texts together
- Return the top-5 reranked results
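The two-stage flow above can be sketched like this. Both scoring functions here are placeholders: `bi_encoder_score` is a toy shared-word count standing in for vector similarity, and `cross_encoder` is whatever reranking model the caller supplies (e.g., a Cohere Rerank or bge-reranker call).

```python
def bi_encoder_score(query, doc):
    """Cheap first-stage score (toy stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve_and_rerank(query, docs, cross_encoder, first_k=50, final_k=5):
    """Stage 1: fast, approximate ranking over everything.
    Stage 2: slower, accurate cross-encoder ranking over the survivors."""
    candidates = sorted(docs, key=lambda d: bi_encoder_score(query, d),
                        reverse=True)[:first_k]
    return sorted(candidates, key=lambda d: cross_encoder(query, d),
                  reverse=True)[:final_k]
```

The asymmetry is the point: the expensive model only ever sees `first_k` documents, not the whole corpus.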
Tools:
- Cohere Rerank API
- bge-reranker-v2 (open-source)
- Jina Reranker
- FlashRank (lightweight, local)
Reranking typically improves answer quality by 10-25% on retrieval benchmarks, though the gain depends on the domain and on how noisy the first-stage retrieval is.
Context Window Management#
LLMs have finite context windows. You cannot dump 100 retrieved chunks into a prompt.
Strategies:
Token Budget#
Allocate a fixed token budget for context (e.g., 4K out of 8K total). Fill it with the highest-ranked chunks until the budget is full.
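A greedy version of this is straightforward. The default `count_tokens` here is a crude word-count stand-in; a real system would use the model's own tokenizer (e.g., tiktoken for OpenAI models).

```python
def fill_token_budget(ranked_chunks, budget=4000,
                      count_tokens=lambda t: len(t.split())):
    """Add chunks in rank order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # or `continue`, to let smaller lower-ranked chunks in
        selected.append(chunk)
        used += cost
    return selected
```

The `break` vs `continue` choice is a real design decision: `break` preserves strict rank order, while `continue` packs the budget more tightly at the cost of occasionally skipping a higher-ranked chunk.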
Map-Reduce#
For questions that span many documents, summarize each chunk independently (map), then combine summaries into a final answer (reduce). LangChain supports this via MapReduceDocumentsChain.
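The control flow is simple enough to show directly. `summarize` and `combine` are placeholders for the two LLM calls; the stubs in the test below just extract and join text so the shape is verifiable without a model.

```python
def map_reduce_answer(question, chunks, summarize, combine):
    """Map: condense each chunk independently (parallelizable LLM calls).
    Reduce: merge the partial summaries into one final answer."""
    partials = [summarize(question, chunk) for chunk in chunks]  # map step
    return combine(question, partials)                           # reduce step
```

Because the map step treats chunks independently, it parallelizes well, and no single LLM call ever has to fit all the retrieved text in context.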
Stuffing with Compression#
Use an LLM or an extractive model to compress each chunk to only the relevant sentences before stuffing them into the prompt. LangChain's ContextualCompressionRetriever does this.
Hierarchical Retrieval#
Index both summaries (coarse) and full chunks (fine). First retrieve relevant summaries, then drill into the full chunks of matched summaries. LlamaIndex calls this "recursive retrieval."
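A coarse-to-fine sketch, again with a toy shared-word score standing in for vector similarity:

```python
def score(query, text):
    """Toy relevance score (stand-in for embedding similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def hierarchical_retrieve(query, docs, k_docs=1, k_chunks=2):
    """docs: {doc_id: {"summary": str, "chunks": [str, ...]}}.
    Stage 1: rank documents by their summaries.
    Stage 2: rank only the winning documents' chunks."""
    ranked = sorted(docs, key=lambda d: score(query, docs[d]["summary"]),
                    reverse=True)
    candidates = [c for d in ranked[:k_docs] for c in docs[d]["chunks"]]
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)[:k_chunks]
```

The payoff is that the fine-grained search only runs over chunks from documents the coarse pass already judged relevant.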
Advanced RAG Patterns#
Multi-Query RAG#
Generate multiple reformulations of the user question, retrieve for each, and merge results. This captures different angles of the same question and improves recall.
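The merge step matters: results from the different reformulations must be fused into one ranking. Reciprocal Rank Fusion (RRF) is a common choice, sketched here (the constant `k=60` is the value conventionally used in the RRF literature):

```python
def rrf_merge(result_lists, k=60):
    """Reciprocal Rank Fusion: a doc's score is the sum of 1/(k + rank)
    over every result list it appears in, so docs ranked high in any
    reformulation's results bubble up."""
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```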
HyDE (Hypothetical Document Embeddings)#
Ask the LLM to generate a hypothetical answer, embed that answer, and use it as the retrieval query. The hypothetical answer is often closer in embedding space to the real documents than the question itself.
Self-RAG#
The LLM decides whether retrieval is needed, retrieves if so, critiques the retrieved passages, and generates a response. This avoids unnecessary retrieval for simple factual questions.
Corrective RAG (CRAG)#
After retrieval, a grader evaluates whether the retrieved documents are relevant. If not, the system falls back to web search or asks for clarification.
Evaluation Metrics#
RAG has two failure modes: bad retrieval and bad generation. Measure both.
Retrieval Metrics#
- Recall@K — what fraction of relevant documents appear in the top-K results?
- MRR (Mean Reciprocal Rank) — how high does the first relevant result rank?
- NDCG — measures ranking quality accounting for position
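Recall@K and MRR are simple enough to compute by hand, given ranked retrieval results and a labeled set of relevant documents per query:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that appear in the top-K results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(queries):
    """queries: list of (retrieved_list, relevant_set) pairs.
    Mean over queries of 1/rank of the first relevant result
    (0 contribution if no relevant result is retrieved)."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```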
Generation Metrics#
- Faithfulness — does the answer use only information from the retrieved context? (no hallucination)
- Answer relevance — does the answer actually address the question?
- Context relevance — are the retrieved chunks relevant to the question?
Evaluation Tools#
- RAGAS — open-source framework that scores faithfulness, relevance, and context quality
- DeepEval — LLM-based evaluation with customizable metrics
- LangSmith — tracing and evaluation platform from LangChain
- Phoenix (Arize) — observability and evaluation for LLM applications
Tools and Frameworks#
LangChain#
The most popular RAG framework. Provides document loaders, text splitters, embedding integrations, vector store connectors, and chain abstractions. Supports Python and JavaScript.
LlamaIndex#
Purpose-built for RAG. Excels at data ingestion, advanced indexing (tree, keyword table, knowledge graph), and query engines. Strong support for hierarchical and recursive retrieval.
Haystack#
Open-source by deepset. Pipeline-based architecture with nodes for retrieval, reranking, and generation. Good for production deployments with REST API support.
Common Mistakes#
- Chunks too large — retrieval returns irrelevant padding around the useful content
- No overlap — critical information at chunk boundaries is lost
- Ignoring metadata — filtering by source, date, or category before vector search dramatically improves relevance
- Skipping reranking — the jump from top-50 to top-5 matters more than you think
- No evaluation — you cannot improve what you do not measure
Start Building#
RAG is the bridge between LLMs and your private data. Get chunking right, pick the right embedding model, add a reranker, and measure everything. The difference between a demo and a production RAG system is in these details.
Article #326 of 327. Explore all articles, templates, and tools at codelit.io.