AI-Powered Search Architecture: Semantic Search, Hybrid Search, and RAG
AI-Powered Search Architecture#
Traditional keyword search breaks down when users describe what they want instead of typing exact terms. AI-powered search matches on meaning, not just on the literal words. This guide covers the architecture patterns for building modern search systems.
The Search Evolution#
Search Generations:
Gen 1: Keyword Match
"running shoes" → matches documents containing "running" AND "shoes"
Gen 2: Keyword + Relevance Scoring
TF-IDF, BM25 → ranks by term frequency and importance
Gen 3: Semantic Search
"comfortable shoes for jogging" → matches "lightweight running sneakers"
Gen 4: Hybrid Search + AI
Combines keyword precision with semantic understanding + reranking
Most production systems today need Gen 4. Pure keyword search misses intent. Pure semantic search misses exact matches. Hybrid search gives you both.
Semantic Search#
Semantic search converts queries and documents into vector embeddings, then finds documents whose vectors are closest to the query vector.
How It Works#
Indexing:
Document → Embedding Model → Vector [0.12, -0.45, 0.78, ...] → Vector Index
Query:
Query → Embedding Model → Vector [0.11, -0.42, 0.80, ...]
→ Nearest Neighbor Search
→ Top-K Results
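The flow above can be sketched end to end in a few lines. This is a toy illustration: the `embed` function below is a hashed bag-of-words stand-in for a real embedding model (a production system would call something like text-embedding-3-small), and the index is a brute-force scan rather than an ANN structure.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy stand-in for an embedding model: hashed bag-of-words,
    normalized to unit length so dot product equals cosine similarity."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class VectorIndex:
    def __init__(self):
        self.docs, self.vectors = [], []

    def add(self, doc):
        # Indexing: document -> vector -> stored in the index
        self.docs.append(doc)
        self.vectors.append(embed(doc))

    def search(self, query, k=3):
        # Query: embed the query, then nearest-neighbor search
        qv = embed(query)
        scored = sorted(
            ((dot(qv, v), d) for v, d in zip(self.vectors, self.docs)),
            key=lambda pair: pair[0], reverse=True,
        )
        return [d for _, d in scored[:k]]

index = VectorIndex()
for doc in ["lightweight running sneakers", "cast iron cookware",
            "waterproof hiking boots"]:
    index.add(doc)

results = index.search("running shoes", k=2)
```

Swapping the toy `embed` for a real model and the linear scan for an ANN index (HNSW, IVF) turns this sketch into the Gen 3 architecture described above.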
Embedding Models#
| Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General purpose |
| OpenAI text-embedding-3-large | 3072 | High accuracy |
| Cohere embed-v3 | 1024 | Multilingual |
| BGE / GTE (open source) | 768-1024 | Self-hosted |
| Sentence Transformers | 384-768 | Lightweight |
Chunking Strategy#
Documents must be split into chunks before embedding. Chunk size directly impacts search quality.
Chunking Approaches:
Fixed size: Split every 512 tokens (simple but breaks context)
Sentence: Split on sentence boundaries (preserves meaning)
Paragraph: Split on paragraph boundaries (preserves topic)
Semantic: Split when topic shifts (best quality, highest cost)
Hierarchical: Multiple chunk sizes, search across all levels
Best practice: 256-512 tokens per chunk with 50-100 token overlap between chunks.
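The fixed-size-with-overlap approach can be sketched as follows. Here a "token" is simply a list element (words would stand in for real tokenizer output); the sizes are parameters you would tune per the guidance above.

```python
def chunk(tokens, size=512, overlap=64):
    """Fixed-size chunking with overlap: each chunk shares its last
    `overlap` tokens with the start of the next chunk, so context that
    straddles a boundary appears in at least one chunk intact."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already covers the tail
    return chunks
```

Sentence- and paragraph-based strategies replace the fixed `step` with boundary detection, but the overlap idea carries over.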
Distance Metrics#
- Cosine similarity: Most common, works well for normalized embeddings
- Dot product: Faster, works when magnitude matters
- Euclidean distance: Useful when absolute position in vector space matters
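For reference, the three metrics in plain Python. Note that for unit-normalized vectors they agree: cosine equals dot product, and squared Euclidean distance is `2 - 2 * cosine`, so ranking by any of them gives the same order.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product scaled by both magnitudes; range [-1, 1]
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Absolute distance in the vector space; lower is closer
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```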
Hybrid Search#
Hybrid search combines keyword search (BM25) with semantic search (vectors) to get the best of both approaches.
Why Hybrid Wins#
Query: "error code 0x80070005"
Keyword search: Exact match on error code → High precision
Semantic search: Might miss the exact code → Lower precision
Query: "my app crashes when I try to save files"
Keyword search: Matches "crash" and "save" → Misses related docs
Semantic search: Understands intent → Finds "file write permission error"
Hybrid: Gets both right.
Fusion Strategies#
Combining scores from keyword and semantic search requires a fusion strategy.
Fusion Methods:
1. Linear combination:
score = alpha * bm25_score + (1 - alpha) * vector_score
(alpha typically 0.3-0.7, tune on your data)
2. Reciprocal Rank Fusion (RRF):
score = sum(1 / (k + rank_i)) for each retrieval system
(k typically 60, no score normalization needed)
3. Learned fusion:
Train a model to combine scores based on query type
RRF is the most popular because it does not require score normalization and works well out of the box.
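RRF is simple enough to show in full. Each retrieval system contributes `1 / (k + rank)` per document (ranks 1-based here; some implementations use 0-based, which only shifts the constant), and documents found by multiple systems accumulate score.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion. `rankings` is a list of result-id lists,
    each ordered best-first. Returns fused ids, best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d3", "d1", "d7"]   # keyword retrieval, best-first
vector_results = ["d1", "d5", "d3"]  # semantic retrieval, best-first
fused = rrf([bm25_results, vector_results])
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale, which is exactly why RRF works out of the box.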
Architecture#
Hybrid Search Architecture:
Query
→ [Query Processor]
→ ┌─ [BM25 Index] → keyword results
→ └─ [Vector Index] → semantic results
→ [Fusion Layer (RRF)]
→ [Reranker]
→ Final Results
Reranking#
Reranking is a second-stage model that takes the top results from retrieval and re-scores them with a more powerful (and expensive) model.
Why Rerank#
- Retrieval models (BM25, bi-encoder embeddings) are fast but approximate
- Reranking models (cross-encoders) are slow but accurate
- The two-stage pipeline gives you speed and quality
Two-Stage Pipeline:
Stage 1 (Retrieval): Fast, process millions of documents
→ Return top 100-200 candidates
Stage 2 (Reranking): Slow, process only candidates
→ Return top 10-20 final results
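The two-stage pipeline reduces to: rank everything with a cheap scorer, then re-score only the survivors with an expensive one. The scorers below are toy stand-ins (word overlap for BM25, Jaccard for a cross-encoder) just to make the sketch runnable.

```python
def word_overlap(query, doc):
    """Fast stage-1 scorer (stand-in for BM25 or a bi-encoder)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def jaccard(query, doc):
    """Slower, more discriminating stage-2 scorer (stand-in for a
    cross-encoder reranker)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def two_stage_search(query, docs, cheap_score, expensive_score,
                     n_candidates=100, k=10):
    # Stage 1: score the whole corpus cheaply, keep top candidates
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    # Stage 2: re-score only the candidates with the expensive model
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:k]

docs = ["install the driver and reboot", "reboot the router", "bake a cake"]
top = two_stage_search("how to reboot the router", docs,
                       word_overlap, jaccard, n_candidates=2, k=1)
```

The key property: the expensive scorer runs on `n_candidates` documents, not on the whole corpus, which is what keeps latency bounded.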
Reranking Models#
| Model | Type | Latency |
|---|---|---|
| Cohere Rerank | API | Low |
| Jina Reranker | API / Self-hosted | Low |
| BGE Reranker | Open source | Medium |
| Cross-encoder (custom) | Self-hosted | Medium |
| LLM-based reranking | API | High |
LLM-Based Reranking#
You can use an LLM to rerank by asking it to score relevance or sort results. This is expensive but powerful for complex queries.
LLM Reranking Prompt:
"Given the query and these 20 search results,
rank them by relevance. Return the indices
in order from most to least relevant."
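A prompt like the one above returns free-text indices, so the response needs defensive parsing: the LLM may invent indices, repeat them, or drop some. A minimal sketch of that post-processing step:

```python
def apply_llm_ranking(results, llm_output):
    """Reorder `results` according to an index list the LLM returned,
    e.g. "2, 0, 1". Invalid or duplicate indices are ignored, and any
    results the LLM omitted are appended in their original order."""
    order = []
    for part in llm_output.replace(",", " ").split():
        if part.isdigit():
            i = int(part)
            if 0 <= i < len(results) and i not in order:
                order.append(i)
    order += [i for i in range(len(results)) if i not in order]
    return [results[i] for i in order]
```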
Query Understanding#
Query understanding transforms the raw user query into a better search query before retrieval.
Query Processing Pipeline#
Query Understanding:
Raw query: "how do I fix the thing that keeps crashing"
→ [Intent Classification] → troubleshooting
→ [Query Expansion] → "fix crash error resolution"
→ [Entity Extraction] → (no specific entity)
→ [Query Rewriting] → "troubleshoot application crash"
→ Enhanced query sent to retrieval
Techniques#
- Query expansion: Add synonyms and related terms
- Query rewriting: Use an LLM to reformulate vague queries
- Spell correction: Fix typos before search
- Intent classification: Route queries to specialized indexes
- Entity extraction: Identify product names, error codes, versions
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed that instead of the query
HyDE Pattern#
HyDE Flow:
Query: "Why does my Docker container keep restarting?"
→ LLM generates hypothetical answer:
"Docker containers restart due to OOM kills,
crash loops, or restart policy configuration..."
→ Embed the hypothetical answer
→ Search with that embedding
→ Better retrieval than embedding the question directly
Faceted Search + Semantic#
Traditional faceted search (filter by category, price, date) can be combined with semantic search for powerful filtered retrieval.
Architecture#
Faceted Semantic Search:
Query: "lightweight laptop" + Filters: {price: "$500-1000", brand: "Dell"}
→ [Pre-filter]: Apply facet filters to reduce candidate set
→ [Semantic Search]: Vector search within filtered set
→ [Post-filter]: Apply any remaining filters
→ Results
Pre-filter vs Post-filter#
- Pre-filter: Apply facets before vector search. Faster but requires indexed metadata in the vector store.
- Post-filter: Run vector search first, then filter. Simpler but may return fewer results than requested.
Most vector databases support metadata filtering during search, making pre-filter the preferred approach.
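The pre-filter pattern can be sketched with a brute-force scan; real vector databases do the same thing with indexed metadata so the filter and the ANN search run together.

```python
def prefiltered_search(query_vec, items, filters, k=5):
    """Pre-filter: narrow the candidate set by exact metadata match,
    then run vector search only over the survivors.
    `items` is a list of (vector, metadata_dict) pairs."""
    def matches(meta):
        return all(meta.get(key) == value for key, value in filters.items())

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    candidates = [(vec, meta) for vec, meta in items if matches(meta)]
    candidates.sort(key=lambda item: dot(query_vec, item[0]), reverse=True)
    return [meta for _, meta in candidates[:k]]

catalog = [
    ([1.0, 0.0], {"brand": "Dell", "name": "xps"}),
    ([0.9, 0.1], {"brand": "HP", "name": "envy"}),
    ([0.5, 0.5], {"brand": "Dell", "name": "inspiron"}),
]
hits = prefiltered_search([1.0, 0.0], catalog, {"brand": "Dell"}, k=1)
```

A post-filter version would sort first and filter after the top-k cut, which is why it can come back with fewer results than requested.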
Tools and Infrastructure#
Elasticsearch + Vectors#
Elasticsearch added dense vector support, making it a strong hybrid search platform.
Elasticsearch Hybrid Search:
- BM25 for keyword search (native)
- kNN for vector search (dense_vector field type)
- RRF fusion built-in
- Existing ecosystem: monitoring, scaling, security
Vespa#
Vespa is built for hybrid search from the ground up.
Vespa Capabilities:
- Native hybrid search (BM25 + ANN)
- Built-in reranking with ONNX models
- Real-time indexing
- Multi-phase ranking pipelines
- Handles billions of documents
Pinecone#
Pinecone is a managed vector database optimized for similarity search.
Pinecone Features:
- Fully managed, no infrastructure to run
- Metadata filtering during search
- Namespace isolation
- High throughput, low latency
- Sparse-dense vectors for hybrid search
Comparison#
| Feature | Elasticsearch | Vespa | Pinecone |
|---|---|---|---|
| Hybrid search | Yes (RRF) | Yes (native) | Yes (sparse-dense) |
| Self-hosted | Yes | Yes | No |
| Managed option | Yes (Elastic Cloud) | Yes (Vespa Cloud) | Yes (only) |
| Reranking | Plugin-based | Native ONNX | External |
| Scale | Billions of docs | Billions of docs | Billions of vectors |
RAG for Search#
Retrieval-Augmented Generation (RAG) combines search with LLM generation. Instead of returning a list of links, RAG returns a synthesized answer grounded in retrieved documents.
RAG Architecture#
RAG Search Pipeline:
Query
→ [Query Understanding]
→ [Hybrid Retrieval]
→ [Reranking]
→ Top-K Documents
→ [LLM Generation with Context]
→ Synthesized Answer + Citations
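The generation step boils down to prompt assembly: number the retrieved chunks so the model can cite them, and instruct it to stay inside the provided context. A minimal sketch (the exact prompt wording is an illustration, not a prescribed template):

```python
def build_rag_prompt(query, chunks):
    """Assemble a grounded prompt: each retrieved chunk gets a number
    so the LLM can cite sources as [1], [2], ... in its answer."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, 1))
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources inline as [n]. If the sources do not contain "
        "the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

The numbered labels are also what makes citation post-processing possible: any `[n]` in the answer can be checked against the chunks that were actually supplied.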
RAG Best Practices#
- Always cite sources: Include references to retrieved documents in the answer
- Chunk attribution: Track which chunks contributed to which parts of the answer
- Faithfulness checks: Verify the answer does not add information beyond the retrieved context
- Fallback to search results: If the LLM cannot synthesize a good answer, show traditional results
- Streaming: Stream the generated answer for better perceived latency
Common RAG Failures#
- Context window overflow: Too many retrieved chunks. Solution: better reranking and truncation.
- Lost in the middle: LLM ignores chunks in the middle of the context. Solution: put most relevant chunks first and last.
- Hallucinated citations: LLM invents source references. Solution: post-process citations against actual retrieved docs.
- Stale data: Embeddings are out of date. Solution: incremental re-indexing pipeline.
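The "lost in the middle" mitigation above (most relevant chunks first and last) is a simple reordering. One way to do it, assuming the input is already sorted best-first by the reranker:

```python
def order_for_context(chunks_by_relevance):
    """Alternate relevance-sorted chunks between the front and the back
    of the context window, so the strongest chunks sit at the edges and
    the weakest land in the middle, where LLMs attend least."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```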
Architecture Decision Guide#
| Use Case | Approach |
|---|---|
| Exact match (error codes, SKUs) | Keyword search |
| Natural language queries | Semantic search |
| Mixed queries | Hybrid search |
| High-precision requirements | Hybrid + reranking |
| Answer synthesis | RAG |
| Large catalog with filters | Faceted + semantic |
Start with hybrid search and RRF fusion. Add reranking when you need more precision. Add RAG when users want answers, not links.
Build smarter search at codelit.io.