AI-Powered Search Architecture: Semantic Search, Hybrid Search, and RAG
AI-Powered Search Architecture#
Traditional keyword search breaks down when users describe what they want instead of typing exact terms. AI-powered search matches on meaning, not just on the literal words. This guide covers the architecture patterns for building modern search systems.
The Search Evolution#
Search Generations:
Gen 1: Keyword Match
"running shoes" → matches documents containing "running" AND "shoes"
Gen 2: Keyword + Relevance Scoring
TF-IDF, BM25 → ranks by term frequency and importance
Gen 3: Semantic Search
"comfortable shoes for jogging" → matches "lightweight running sneakers"
Gen 4: Hybrid Search + AI
Combines keyword precision with semantic understanding + reranking
Most production systems today need Gen 4. Pure keyword search misses intent. Pure semantic search misses exact matches. Hybrid search gives you both.
Semantic Search#
Semantic search converts queries and documents into vector embeddings, then finds documents whose vectors are closest to the query vector.
How It Works#
Indexing:
Document → Embedding Model → Vector [0.12, -0.45, 0.78, ...] → Vector Index
Query:
Query → Embedding Model → Vector [0.11, -0.42, 0.80, ...]
→ Nearest Neighbor Search
→ Top-K Results
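The flow above can be sketched end to end in a few lines. This is a toy illustration: the `embed` function below is a hashed bag-of-words stand-in for a real embedding model (a production system would call something like text-embedding-3-small), and the index is a brute-force scan rather than an ANN structure.

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy stand-in for an embedding model: hashed bag-of-words,
    normalized to unit length so dot product equals cosine similarity."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class VectorIndex:
    def __init__(self):
        self.docs, self.vectors = [], []

    def add(self, doc):
        # Indexing: document -> vector -> stored in the index
        self.docs.append(doc)
        self.vectors.append(embed(doc))

    def search(self, query, k=3):
        # Query: embed the query, then nearest-neighbor search
        qv = embed(query)
        scored = sorted(
            ((dot(qv, v), d) for v, d in zip(self.vectors, self.docs)),
            key=lambda pair: pair[0], reverse=True,
        )
        return [d for _, d in scored[:k]]

index = VectorIndex()
for doc in ["lightweight running sneakers", "cast iron cookware",
            "waterproof hiking boots"]:
    index.add(doc)

results = index.search("running shoes", k=2)
```

Swapping the toy `embed` for a real model and the linear scan for an ANN index (HNSW, IVF) turns this sketch into the Gen 3 architecture described above.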
Embedding Models#
| Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | General purpose |
| OpenAI text-embedding-3-large | 3072 | High accuracy |
| Cohere embed-v3 | 1024 | Multilingual |
| BGE / GTE (open source) | 768-1024 | Self-hosted |
| Sentence Transformers | 384-768 | Lightweight |
Chunking Strategy#
Documents must be split into chunks before embedding. Chunk size directly impacts search quality.
Chunking Approaches:
Fixed size: Split every 512 tokens (simple but breaks context)
Sentence: Split on sentence boundaries (preserves meaning)
Paragraph: Split on paragraph boundaries (preserves topic)
Semantic: Split when topic shifts (best quality, highest cost)
Hierarchical: Multiple chunk sizes, search across all levels
Best practice: 256-512 tokens per chunk with 50-100 token overlap between chunks.
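The fixed-size-with-overlap approach can be sketched as follows. Here a "token" is simply a list element (words would stand in for real tokenizer output); the sizes are parameters you would tune per the guidance above.

```python
def chunk(tokens, size=512, overlap=64):
    """Fixed-size chunking with overlap: each chunk shares its last
    `overlap` tokens with the start of the next chunk, so context that
    straddles a boundary appears in at least one chunk intact."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last chunk already covers the tail
    return chunks
```

Sentence- and paragraph-based strategies replace the fixed `step` with boundary detection, but the overlap idea carries over.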
Distance Metrics#
- Cosine similarity: Most common, works well for normalized embeddings
- Dot product: Faster, works when magnitude matters
- Euclidean distance: Useful when absolute position in vector space matters
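For reference, the three metrics in plain Python. Note that for unit-normalized vectors they agree: cosine equals dot product, and squared Euclidean distance is `2 - 2 * cosine`, so ranking by any of them gives the same order.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Dot product scaled by both magnitudes; range [-1, 1]
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Absolute distance in the vector space; lower is closer
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```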
Hybrid Search#
Hybrid search combines keyword search (BM25) with semantic search (vectors) to get the best of both approaches.
Why Hybrid Wins#
Query: "error code 0x80070005"
Keyword search: Exact match on error code → High precision
Semantic search: Might miss the exact code → Lower precision
Query: "my app crashes when I try to save files"
Keyword search: Matches "crash" and "save" → Misses related docs
Semantic search: Understands intent → Finds "file write permission error"
Hybrid: Gets both right.
Fusion Strategies#
Combining scores from keyword and semantic search requires a fusion strategy.
Fusion Methods:
1. Linear combination:
score = alpha * bm25_score + (1 - alpha) * vector_score
(alpha typically 0.3-0.7, tune on your data)
2. Reciprocal Rank Fusion (RRF):
score = sum(1 / (k + rank_i)) for each retrieval system
(k typically 60, no score normalization needed)
3. Learned fusion:
Train a model to combine scores based on query type
RRF is the most popular because it does not require score normalization and works well out of the box.
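RRF is simple enough to show in full. Each retrieval system contributes `1 / (k + rank)` per document (ranks 1-based here; some implementations use 0-based, which only shifts the constant), and documents found by multiple systems accumulate score.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion. `rankings` is a list of result-id lists,
    each ordered best-first. Returns fused ids, best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["d3", "d1", "d7"]   # keyword retrieval, best-first
vector_results = ["d1", "d5", "d3"]  # semantic retrieval, best-first
fused = rrf([bm25_results, vector_results])
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale, which is exactly why RRF works out of the box.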
Architecture#
Hybrid Search Architecture:
Query
→ [Query Processor]
→ ┌─ [BM25 Index] → keyword results
→ └─ [Vector Index] → semantic results
→ [Fusion Layer (RRF)]
→ [Reranker]
→ Final Results
Reranking#
Reranking is a second-stage model that takes the top results from retrieval and re-scores them with a more powerful (and expensive) model.
Why Rerank#
- Retrieval models (BM25, bi-encoder embeddings) are fast but approximate
- Reranking models (cross-encoders) are slow but accurate
- The two-stage pipeline gives you speed and quality
Two-Stage Pipeline:
Stage 1 (Retrieval): Fast, process millions of documents
→ Return top 100-200 candidates
Stage 2 (Reranking): Slow, process only candidates
→ Return top 10-20 final results
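The two-stage pipeline reduces to: rank everything with a cheap scorer, then re-score only the survivors with an expensive one. The scorers below are toy stand-ins (word overlap for BM25, Jaccard for a cross-encoder) just to make the sketch runnable.

```python
def word_overlap(query, doc):
    """Fast stage-1 scorer (stand-in for BM25 or a bi-encoder)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def jaccard(query, doc):
    """Slower, more discriminating stage-2 scorer (stand-in for a
    cross-encoder reranker)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

def two_stage_search(query, docs, cheap_score, expensive_score,
                     n_candidates=100, k=10):
    # Stage 1: score the whole corpus cheaply, keep top candidates
    candidates = sorted(docs, key=lambda d: cheap_score(query, d),
                        reverse=True)[:n_candidates]
    # Stage 2: re-score only the candidates with the expensive model
    return sorted(candidates, key=lambda d: expensive_score(query, d),
                  reverse=True)[:k]

docs = ["install the driver and reboot", "reboot the router", "bake a cake"]
top = two_stage_search("how to reboot the router", docs,
                       word_overlap, jaccard, n_candidates=2, k=1)
```

The key property: the expensive scorer runs on `n_candidates` documents, not on the whole corpus, which is what keeps latency bounded.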
Reranking Models#
| Model | Type | Latency |
|---|---|---|
| Cohere Rerank | API | Low |
| Jina Reranker | API / Self-hosted | Low |
| BGE Reranker | Open source | Medium |
| Cross-encoder (custom) | Self-hosted | Medium |
| LLM-based reranking | API | High |
LLM-Based Reranking#
You can use an LLM to rerank by asking it to score relevance or sort results. This is expensive but powerful for complex queries.
LLM Reranking Prompt:
"Given the query and these 20 search results,
rank them by relevance. Return the indices
in order from most to least relevant."
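A prompt like the one above returns free-text indices, so the response needs defensive parsing: the LLM may invent indices, repeat them, or drop some. A minimal sketch of that post-processing step:

```python
def apply_llm_ranking(results, llm_output):
    """Reorder `results` according to an index list the LLM returned,
    e.g. "2, 0, 1". Invalid or duplicate indices are ignored, and any
    results the LLM omitted are appended in their original order."""
    order = []
    for part in llm_output.replace(",", " ").split():
        if part.isdigit():
            i = int(part)
            if 0 <= i < len(results) and i not in order:
                order.append(i)
    order += [i for i in range(len(results)) if i not in order]
    return [results[i] for i in order]
```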
Query Understanding#
Query understanding transforms the raw user query into a better search query before retrieval.
Query Processing Pipeline#
Query Understanding:
Raw query: "how do I fix the thing that keeps crashing"
→ [Intent Classification] → troubleshooting
→ [Query Expansion] → "fix crash error resolution"
→ [Entity Extraction] → (no specific entity)
→ [Query Rewriting] → "troubleshoot application crash"
→ Enhanced query sent to retrieval
Techniques#
- Query expansion: Add synonyms and related terms
- Query rewriting: Use an LLM to reformulate vague queries
- Spell correction: Fix typos before search
- Intent classification: Route queries to specialized indexes
- Entity extraction: Identify product names, error codes, versions
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed that instead of the query
HyDE Pattern#
HyDE Flow:
Query: "Why does my Docker container keep restarting?"
→ LLM generates hypothetical answer:
"Docker containers restart due to OOM kills,
crash loops, or restart policy configuration..."
→ Embed the hypothetical answer
→ Search with that embedding
→ Better retrieval than embedding the question directly
Faceted Search + Semantic#
Traditional faceted search (filter by category, price, date) can be combined with semantic search for powerful filtered retrieval.
Architecture#
Faceted Semantic Search:
Query: "lightweight laptop" + Filters: {price: "$500-1000", brand: "Dell"}
→ [Pre-filter]: Apply facet filters to reduce candidate set
→ [Semantic Search]: Vector search within filtered set
→ [Post-filter]: Apply any remaining filters
→ Results
Pre-filter vs Post-filter#
- Pre-filter: Apply facets before vector search. Faster but requires indexed metadata in the vector store.
- Post-filter: Run vector search first, then filter. Simpler but may return fewer results than requested.
Most vector databases support metadata filtering during search, making pre-filter the preferred approach.
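The pre-filter pattern can be sketched with a brute-force scan; real vector databases do the same thing with indexed metadata so the filter and the ANN search run together.

```python
def prefiltered_search(query_vec, items, filters, k=5):
    """Pre-filter: narrow the candidate set by exact metadata match,
    then run vector search only over the survivors.
    `items` is a list of (vector, metadata_dict) pairs."""
    def matches(meta):
        return all(meta.get(key) == value for key, value in filters.items())

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    candidates = [(vec, meta) for vec, meta in items if matches(meta)]
    candidates.sort(key=lambda item: dot(query_vec, item[0]), reverse=True)
    return [meta for _, meta in candidates[:k]]

catalog = [
    ([1.0, 0.0], {"brand": "Dell", "name": "xps"}),
    ([0.9, 0.1], {"brand": "HP", "name": "envy"}),
    ([0.5, 0.5], {"brand": "Dell", "name": "inspiron"}),
]
hits = prefiltered_search([1.0, 0.0], catalog, {"brand": "Dell"}, k=1)
```

A post-filter version would sort first and filter after the top-k cut, which is why it can come back with fewer results than requested.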
Tools and Infrastructure#
Elasticsearch + Vectors#
Elasticsearch added dense vector support, making it a strong hybrid search platform.
Elasticsearch Hybrid Search:
- BM25 for keyword search (native)
- kNN for vector search (dense_vector field type)
- RRF fusion built-in
- Existing ecosystem: monitoring, scaling, security
Vespa#
Vespa is built for hybrid search from the ground up.
Vespa Capabilities:
- Native hybrid search (BM25 + ANN)
- Built-in reranking with ONNX models
- Real-time indexing
- Multi-phase ranking pipelines
- Handles billions of documents
Pinecone#
Pinecone is a managed vector database optimized for similarity search.
Pinecone Features:
- Fully managed, no infrastructure to run
- Metadata filtering during search
- Namespace isolation
- High throughput, low latency
- Sparse-dense vectors for hybrid search
Comparison#
| Feature | Elasticsearch | Vespa | Pinecone |
|---|---|---|---|
| Hybrid search | Yes (RRF) | Yes (native) | Yes (sparse-dense) |
| Self-hosted | Yes | Yes | No |
| Managed option | Yes (Elastic Cloud) | Yes (Vespa Cloud) | Yes (only) |
| Reranking | Plugin-based | Native ONNX | External |
| Scale | Billions of docs | Billions of docs | Billions of vectors |
RAG for Search#
Retrieval-Augmented Generation (RAG) combines search with LLM generation. Instead of returning a list of links, RAG returns a synthesized answer grounded in retrieved documents.
RAG Architecture#
RAG Search Pipeline:
Query
→ [Query Understanding]
→ [Hybrid Retrieval]
→ [Reranking]
→ Top-K Documents
→ [LLM Generation with Context]
→ Synthesized Answer + Citations
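The generation step boils down to prompt assembly: number the retrieved chunks so the model can cite them, and instruct it to stay inside the provided context. A minimal sketch (the exact prompt wording is an illustration, not a prescribed template):

```python
def build_rag_prompt(query, chunks):
    """Assemble a grounded prompt: each retrieved chunk gets a number
    so the LLM can cite sources as [1], [2], ... in its answer."""
    context = "\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, 1))
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources inline as [n]. If the sources do not contain "
        "the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

The numbered labels are also what makes citation post-processing possible: any `[n]` in the answer can be checked against the chunks that were actually supplied.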
RAG Best Practices#
- Always cite sources: Include references to retrieved documents in the answer
- Chunk attribution: Track which chunks contributed to which parts of the answer
- Faithfulness checks: Verify the answer does not add information beyond the retrieved context
- Fallback to search results: If the LLM cannot synthesize a good answer, show traditional results
- Streaming: Stream the generated answer for better perceived latency
Common RAG Failures#
- Context window overflow: Too many retrieved chunks. Solution: better reranking and truncation.
- Lost in the middle: LLM ignores chunks in the middle of the context. Solution: put most relevant chunks first and last.
- Hallucinated citations: LLM invents source references. Solution: post-process citations against actual retrieved docs.
- Stale data: Embeddings are out of date. Solution: incremental re-indexing pipeline.
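The "lost in the middle" mitigation above (most relevant chunks first and last) is a simple reordering. One way to do it, assuming the input is already sorted best-first by the reranker:

```python
def order_for_context(chunks_by_relevance):
    """Alternate relevance-sorted chunks between the front and the back
    of the context window, so the strongest chunks sit at the edges and
    the weakest land in the middle, where LLMs attend least."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```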
Architecture Decision Guide#
| Use Case | Approach |
|---|---|
| Exact match (error codes, SKUs) | Keyword search |
| Natural language queries | Semantic search |
| Mixed queries | Hybrid search |
| High-precision requirements | Hybrid + reranking |
| Answer synthesis | RAG |
| Large catalog with filters | Faceted + semantic |
Start with hybrid search and RRF fusion. Add reranking when you need more precision. Add RAG when users want answers, not links.
Build smarter search at codelit.io.