# Search Engine Architecture: How Full-Text Search Really Works
Every time a user types a query and gets results in milliseconds, a sophisticated pipeline of crawling, indexing, and ranking is at work. Understanding search engine architecture is essential for building fast, relevant search experiences at scale.
## How Search Works: Crawl, Index, Rank
All search systems follow three core phases:
- Crawl — Discover and fetch content (documents, pages, records).
- Index — Analyze and store content in a structure optimized for retrieval.
- Rank — Score and order results by relevance to the query.
Web search engines crawl billions of pages. Internal search systems ingest database records, product catalogs, or log entries. The architecture is the same.
## The Inverted Index
The inverted index is the foundational data structure behind full-text search. Instead of mapping documents to words, it maps each term to the list of documents containing it:
```
"kubernetes"   → [doc_3, doc_17, doc_42]
"architecture" → [doc_3, doc_8, doc_17]
"search"       → [doc_1, doc_3, doc_8, doc_42]
```
A query for "kubernetes architecture" intersects the two posting lists to find doc_3 and doc_17. Because query cost scales with the length of the posting lists involved, not with total corpus size, engines can answer in milliseconds even over billions of documents.
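The intersection step can be sketched in a few lines of Python. The corpus below mirrors the posting lists above; the doc IDs and contents are illustrative, not a production structure:

```python
from collections import defaultdict

# Toy corpus mirroring the posting lists above; doc IDs are illustrative.
docs = {
    1: "search basics",
    3: "kubernetes architecture search",
    8: "architecture patterns search",
    17: "kubernetes architecture",
    42: "kubernetes search",
}

# Inverted index: term -> set of doc IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def query(*terms):
    """Intersect posting lists, smallest first, to keep the work minimal."""
    postings = sorted((index.get(t, set()) for t in terms), key=len)
    if not postings:
        return []
    result = set(postings[0])
    for p in postings[1:]:
        result &= p
    return sorted(result)

print(query("kubernetes", "architecture"))  # → [3, 17]
```

Real engines store posting lists as compressed, sorted arrays on disk and use skip pointers for faster intersection, but the logic is the same.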
## Tokenization, Stemming, and Analyzers
Before text enters the index, it passes through an analysis pipeline:
- Tokenizer — splits text into tokens ("full-text search" → ["full", "text", "search"]).
- Lowercasing — normalizes case.
- Stop-word removal — drops common words like "the", "is", "and".
- Stemming / Lemmatization — reduces words to roots ("running" → "run").
In Elasticsearch architecture, this is configured per field:
```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "blog_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "blog_analyzer" },
      "body": { "type": "text", "analyzer": "blog_analyzer" },
      "slug": { "type": "keyword" }
    }
  }
}
```
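A minimal Python sketch of the same pipeline. The stop-word list and suffix-stripping rules here are invented for illustration; real analyzers use full Snowball/Porter stemmers or dictionary-based lemmatizers:

```python
import re

STOP_WORDS = {"the", "is", "and", "a", "of"}  # tiny illustrative list

def naive_stem(token):
    # Crude suffix stripping for illustration only; real analyzers use
    # Snowball/Porter stemmers or dictionary-based lemmatizers.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            if len(stem) > 2 and stem[-1] == stem[-2]:  # "runn" -> "run"
                stem = stem[:-1]
            return stem
    return token

def analyze(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # tokenize + lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [naive_stem(t) for t in tokens]               # stemming

print(analyze("Running a full-text search"))  # → ['run', 'full', 'text', 'search']
```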
## Relevance Scoring: TF-IDF and BM25
Search ranking determines result order. Two models dominate:
### TF-IDF
- Term Frequency (TF) — how often a term appears in a document.
- Inverse Document Frequency (IDF) — how rare the term is across all documents.
- Score = TF × IDF. Rare terms in a document get higher weight.
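A toy computation over an invented three-document corpus shows the effect. This uses the classic log-based IDF; production engines use smoothed variants:

```python
import math

# Invented three-document corpus, pre-tokenized.
corpus = [
    "rare kubernetes term appears here".split(),
    "common search common search common".split(),
    "search engines rank documents".split(),
]
N = len(corpus)

def tf(term, doc):
    return doc.count(term) / len(doc)  # term frequency, length-normalized

def idf(term):
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    return math.log(N / df) if df else 0.0

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "kubernetes" appears in 1 of 3 docs (high IDF); "search" in 2 of 3 (lower).
print(tf_idf("kubernetes", corpus[0]) > tf_idf("search", corpus[1]))  # → True
```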
### BM25
BM25 is the default in Elasticsearch and Solr. It improves on TF-IDF with saturation (diminishing returns for repeated terms) and document-length normalization:
```
score(D, Q) = Σ IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))
```
Typical defaults: k1 = 1.2, b = 0.75. You rarely need to tune these unless your documents vary wildly in length.
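The formula translates almost directly into Python. This sketch uses a smoothed IDF similar to Lucene's, and the corpus is invented to show the saturation behavior:

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against a query, BM25-style.
    corpus is a list of token lists; IDF is smoothed to stay positive."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N  # average document length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = doc.count(term)  # term frequency in this document
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "search engine architecture".split(),
    "search search search search search".split(),  # term repetition
    "distributed systems design".split(),
]

# Saturation: five occurrences of "search" score well under 5x one occurrence.
once = bm25_score(["search"], corpus[0], corpus)
five = bm25_score(["search"], corpus[1], corpus)
print(five > once, five < 5 * once)  # → True True
```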
## Faceted Search
Faceted search lets users filter results by categories — price range, brand, date, status. It requires maintaining aggregation-friendly data alongside the inverted index.
```json
{
  "query": { "match": { "body": "search engine architecture" } },
  "aggs": {
    "by_category": { "terms": { "field": "category.keyword" } },
    "by_year": { "date_histogram": { "field": "date", "calendar_interval": "year" } }
  }
}
```
Facets are computed in a single pass during query execution — no extra round trip.
## Autocomplete and Typeahead
Fast autocomplete requires specialized data structures:
- Prefix queries on keyword fields — simple but limited.
- Edge n-gram tokenizer — indexes prefixes at write time ("arch" → ["a", "ar", "arc", "arch"]).
- Completion suggester (Elasticsearch) — uses an in-memory FST for sub-millisecond suggestions.
```json
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion"
      }
    }
  }
}
```
Query with:
```json
{
  "suggest": {
    "title-suggest": {
      "prefix": "searc",
      "completion": { "field": "suggest", "size": 5 }
    }
  }
}
```
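The edge n-gram expansion from the bullet list above can be sketched directly. This is a toy illustration, not the tokenizer's actual implementation:

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Prefixes an edge n-gram tokenizer would emit at write time."""
    return [token[:i] for i in range(min_gram, min(len(token), max_gram) + 1)]

print(edge_ngrams("arch"))  # → ['a', 'ar', 'arc', 'arch']

# At query time a prefix like "searc" becomes an exact term lookup against
# the indexed n-grams, so no wildcard scan is needed.
print("searc" in set(edge_ngrams("search")))  # → True
```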
## Distributed Search: Shards and Replicas
At scale, a single node cannot hold the entire index. Distributed search splits data across shards and copies them as replicas:
| Concept | Purpose |
|---|---|
| Primary shard | Holds a partition of the index |
| Replica shard | Copy of a primary for fault tolerance and read throughput |
| Coordinator node | Receives the query, fans it out, merges results |
A query against a 5-shard index runs in parallel on all 5 shards. The coordinator merges the top-N results — a scatter-gather pattern.
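The gather step reduces to a k-way merge of per-shard top-N lists. Here is a minimal sketch with invented scores and doc IDs:

```python
import heapq

# Hypothetical per-shard results: (score, doc_id) pairs, sorted by
# descending score, as each shard would return its local top N.
shard_results = [
    [(9.1, "doc_42"), (7.3, "doc_17")],
    [(8.6, "doc_3"), (2.2, "doc_99")],
    [(5.0, "doc_8")],
]

def coordinate(shards, n):
    """Merge per-shard top-N lists into a global top N (the gather step)."""
    merged = heapq.merge(*shards, reverse=True)  # k-way merge, highest first
    return [hit for _, hit in zip(range(n), merged)]

print(coordinate(shard_results, 3))
# → [(9.1, 'doc_42'), (8.6, 'doc_3'), (7.3, 'doc_17')]
```

Because each shard only ships its local top N, the coordinator merges a few small sorted lists rather than re-scoring the whole corpus.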
Shard sizing rule of thumb: 10–50 GB per shard. Too many small shards add per-shard overhead (memory, file handles, cluster state); too few large shards slow queries and recovery.
## Solr vs Elasticsearch
Both are built on Apache Lucene. Key differences in the Solr vs Elasticsearch debate:
| Aspect | Elasticsearch | Solr |
|---|---|---|
| Config | REST API, JSON | XML config files |
| Cluster management | Built-in | Requires ZooKeeper |
| Real-time indexing | Near real-time by default | Requires soft commits |
| Analytics | Strong (aggregations) | Comparable (facets, pivots) |
| Community | Larger ecosystem | Mature, stable |
For new projects, Elasticsearch (or OpenSearch) is the more common choice.
## Modern Search Tools
The landscape has expanded beyond Lucene-based engines:
- Elasticsearch / OpenSearch — the industry standard for log analytics and full-text search. OpenSearch is the Apache-licensed fork.
- Meilisearch — Rust-based, typo-tolerant, instant search. Great for product catalogs and documentation.
- Typesense — C++-based, easy to operate, built-in typo tolerance and geo-search.
- Algolia — hosted search-as-a-service with excellent frontend SDKs. Higher cost at scale.
### Quick Typesense example
```sh
# Create collection (default_sorting_field is optional and omitted here)
curl -X POST 'http://localhost:8108/collections' \
  -H 'X-TYPESENSE-API-KEY: xyz' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "articles",
    "fields": [
      { "name": "title", "type": "string" },
      { "name": "body", "type": "string" },
      { "name": "tags", "type": "string[]", "facet": true }
    ]
  }'

# Search with typo tolerance
curl 'http://localhost:8108/collections/articles/documents/search?q=elastcsearch&query_by=title,body&facet_by=tags' \
  -H 'X-TYPESENSE-API-KEY: xyz'
```
## Search Architecture Patterns
### Pattern 1: Dual-Write (Simple)
Application writes to both the primary database and the search index. Risk: inconsistency if one write fails.
### Pattern 2: Change Data Capture (Robust)
A CDC pipeline (Debezium, DynamoDB Streams) tails the database log and pushes changes to the search index. Guarantees eventual consistency.
### Pattern 3: Event-Driven
Producers emit domain events. A search indexer consumer processes events and updates the index. Decoupled and scalable.
```
[App] → [Kafka / SQS] → [Indexer Service] → [Elasticsearch]
              ↑
      CDC from Postgres
```
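A minimal sketch of the indexer consumer, with an in-process queue standing in for Kafka/SQS. The event shape and field names are hypothetical:

```python
from queue import Queue

# In-process stand-in for Kafka/SQS; the event shape is hypothetical.
events = Queue()
events.put({"op": "upsert", "id": "doc_1", "body": "search engine architecture"})
events.put({"op": "upsert", "id": "doc_1", "body": "search engine architecture v2"})
events.put({"op": "delete", "id": "doc_2"})

search_index = {"doc_2": "stale entry"}  # stand-in for the search engine

def run_indexer(events, index):
    """Consume events and apply them to the index. Upserts keyed by doc ID
    make reprocessing the same event idempotent."""
    while not events.empty():
        event = events.get()
        if event["op"] == "upsert":
            index[event["id"]] = event["body"]
        elif event["op"] == "delete":
            index.pop(event["id"], None)

run_indexer(events, search_index)
print(search_index)  # → {'doc_1': 'search engine architecture v2'}
```

Keying writes by document ID is what makes at-least-once delivery safe: replaying an event just rewrites the same document.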
## Key Takeaways
- The inverted index is the core of all full-text search.
- BM25 handles relevance scoring well out of the box — tune analyzers before touching scoring parameters.
- Shard carefully: over-sharding is the most common Elasticsearch mistake.
- Use CDC or event-driven patterns to keep search indexes in sync.
- Evaluate Meilisearch and Typesense for simpler use cases — they reduce operational burden significantly.
Search is one of those systems that looks simple on the surface but rewards deep architectural understanding. Get it right, and users never think about it. Get it wrong, and they leave.