Autocomplete System Design: Building a Typeahead Search Engine
When you type a few characters into Google, Amazon, or Spotify and instantly see relevant suggestions, you're interacting with an autocomplete system. Designing one that serves millions of queries per second with sub-50ms latency is a classic system design challenge. This guide breaks down every layer — from the trie data structure to distributed caching and fuzzy matching.
Core Data Structure: The Trie#
A trie (prefix tree) is the foundational data structure for autocomplete. Each node represents a character, and paths from root to leaf spell out stored strings.
This trie stores "help", "search", and "buy":

```
        root
       /  |  \
      h   s   b
      |   |   |
      e   e   u
      |   |   |
      l   a   y
      |   |
      p   r
          |
          c
          |
          h
```
Why Tries Over Hash Maps?#
- Prefix matching is O(L) where L is the prefix length — no need to scan every entry.
- Memory can be optimized with compressed tries (radix trees) that merge single-child chains.
- Naturally supports ranked traversal when nodes store aggregated scores.
Node Structure#
Each trie node typically stores:
- A map of child characters to child nodes
- A boolean indicating whether this node completes a valid term
- A score or frequency counter for ranking
- Optionally, a precomputed list of top-K suggestions from this prefix
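The node fields above can be sketched as a minimal Python trie. The class and method names here (`TrieNode`, `Trie.insert`, `Trie.suggestions`) are illustrative, not from any particular library:

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_term = False  # marks a complete stored term
        self.score = 0        # frequency counter for ranking
        self.top_k = []       # optional precomputed suggestions

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, term, score=1):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
        node.is_term = True
        node.score += score

    def suggestions(self, prefix):
        """Collect all terms under `prefix`, highest score first."""
        node = self.root
        for ch in prefix:  # O(L) descent, L = len(prefix)
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []

        def dfs(n, path):
            if n.is_term:
                results.append((path, n.score))
            for ch, child in n.children.items():
                dfs(child, path + ch)

        dfs(node, prefix)
        return [t for t, _ in sorted(results, key=lambda x: -x[1])]
```

After `insert("search", 5)` and `insert("sea", 2)`, `suggestions("se")` returns `["search", "sea"]` — the O(L) descent plus a subtree walk, which the top-K caching below replaces with an O(1) lookup.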
Ranking Suggestions#
Raw prefix matching returns too many results. Ranking narrows them to the most useful.
Frequency-Based Ranking#
Count how often each query has been searched. Store cumulative frequency at each node and propagate scores upward so parent nodes can quickly return the highest-scoring completions.
Recency Weighting#
Apply a time-decay function so recent queries rank higher:
score = raw_frequency * decay_factor ^ (now - last_seen)
A common decay factor is 0.99 per hour, which halves a score roughly every three days.
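A minimal sketch of that decay formula, with time measured in hours as in the text:

```python
def decayed_score(raw_frequency, hours_since_seen, decay_factor=0.99):
    """Time-decayed score: recent queries keep most of their weight."""
    return raw_frequency * decay_factor ** hours_since_seen
```

With `decay_factor=0.99`, a score drops to half after about 69 hours (`log 0.5 / log 0.99`), matching the "roughly every three days" figure.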
Personalization#
Layer user-specific history on top of global rankings. A simple approach:
final_score = alpha * global_score + (1 - alpha) * user_score
Where alpha is typically 0.7, favoring global trends but boosting personal relevance.
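The blend is a one-liner; this sketch just mirrors the formula above:

```python
def final_score(global_score, user_score, alpha=0.7):
    """Linear blend of global and per-user scores; alpha=0.7 favors global trends."""
    return alpha * global_score + (1 - alpha) * user_score
```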
Top-K Caching at Each Node#
Recomputing rankings on every keystroke is expensive. Instead, precompute and cache the top-K (e.g., top 10) suggestions at every trie node. When a user types a prefix, the system returns the cached list directly — O(1) lookup after reaching the node.
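One way to build those per-prefix caches offline is a single pass over the term set, pushing each term into a bounded heap at every prefix it passes through. This is an illustrative sketch (k=3 here rather than the top 10 mentioned above):

```python
import heapq

def build_topk_cache(terms, k=3):
    """terms: dict of term -> score. Returns prefix -> top-k terms by score."""
    cache = {}
    for term, score in terms.items():
        for i in range(1, len(term) + 1):
            heap = cache.setdefault(term[:i], [])
            heapq.heappush(heap, (score, term))
            if len(heap) > k:
                heapq.heappop(heap)  # drop the lowest-scoring entry
    # Final pass: sort each cached list highest score first.
    return {p: [t for _, t in sorted(h, reverse=True)] for p, h in cache.items()}
```

At query time the server descends to the prefix node and returns the cached list as-is — no subtree traversal, no sorting.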
Cache Invalidation#
Rebuild top-K lists periodically (e.g., every 15 minutes) using an offline MapReduce job that processes query logs and updates the trie.
Real-Time Updates#
For trending queries (breaking news, viral events), waiting 15 minutes is too slow.
Approach: Dual-Layer Architecture#
- Base trie — rebuilt periodically from historical data.
- Real-time trie — an in-memory trie that ingests the latest query stream via Kafka. Results from both layers are merged at query time.
This avoids constant rebuilds of the full trie while keeping suggestions fresh.
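The query-time merge can be as simple as taking the maximum score when a term appears in both layers. The layer contents below are hypothetical:

```python
def merge_layers(base, realtime, k=10):
    """base, realtime: dict of term -> score. Returns the merged top-k terms."""
    merged = dict(base)
    for term, score in realtime.items():
        merged[term] = max(merged.get(term, 0), score)
    return sorted(merged, key=merged.get, reverse=True)[:k]
```

A trending query that only exists in the real-time layer can outrank long-standing entries in the base trie, which is exactly the freshness property the dual-layer design is after.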
Distributed Trie#
A single-server trie won't scale to billions of terms. Two sharding strategies:
Range-Based Sharding#
Partition by first character (or first two characters). Server A handles a–f, Server B handles g–m, etc. Simple but can create hotspots — queries starting with "s" are far more common than "x."
Hash-Based Sharding#
Hash the prefix and distribute across servers. Balances load more evenly but makes range queries harder. A consistent hashing ring handles node additions and removals gracefully.
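A minimal consistent-hashing sketch: each server is placed at many points on a ring (virtual nodes smooth the load), and a prefix routes to the first server clockwise from its hash. Server names are placeholders:

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, servers, vnodes=100):
        # Place each server at `vnodes` points on the ring.
        self.ring = sorted(
            (self._hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def server_for(self, prefix):
        # First ring position at or after the prefix hash, wrapping around.
        i = bisect.bisect(self.keys, self._hash(prefix)) % len(self.ring)
        return self.ring[i][1]
```

Adding or removing a server only remaps the keys adjacent to its ring positions, rather than reshuffling everything as a plain `hash(prefix) % N` scheme would.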
Replication#
Each shard is replicated 3x for fault tolerance. Reads go to the nearest replica; writes go to the leader.
Multi-Language Support#
Autocomplete must handle diverse scripts and input methods.
- Unicode normalization — apply NFC normalization so that equivalent character sequences match.
- Transliteration — map romanized input (e.g., "beijing") to native script (北京).
- CJK tokenization — Chinese, Japanese, and Korean lack word boundaries. Use n-gram indexing or dictionary-based segmentation.
- Right-to-left scripts — ensure prefix matching works correctly for Arabic and Hebrew.
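NFC normalization is one line with Python's standard library. For example, "é" typed as `e` plus a combining acute accent should match the precomposed character:

```python
import unicodedata

def normalize(query):
    """NFC composes equivalent character sequences into a canonical form."""
    return unicodedata.normalize("NFC", query)
```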
Fuzzy Matching#
Users make typos. A strict prefix match for "amazn" returns nothing.
Edit Distance#
Compute Levenshtein distance between the query and stored terms. Terms within distance 1–2 are included as suggestions. This is expensive at scale, so limit fuzzy matching to short prefixes or use BK-trees.
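The textbook dynamic-programming version is a useful reference point — fast enough for short prefixes, but too slow to run against every stored term, which is why the text suggests BK-trees at scale:

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                 # deletion
                curr[j - 1] + 1,             # insertion
                prev[j - 1] + (ca != cb),    # substitution (free on match)
            ))
        prev = curr
    return prev[-1]
```

`levenshtein("amazn", "amazon")` is 1 (one missing letter), so "amazon" survives a distance-1 fuzzy filter.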
Phonetic Matching#
Algorithms like Soundex or Metaphone group words that sound alike, helping with misspellings that preserve pronunciation.
Combining Approaches#
Elasticsearch's completion suggester supports fuzzy matching out of the box:
```json
{
  "suggest": {
    "song-suggest": {
      "prefix": "amazn",
      "completion": {
        "field": "suggest",
        "fuzzy": { "fuzziness": 2 }
      }
    }
  }
}
```
Tools and Technologies#
Elasticsearch Completion Suggester#
Uses an in-memory FST (finite state transducer) for fast prefix lookups. Supports weights, fuzzy matching, and context-based filtering. Ideal for medium-scale deployments.
Algolia#
A hosted search-as-a-service with built-in typo tolerance, geo-aware ranking, and sub-10ms responses. Great when you want autocomplete without managing infrastructure.
Redis + Sorted Sets#
Store terms in sorted sets scored by frequency. Use ZRANGEBYLEX for prefix queries and ZREVRANGEBYSCORE for top-K retrieval. Works well for moderate-scale systems with simple ranking.
Custom Trie Service#
For Google-scale systems, a custom trie service built in C++ or Rust gives maximum control over memory layout, sharding, and ranking logic.
Putting It All Together#
A production autocomplete system at scale looks like this:
- Query logs flow into Kafka.
- A batch pipeline (Spark/MapReduce) rebuilds the base trie every 15 minutes.
- A streaming pipeline (Flink/Kafka Streams) updates the real-time trie.
- Trie servers are sharded by prefix range and replicated 3x.
- A CDN/edge cache stores top-K results for the most popular prefixes.
- The API gateway merges base + real-time results, applies personalization, and returns top 10 suggestions.
Latency Budget#
| Component | Target |
|---|---|
| CDN cache hit | < 5ms |
| Trie lookup | < 10ms |
| Merge + rank | < 5ms |
| Network (edge) | < 30ms |
| Total | < 50ms |
Key Takeaways#
- Tries are the backbone — compressed tries save memory, cached top-K lists save computation.
- Ranking combines frequency, recency, and personalization.
- A dual-layer architecture (batch + streaming) keeps suggestions fresh without constant rebuilds.
- Distributed sharding and replication are essential at scale.
- Fuzzy matching and multi-language support make the system robust for real users.
Autocomplete is deceptively simple on the surface but deeply complex underneath. Mastering its design demonstrates fluency in data structures, distributed systems, caching, and ranking — exactly what interviewers look for.
This is article #183 in the Codelit engineering blog series.