# Key-Value Store Design: From Hash Tables to Dynamo
A key-value store is the simplest and most versatile distributed data system. You give it a key, it gives you a value. Behind that simple interface lies a rich design space covering hashing, partitioning, replication, durability, and compaction. Understanding these internals is essential for any system design interview — and for choosing the right store in production.
## Hash Table Internals
At its core, every key-value store is a hash table. A hash function maps each key to a slot in an array. Two design decisions dominate:
- Hash function choice — A good function distributes keys uniformly. MurmurHash3 and xxHash are popular for in-memory stores; cryptographic hashes like SHA-256 are unnecessary and slow.
- Collision resolution — Chaining (linked lists per bucket) or open addressing (probing for the next empty slot). Most production stores use chaining because deletes are simpler.
For an in-memory store like Redis, a well-tuned hash table with incremental rehashing is the entire storage engine.
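The chaining approach can be sketched in a few lines. This toy map (not Redis's actual dict implementation) shows why deletes stay simple with chaining and where rehashing hooks in:

```python
# Minimal chained hash table sketch: each bucket is a list of (key, value)
# pairs, and the table doubles in size when the load factor exceeds 2.
class ChainedHashMap:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # overwrite existing key
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > 2 * len(self.buckets):  # load factor > 2: grow
            self._rehash()

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)                  # chained delete: just drop the pair
                self.size -= 1
                return True
        return False

    def _rehash(self):
        old = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k, v in old:
            self._bucket(k).append((k, v))
```

Redis's incremental rehashing spreads the `_rehash` work across many subsequent operations instead of doing it in one stop-the-world pass, keeping latency predictable.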
## Consistent Hashing
When you shard data across multiple nodes, naive modular hashing (`hash(key) % N`) forces a full reshuffle every time a node is added or removed. Consistent hashing solves this:
```
       Node A          Node B          Node C
──────────●───────────────●───────────────●──────────▶ hash ring
     key1 ↗          key2 ↗          key3 ↗
```
Each node occupies a position on a hash ring. A key is assigned to the first node encountered clockwise from its hash position. Adding or removing a node only reassigns keys in the adjacent range.
Virtual nodes improve balance: each physical node maps to many positions on the ring, smoothing out hot spots. DynamoDB and Cassandra both use this approach.
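A minimal ring with virtual nodes can be built with `bisect` over sorted positions. MD5 is used here only as a convenient, stable hash; as noted above, production stores prefer faster non-cryptographic hashes:

```python
import bisect
import hashlib

# Sketch of a consistent-hash ring with virtual nodes: each physical node
# occupies `vnodes` positions, and a key maps to the first position
# clockwise from its own hash.
class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key):
        h = self._hash(key)
        # first virtual node clockwise from the key (wrapping around the ring)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]
```

Removing a node only reassigns the keys that node owned; every other key keeps its placement, which is exactly the property modular hashing lacks.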
## Partitioning Strategies
Beyond consistent hashing, two other strategies appear in practice:
| Strategy | Pros | Cons |
|---|---|---|
| Range partitioning | Efficient range scans | Hot spots on sequential writes |
| Hash partitioning | Uniform distribution | No range queries |
| Hybrid (compound key) | Partition by hash, sort by range | More complex schema |
DynamoDB uses hash partitioning on the partition key and supports range queries within a partition via the sort key — a classic hybrid approach.
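The hybrid scheme can be sketched with illustrative names (this is not the DynamoDB API): the partition key is hashed to pick a shard, and items within a shard stay sorted by sort key, so range queries work inside one partition:

```python
import bisect
import hashlib

# Hypothetical hybrid store: hash partitioning on pkey, range queries on skey.
class HybridStore:
    def __init__(self, num_partitions=4):
        # each partition holds (pkey, skey, value) tuples kept in sorted order
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition(self, pkey):
        h = int(hashlib.md5(pkey.encode()).hexdigest(), 16)
        return self.partitions[h % len(self.partitions)]

    def put(self, pkey, skey, value):
        bisect.insort(self._partition(pkey), (pkey, skey, value))

    def query(self, pkey, lo, hi):
        """Items for one partition key with lo <= sort key <= hi."""
        part = self._partition(pkey)
        i = bisect.bisect_left(part, (pkey, lo))  # (pkey, lo) sorts before (pkey, lo, v)
        return [(s, v) for p, s, v in part[i:] if p == pkey and s <= hi]
```
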
## Replication
Every durable key-value store replicates data to survive node failures. Two models:
### Leader-Follower (Strong Consistency)
One node accepts writes; followers replicate asynchronously or synchronously. etcd uses Raft consensus for leader election and synchronous replication, guaranteeing linearizability.
### Leaderless (Eventual Consistency)
Any node can accept writes. Reads query multiple replicas and reconcile conflicts. Dynamo, Riak, and Cassandra follow this model.
Quorum reads and writes ensure consistency without a leader:
```
W + R > N   →   guaranteed overlap between write and read sets

Example: N=3, W=2, R=2
  Write goes to 2 of 3 replicas
  Read queries 2 of 3 replicas
  → at least 1 replica has the latest write
```
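The overlap guarantee is easy to verify by brute force, enumerating every possible write set and read set (illustrative only):

```python
import itertools

# When W + R > N, every W-replica write set must intersect every R-replica
# read set, so at least one queried replica holds the latest write.
def quorums_overlap(n, w, r):
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in itertools.combinations(replicas, w)
               for rs in itertools.combinations(replicas, r))
```

With `W + R = N` the guarantee disappears: a write to replicas {0, 1} can be followed by a read from replica {2} alone, which misses it.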
## Write-Ahead Log (WAL)
Durability requires surviving crashes. Before applying a mutation to the in-memory data structure, the store appends it to a sequential log on disk:
- Client sends `PUT(key, value)`.
- Store appends the operation to the WAL (`fsync`).
- Store updates the in-memory hash table.
- Store acknowledges the write.
On restart, the store replays the WAL to rebuild its state. Redis AOF (append-only file) and etcd's WAL both follow this pattern.
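The four steps above can be sketched as a toy store; the JSON-lines log format is an illustrative choice, not the on-disk format Redis or etcd actually uses:

```python
import json
import os

# Minimal WAL sketch: append and fsync each mutation before applying it
# in memory, and replay the log on restart to rebuild state.
class TinyStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                op = json.loads(line)
                if op["op"] == "put":
                    self.data[op["key"]] = op["value"]
                else:  # "del"
                    self.data.pop(op["key"], None)

    def _append(self, record):
        with open(self.wal_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())      # durable on disk before we acknowledge

    def put(self, key, value):
        self._append({"op": "put", "key": key, "value": value})  # 1. log first
        self.data[key] = value                                   # 2. then apply

    def delete(self, key):
        self._append({"op": "del", "key": key})
        self.data.pop(key, None)
```

Constructing a second `TinyStore` on the same path simulates a crash and restart: the replayed state matches what was acknowledged.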
## Compaction: LSM Trees and SSTables
In-memory hash tables cannot hold unlimited data. The Log-Structured Merge-tree (LSM tree) solves this by combining an in-memory write buffer with on-disk sorted files.
### Write Path
- Writes go into an in-memory balanced tree (memtable).
- When the memtable exceeds a threshold, it is flushed to disk as a Sorted String Table (SSTable) — an immutable, sorted file of key-value pairs.
- The WAL is truncated after a successful flush.
### Read Path
- Check the memtable first.
- Check SSTables from newest to oldest.
- Use Bloom filters to skip SSTables that definitely do not contain the key.
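Both paths can be sketched together. Sorted Python lists stand in for on-disk SSTables; the WAL and Bloom filters are omitted to keep the sketch short:

```python
# Toy LSM tree: a dict memtable flushed into immutable sorted runs,
# with reads checking the memtable first, then SSTables newest to oldest.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []          # newest last; each is a sorted (key, value) list

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        sstable = sorted(self.memtable.items())   # immutable sorted run
        self.sstables.append(sstable)
        self.memtable = {}                        # WAL would be truncated here

    def get(self, key):
        if key in self.memtable:                  # 1. memtable first
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # 2. newest SSTable to oldest
            for k, v in sstable:                  # linear scan; real stores binary-search
                if k == key:
                    return v
        return None
```

Because newer SSTables are consulted first, an updated key shadows its older on-disk versions until compaction physically removes them.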
### Compaction
Over time, SSTables accumulate. Compaction merges overlapping SSTables, discards deleted keys (tombstones), and produces fewer, larger files:
```
Level 0:  [SST-1] [SST-2] [SST-3]   ← freshly flushed, may overlap
                ↓ merge
Level 1:  [SST-A] [SST-B]           ← non-overlapping, sorted
                ↓ merge
Level 2:  [SST-X]                   ← larger, fewer files
```
Size-tiered compaction (Cassandra default) groups similarly sized SSTables. Leveled compaction (LevelDB, RocksDB) maintains strict size ratios per level and offers better read performance at the cost of more write amplification.
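A single compaction pass is just a k-way merge over sorted runs. This sketch models tombstones as `None` values and keeps only the newest version of each key:

```python
import heapq

# One compaction pass: merge sorted SSTables, let the newest version of each
# key win, and drop tombstoned keys entirely.
def compact(sstables):
    """`sstables` is ordered newest-first; each is a sorted list of (key, value)."""
    # tag every entry with its table's age so ties on a key sort newest-first
    heap = [(key, age, value)
            for age, table in enumerate(sstables)
            for key, value in table]
    heapq.heapify(heap)
    merged, last_key = [], object()
    while heap:
        key, age, value = heapq.heappop(heap)
        if key == last_key:
            continue                  # older version of a key already handled
        last_key = key
        if value is not None:         # tombstone: discard the key entirely
            merged.append((key, value))
    return merged
```

Note that tombstones must survive until compaction reaches every SSTable that might still hold an older version of the key; dropping them too early would resurrect deleted data.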
## Dynamo-Style Design
Amazon's Dynamo paper (2007) introduced a design philosophy adopted by Riak, Cassandra, and eventually DynamoDB. Key ideas:
### Eventual Consistency
Writes propagate asynchronously. Two clients may temporarily see different values for the same key. The system converges over time via anti-entropy (Merkle tree comparison) and read repair (fixing stale replicas during reads).
### Vector Clocks
To detect conflicts, each value carries a vector clock — a list of (node, counter) pairs. When two versions have incompatible clocks (neither dominates), the application must resolve the conflict.
```
Client A writes: {A:1}
Client B writes: {B:1}    ← concurrent, conflict detected
Application merges both versions on next read
```
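The comparison rule can be stated in a few lines: one clock dominates another when every counter is greater than or equal and at least one is strictly greater; otherwise the versions are concurrent:

```python
# Vector clocks as {node: counter} dicts; a missing node counts as 0.
def dominates(a, b):
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) > b.get(n, 0) for n in nodes))

def compare(a, b):
    if dominates(a, b):
        return "a supersedes b"
    if dominates(b, a):
        return "b supersedes a"
    return "concurrent: application must merge"
```
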
### Sloppy Quorum and Hinted Handoff
When target replicas are unreachable, writes are temporarily stored on nearby healthy nodes (hinted handoff). Once the original node recovers, hints are replayed. This trades strict consistency for higher availability — the "sloppy" quorum.
## Tools and When to Use Them
| Store | Model | Best For |
|---|---|---|
| Redis | In-memory, leader-follower | Caching, sessions, leaderboards |
| DynamoDB | Managed, hash+range partitioning | Serverless apps, predictable latency |
| etcd | Raft consensus, strong consistency | Service discovery, config, leader election |
| Riak | Dynamo-style, eventual consistency | High-availability writes, conflict resolution |
| RocksDB | Embedded LSM tree | Storage engine for other databases |
### Redis
Single-threaded event loop, sub-millisecond latency, optional persistence via RDB snapshots or AOF. Use it when data fits in memory and you need speed.
### DynamoDB
Fully managed, auto-scaling, single-digit millisecond latency. Partition key design is critical — poor key choice leads to hot partitions and throttling.
### etcd
The backbone of Kubernetes. Strong consistency via Raft makes it ideal for distributed coordination, but write throughput is limited by consensus round-trips.
### Riak
Masterless, Dynamo-inspired. Excellent write availability, but conflict resolution adds application complexity. Less popular today but a great study of Dynamo principles in practice.
## Design Interview Checklist
When asked to design a key-value store, walk through these decisions:
- Read/write ratio — Read-heavy favors caching; write-heavy favors LSM trees.
- Consistency requirement — Strong (etcd/Raft) vs. eventual (Dynamo/quorum).
- Data size — Fits in memory (Redis) vs. disk-based (LSM tree).
- Partitioning — Consistent hashing with virtual nodes.
- Replication — Leader-follower vs. leaderless; quorum parameters.
- Durability — WAL + SSTable flush.
- Failure handling — Hinted handoff, read repair, Merkle trees.
- Compaction strategy — Size-tiered vs. leveled.
Build, visualize, and practice system design at codelit.io.
This is article #193 in the Codelit system design series.