# Key-Value Store Design: From Hash Tables to Dynamo
A key-value store is the simplest and most versatile distributed data system. You give it a key, it gives you a value. Behind that simple interface lies a rich design space covering hashing, partitioning, replication, durability, and compaction. Understanding these internals is essential for any system design interview — and for choosing the right store in production.
## Hash Table Internals
At its core, every key-value store is a hash table. A hash function maps each key to a slot in an array. Two design decisions dominate:
- Hash function choice — A good function distributes keys uniformly. MurmurHash3 and xxHash are popular for in-memory stores; cryptographic hashes like SHA-256 are unnecessary and slow.
- Collision resolution — Chaining (linked lists per bucket) or open addressing (probing for the next empty slot). Most production stores use chaining because deletes are simpler.
For an in-memory store like Redis, a well-tuned hash table with incremental rehashing is the entire storage engine.
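The chaining approach can be sketched in a few lines. This toy map (not Redis's actual dict implementation) shows why deletes stay simple with chaining and where rehashing hooks in:

```python
# Minimal chained hash table sketch: each bucket is a list of (key, value)
# pairs, and the table doubles in size when the load factor exceeds 2.
class ChainedHashMap:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)       # overwrite existing key
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > 2 * len(self.buckets):  # load factor > 2: grow
            self._rehash()

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)                  # chained delete: just drop the pair
                self.size -= 1
                return True
        return False

    def _rehash(self):
        old = [pair for bucket in self.buckets for pair in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        for k, v in old:
            self._bucket(k).append((k, v))
```

Redis's incremental rehashing spreads the `_rehash` work across many subsequent operations instead of doing it in one stop-the-world pass, keeping latency predictable.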
## Consistent Hashing
When you shard data across multiple nodes, naive modular hashing (`hash(key) % N`) forces a full reshuffle every time a node is added or removed. Consistent hashing solves this:
```
       Node A          Node B          Node C
──────────●───────────────●───────────────●──────────▶ hash ring
     key1 ↗          key2 ↗          key3 ↗
```
Each node occupies a position on a hash ring. A key is assigned to the first node encountered clockwise from its hash position. Adding or removing a node only reassigns keys in the adjacent range.
Virtual nodes improve balance: each physical node maps to many positions on the ring, smoothing out hot spots. DynamoDB and Cassandra both use this approach.
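A minimal ring with virtual nodes can be built with `bisect` over sorted positions. MD5 is used here only as a convenient, stable hash; as noted above, production stores prefer faster non-cryptographic hashes:

```python
import bisect
import hashlib

# Sketch of a consistent-hash ring with virtual nodes: each physical node
# occupies `vnodes` positions, and a key maps to the first position
# clockwise from its own hash.
class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key):
        h = self._hash(key)
        # first virtual node clockwise from the key (wrapping around the ring)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]
```

Removing a node only reassigns the keys that node owned; every other key keeps its placement, which is exactly the property modular hashing lacks.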
## Partitioning Strategies
Beyond consistent hashing, two other strategies appear in practice:
| Strategy | Pros | Cons |
|---|---|---|
| Range partitioning | Efficient range scans | Hot spots on sequential writes |
| Hash partitioning | Uniform distribution | No range queries |
| Hybrid (compound key) | Partition by hash, sort by range | More complex schema |
DynamoDB uses hash partitioning on the partition key and supports range queries within a partition via the sort key — a classic hybrid approach.
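The hybrid scheme can be sketched with illustrative names (this is not the DynamoDB API): the partition key is hashed to pick a shard, and items within a shard stay sorted by sort key, so range queries work inside one partition:

```python
import bisect
import hashlib

# Hypothetical hybrid store: hash partitioning on pkey, range queries on skey.
class HybridStore:
    def __init__(self, num_partitions=4):
        # each partition holds (pkey, skey, value) tuples kept in sorted order
        self.partitions = [[] for _ in range(num_partitions)]

    def _partition(self, pkey):
        h = int(hashlib.md5(pkey.encode()).hexdigest(), 16)
        return self.partitions[h % len(self.partitions)]

    def put(self, pkey, skey, value):
        bisect.insort(self._partition(pkey), (pkey, skey, value))

    def query(self, pkey, lo, hi):
        """Items for one partition key with lo <= sort key <= hi."""
        part = self._partition(pkey)
        i = bisect.bisect_left(part, (pkey, lo))  # (pkey, lo) sorts before (pkey, lo, v)
        return [(s, v) for p, s, v in part[i:] if p == pkey and s <= hi]
```
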
## Replication
Every durable key-value store replicates data to survive node failures. Two models:
### Leader-Follower (Strong Consistency)
One node accepts writes; followers replicate asynchronously or synchronously. etcd uses Raft consensus for leader election and synchronous replication, guaranteeing linearizability.
### Leaderless (Eventual Consistency)
Any node can accept writes. Reads query multiple replicas and reconcile conflicts. Dynamo, Riak, and Cassandra follow this model.
Quorum reads and writes ensure consistency without a leader:
```
W + R > N   →   guaranteed overlap between write and read sets

Example: N=3, W=2, R=2
  Write goes to 2 of 3 replicas
  Read queries 2 of 3 replicas
  → at least 1 replica has the latest write
```
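The overlap guarantee is easy to verify by brute force, enumerating every possible write set and read set (illustrative only):

```python
import itertools

# When W + R > N, every W-replica write set must intersect every R-replica
# read set, so at least one queried replica holds the latest write.
def quorums_overlap(n, w, r):
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in itertools.combinations(replicas, w)
               for rs in itertools.combinations(replicas, r))
```

With `W + R = N` the guarantee disappears: a write to replicas {0, 1} can be followed by a read from replica {2} alone, which misses it.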
## Write-Ahead Log (WAL)
Durability requires surviving crashes. Before applying a mutation to the in-memory data structure, the store appends it to a sequential log on disk:
- Client sends `PUT(key, value)`.
- Store appends the operation to the WAL (`fsync`).
- Store updates the in-memory hash table.
- Store acknowledges the write.
On restart, the store replays the WAL to rebuild its state. Redis AOF (append-only file) and etcd's WAL both follow this pattern.
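The four steps above can be sketched as a toy store; the JSON-lines log format is an illustrative choice, not the on-disk format Redis or etcd actually uses:

```python
import json
import os

# Minimal WAL sketch: append and fsync each mutation before applying it
# in memory, and replay the log on restart to rebuild state.
class TinyStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._replay()

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                op = json.loads(line)
                if op["op"] == "put":
                    self.data[op["key"]] = op["value"]
                else:  # "del"
                    self.data.pop(op["key"], None)

    def _append(self, record):
        with open(self.wal_path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())      # durable on disk before we acknowledge

    def put(self, key, value):
        self._append({"op": "put", "key": key, "value": value})  # 1. log first
        self.data[key] = value                                   # 2. then apply

    def delete(self, key):
        self._append({"op": "del", "key": key})
        self.data.pop(key, None)
```

Constructing a second `TinyStore` on the same path simulates a crash and restart: the replayed state matches what was acknowledged.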
## Compaction: LSM Trees and SSTables
In-memory hash tables cannot hold unlimited data. The Log-Structured Merge-tree (LSM tree) solves this by combining an in-memory write buffer with on-disk sorted files.
### Write Path
- Writes go into an in-memory balanced tree (memtable).
- When the memtable exceeds a threshold, it is flushed to disk as a Sorted String Table (SSTable) — an immutable, sorted file of key-value pairs.
- The WAL is truncated after a successful flush.
### Read Path
- Check the memtable first.
- Check SSTables from newest to oldest.
- Use Bloom filters to skip SSTables that definitely do not contain the key.
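Both paths can be sketched together. Sorted Python lists stand in for on-disk SSTables; the WAL and Bloom filters are omitted to keep the sketch short:

```python
# Toy LSM tree: a dict memtable flushed into immutable sorted runs,
# with reads checking the memtable first, then SSTables newest to oldest.
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []          # newest last; each is a sorted (key, value) list

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        sstable = sorted(self.memtable.items())   # immutable sorted run
        self.sstables.append(sstable)
        self.memtable = {}                        # WAL would be truncated here

    def get(self, key):
        if key in self.memtable:                  # 1. memtable first
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # 2. newest SSTable to oldest
            for k, v in sstable:                  # linear scan; real stores binary-search
                if k == key:
                    return v
        return None
```

Because newer SSTables are consulted first, an updated key shadows its older on-disk versions until compaction physically removes them.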
### Compaction
Over time, SSTables accumulate. Compaction merges overlapping SSTables, discards deleted keys (tombstones), and produces fewer, larger files:
```
Level 0:  [SST-1] [SST-2] [SST-3]   ← freshly flushed, may overlap
                ↓ merge
Level 1:  [SST-A] [SST-B]           ← non-overlapping, sorted
                ↓ merge
Level 2:  [SST-X]                   ← larger, fewer files
```
Size-tiered compaction (Cassandra default) groups similarly sized SSTables. Leveled compaction (LevelDB, RocksDB) maintains strict size ratios per level and offers better read performance at the cost of more write amplification.
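A single compaction pass is just a k-way merge over sorted runs. This sketch models tombstones as `None` values and keeps only the newest version of each key:

```python
import heapq

# One compaction pass: merge sorted SSTables, let the newest version of each
# key win, and drop tombstoned keys entirely.
def compact(sstables):
    """`sstables` is ordered newest-first; each is a sorted list of (key, value)."""
    # tag every entry with its table's age so ties on a key sort newest-first
    heap = [(key, age, value)
            for age, table in enumerate(sstables)
            for key, value in table]
    heapq.heapify(heap)
    merged, last_key = [], object()
    while heap:
        key, age, value = heapq.heappop(heap)
        if key == last_key:
            continue                  # older version of a key already handled
        last_key = key
        if value is not None:         # tombstone: discard the key entirely
            merged.append((key, value))
    return merged
```

Note that tombstones must survive until compaction reaches every SSTable that might still hold an older version of the key; dropping them too early would resurrect deleted data.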
## Dynamo-Style Design
Amazon's Dynamo paper (2007) introduced a design philosophy adopted by Riak, Cassandra, and eventually DynamoDB. Key ideas:
### Eventual Consistency
Writes propagate asynchronously. Two clients may temporarily see different values for the same key. The system converges over time via anti-entropy (Merkle tree comparison) and read repair (fixing stale replicas during reads).
### Vector Clocks
To detect conflicts, each value carries a vector clock — a list of (node, counter) pairs. When two versions have incompatible clocks (neither dominates), the application must resolve the conflict.
```
Client A writes: {A:1}
Client B writes: {B:1}    ← concurrent, conflict detected
Application merges both versions on next read
```
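The comparison rule can be stated in a few lines: one clock dominates another when every counter is greater than or equal and at least one is strictly greater; otherwise the versions are concurrent:

```python
# Vector clocks as {node: counter} dicts; a missing node counts as 0.
def dominates(a, b):
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) > b.get(n, 0) for n in nodes))

def compare(a, b):
    if dominates(a, b):
        return "a supersedes b"
    if dominates(b, a):
        return "b supersedes a"
    return "concurrent: application must merge"
```
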
### Sloppy Quorum and Hinted Handoff
When target replicas are unreachable, writes are temporarily stored on nearby healthy nodes (hinted handoff). Once the original node recovers, hints are replayed. This trades strict consistency for higher availability — the "sloppy" quorum.
## Tools and When to Use Them
| Store | Model | Best For |
|---|---|---|
| Redis | In-memory, leader-follower | Caching, sessions, leaderboards |
| DynamoDB | Managed, hash+range partitioning | Serverless apps, predictable latency |
| etcd | Raft consensus, strong consistency | Service discovery, config, leader election |
| Riak | Dynamo-style, eventual consistency | High-availability writes, conflict resolution |
| RocksDB | Embedded LSM tree | Storage engine for other databases |
### Redis
Single-threaded event loop, sub-millisecond latency, optional persistence via RDB snapshots or AOF. Use it when data fits in memory and you need speed.
### DynamoDB
Fully managed, auto-scaling, single-digit millisecond latency. Partition key design is critical — poor key choice leads to hot partitions and throttling.
### etcd
The backbone of Kubernetes. Strong consistency via Raft makes it ideal for distributed coordination, but write throughput is limited by consensus round-trips.
### Riak
Masterless, Dynamo-inspired. Excellent write availability, but conflict resolution adds application complexity. Less popular today but a great study of Dynamo principles in practice.
## Design Interview Checklist
When asked to design a key-value store, walk through these decisions:
- Read/write ratio — Read-heavy favors caching; write-heavy favors LSM trees.
- Consistency requirement — Strong (etcd/Raft) vs. eventual (Dynamo/quorum).
- Data size — Fits in memory (Redis) vs. disk-based (LSM tree).
- Partitioning — Consistent hashing with virtual nodes.
- Replication — Leader-follower vs. leaderless; quorum parameters.
- Durability — WAL + SSTable flush.
- Failure handling — Hinted handoff, read repair, Merkle trees.
- Compaction strategy — Size-tiered vs. leveled.
Build, visualize, and practice system design at codelit.io.
This is article #193 in the Codelit system design series.