# LSM Tree: The Write-Optimized Storage Engine Behind Modern Databases
Most relational databases rely on B-trees for indexing. B-trees are read-friendly, but every random write touches disk in place — a costly operation at scale. The log-structured merge tree (LSM tree) takes the opposite approach: buffer writes in memory, flush them sequentially, and merge sorted files in the background. The result is dramatically higher write throughput at the cost of more complex reads.
## How an LSM Tree Works
An LSM tree organizes data across two layers: an in-memory buffer and a set of on-disk sorted files.
```
Write path
─────────────────────────────────────────────────
Client write ──▶ WAL (append) ──▶ Memtable (sorted in memory)
                                        │
                               (threshold reached)
                                        ▼
                             Flush to SSTable on disk
                                        │
                              Background compaction
                                        ▼
                         Merged, deduplicated SSTables
```
### Memtable
The memtable is a sorted, in-memory data structure — typically a red-black tree or skip list. Every write goes to the memtable first (after being appended to a write-ahead log for durability). When the memtable exceeds a size threshold (commonly 64 MB), it becomes immutable and is flushed to disk as a sorted string table (SSTable).
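The flow above can be made concrete with a toy memtable. This sketch is hypothetical (real engines use a skip list or red-black tree for concurrent access); it keeps keys ordered with Python's `bisect` and signals when it is large enough to flush:

```python
import bisect

class Memtable:
    """Minimal sorted in-memory write buffer (illustrative sketch)."""

    FLUSH_THRESHOLD = 64 * 1024 * 1024  # 64 MB, a common default

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> latest value
        self._size = 0     # approximate bytes buffered

    def put(self, key: str, value: str) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert keeps order
        self._values[key] = value           # newer write wins
        self._size += len(key) + len(value)

    def get(self, key: str):
        return self._values.get(key)

    def should_flush(self) -> bool:
        return self._size >= self.FLUSH_THRESHOLD

    def items_sorted(self):
        """Yield entries in key order, ready to write as an SSTable."""
        for k in self._keys:
            yield k, self._values[k]
```

Because entries come out already sorted, flushing to an SSTable is a single sequential pass.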
### SSTables
An SSTable is an immutable, sorted file on disk. Each SSTable contains a sequence of key-value pairs in sorted order, along with an index block for binary search. Because SSTables are written sequentially and never modified in place, they are extremely efficient to produce and cache-friendly to read.
### Write-Ahead Log
Before touching the memtable, every mutation is appended to a write-ahead log (WAL). If the process crashes before the memtable is flushed, the WAL replays uncommitted writes on restart. Once an SSTable is safely on disk, its corresponding WAL segment is deleted.
## Compaction Strategies
Over time, multiple SSTables accumulate on disk. Many of them contain stale versions of the same key. Compaction merges SSTables, discards obsolete entries, and reclaims space. The compaction strategy has a profound effect on read amplification, write amplification, and space amplification.
### Size-Tiered Compaction
SSTables of similar sizes are grouped into tiers. When a tier accumulates enough files (e.g., four), they are merged into a single larger SSTable that moves to the next tier.
- Pros: High write throughput, simple logic.
- Cons: Temporary space amplification (old and new files coexist during merge). Reads may scan many files.
Cassandra uses size-tiered compaction by default.
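The merge at the heart of any compaction is a k-way merge over sorted runs. In this sketch each SSTable is modeled as a sorted list of pairs, ordered newest-first, and the first version seen for a key wins:

```python
import heapq

def compact(sstables):
    """Merge sorted runs into one, keeping the newest version per key.
    `sstables` is ordered newest-first; each run is a sorted list of
    (key, value) pairs (illustrative sketch)."""
    # Tag entries with their run's age so equal keys sort newest-first.
    runs = [[(key, age, value) for key, value in run]
            for age, run in enumerate(sstables)]
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*runs):
        if key != last_key:  # first occurrence is the newest version
            merged.append((key, value))
            last_key = key
    return merged
```

Because every input run is sorted, the merge is a single sequential pass, which is what makes compaction disk-friendly.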
### Leveled Compaction
SSTables are organized into levels (L0, L1, L2, ...). Each level has a size limit that grows by a fixed multiplier (typically 10x). When a level exceeds its limit, one SSTable is picked and merged into the overlapping range of the next level.
- Pros: Bounded space amplification. Reads touch fewer files because key ranges do not overlap within a level.
- Cons: Higher write amplification — a single key may be rewritten across many levels.
RocksDB and LevelDB default to leveled compaction.
### FIFO Compaction
The oldest SSTables are simply deleted once total size exceeds a threshold. FIFO compaction suits time-series or TTL-based workloads where old data naturally expires.
## Write Amplification
Write amplification measures how many bytes are written to disk for each byte of user data. In leveled compaction, a single key may be rewritten every time its SSTable participates in a merge, potentially once per level. With a 10x size ratio, each level transition can cost roughly 10x, so a five-level tree has a worst-case write amplification near 50x.
Tuning levers include:
- Increasing the size ratio — fewer levels, less rewriting, but larger individual merges.
- Using size-tiered compaction — lower write amplification at the cost of higher space amplification.
- Combining strategies — RocksDB supports universal compaction, a hybrid approach.
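The arithmetic behind these levers is simple enough to sketch. The function below is a rough model (not any engine's actual formula): each byte may be rewritten about `size_ratio` times per level it descends through:

```python
def worst_case_write_amp(levels: int, size_ratio: int) -> int:
    """Rough worst-case write amplification for leveled compaction:
    roughly size_ratio rewrites per level transition (model only)."""
    return levels * size_ratio

# Five levels with a 10x ratio: each user byte may cost ~50 disk bytes.
assert worst_case_write_amp(5, 10) == 50
```

Raising the size ratio to 20 with three levels gives ~60, which shows why the ratio and level count must be tuned together rather than in isolation.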
## Bloom Filters for Reads
Reading from an LSM tree is inherently more expensive than from a B-tree because the key might reside in any SSTable. A point lookup in the worst case checks the memtable, then every SSTable from newest to oldest.
Bloom filters dramatically reduce this cost. Each SSTable carries a bloom filter — a probabilistic data structure that answers "is this key possibly in this file?" in O(1) time. A negative answer is definitive: skip the file. A positive answer may be a false positive (tunable, commonly under 1%), so the SSTable is checked.
With bloom filters, a point lookup typically touches only one or two SSTables instead of dozens.
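A minimal bloom filter sketch, hashing each key `k` times with salted SHA-256 (an illustrative choice; real engines use faster non-cryptographic hashes such as xxHash):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash probes into an m-bit array.
    A clear bit at any probe means the key is definitely absent."""

    def __init__(self, m_bits: int = 8192, k_hashes: int = 7):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _probes(self, key: str):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # True may be a false positive; False is always definitive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(key))
```

In an LSM engine, one such filter is built per SSTable at flush time and consulted before the file's index is ever touched.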
## Read Path in Detail
```
Point read for key K
─────────────────────────────────────────────────
1. Check active memtable
2. Check immutable memtable (if flush in progress)
3. For each SSTable (newest → oldest):
   a. Consult bloom filter → skip if negative
   b. Binary search index block
   c. Read data block → return if found
4. Key does not exist
```
Range scans merge iterators across all SSTables using a priority queue (merge sort), returning keys in order.
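The numbered steps condense into a short lookup routine. In this sketch, plain Python sets stand in for bloom filters and dicts for SSTable contents (both are modeling assumptions):

```python
def point_read(key, memtable, sstables):
    """Point-lookup sketch: memtable first, then SSTables
    newest-to-oldest, consulting each file's bloom filter before
    touching it. `sstables` is a list of (bloom, table) pairs."""
    if key in memtable:
        return memtable[key]
    for bloom, table in sstables:  # newest first
        if key not in bloom:       # negative answer is definitive
            continue
        if key in table:           # positive may be a false positive
            return table[key]
    return None                    # key does not exist
```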
## LSM Tree vs B-Tree
| Dimension | LSM Tree | B-Tree |
|---|---|---|
| Write throughput | High (sequential I/O) | Moderate (random I/O) |
| Read latency | Higher (multiple files) | Lower (single tree) |
| Write amplification | Can be high (compaction) | Moderate (page splits) |
| Space amplification | Varies by compaction | Low (in-place updates) |
| Concurrency | Lock-free memtable writes | Page-level locking |
| Use case sweet spot | Write-heavy, time-series | Read-heavy, OLTP |
In practice, many systems let you choose. MySQL with MyRocks uses LSM trees; MySQL with InnoDB uses B-trees. The workload dictates the winner.
## Tools and Implementations
### RocksDB
Developed at Facebook (now Meta), RocksDB is the most widely deployed LSM engine. It powers Meta's internal MySQL (MyRocks) and TiKV (the storage engine behind TiDB), and was CockroachDB's original storage layer before being replaced by Pebble, a Go reimplementation. RocksDB offers leveled, universal, and FIFO compaction, column families, rate limiters, and extensive tuning knobs.
### LevelDB
Google's original LSM implementation. LevelDB is simpler and more constrained than RocksDB — single-threaded compaction, no column families — but remains a solid reference implementation and powers Chrome's IndexedDB backend.
### Apache Cassandra
Cassandra uses an LSM storage engine with pluggable compaction strategies: SizeTieredCompactionStrategy, LeveledCompactionStrategy, TimeWindowCompactionStrategy (ideal for time-series), and UnifiedCompactionStrategy in newer versions.
### ScyllaDB
A C++ rewrite of Cassandra that pushes LSM performance further with shard-per-core architecture and incremental compaction to avoid latency spikes during large merges.
## Practical Tuning Tips
- Size your memtable to balance write latency and flush frequency. Larger memtables absorb more writes but increase recovery time on crash.
- Tune bloom filter bits per key — 10 bits gives roughly 1% false positive rate. Increasing to 14 bits drops it to 0.1%.
- Monitor compaction debt — if compaction falls behind, reads degrade because more SSTables accumulate. RocksDB exposes `compaction_pending` metrics.
- Use compression wisely — Snappy or LZ4 for speed, Zstd for higher compression on cold levels.
- Separate WAL and data directories onto different disks to avoid I/O contention between logging and compaction.
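The bits-per-key rule of thumb above follows from standard bloom filter math: with the optimal hash count k = b·ln 2, the false positive rate is (1/2)^k, which works out to roughly 0.6185 per bit per key:

```python
import math

def bloom_fp_rate(bits_per_key: float) -> float:
    """False positive rate at the optimal hash count k = b * ln(2):
    p = (1/2) ** k, i.e. about 0.6185 ** bits_per_key."""
    k = bits_per_key * math.log(2)
    return 0.5 ** k

# 10 bits/key ≈ 0.8% false positives; 14 bits/key ≈ 0.12%.
assert round(bloom_fp_rate(10), 3) == 0.008
```

Each extra bit per key multiplies the false positive rate by ~0.62, so the memory cost of tighter filters grows only logarithmically in the target rate.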
## When to Choose an LSM Tree
Choose an LSM-based engine when your workload is write-heavy or append-heavy: event logs, time-series telemetry, message queues, and large-scale key-value stores. If your workload is read-dominated with complex queries (joins, aggregations), a B-tree-based engine typically wins.
Understanding LSM trees is essential for system design interviews and for anyone building or operating distributed databases. The trade-offs between write amplification, read amplification, and space amplification sit at the heart of every storage engine decision.
This is article #275 on Codelit.