# LSM Tree: The Write-Optimized Storage Engine Behind Modern Databases
Most relational databases rely on B-trees for indexing. B-trees are read-friendly, but every random write touches disk in place — a costly operation at scale. The log-structured merge tree (LSM tree) takes the opposite approach: buffer writes in memory, flush them sequentially, and merge sorted files in the background. The result is dramatically higher write throughput at the cost of more complex reads.
## How an LSM Tree Works
An LSM tree organizes data across two layers: an in-memory buffer and a set of on-disk sorted files.
```
Write path
─────────────────────────────────────────────────
Client write ──▶ WAL (append) ──▶ Memtable (sorted in memory)
                                        │
                               (threshold reached)
                                        ▼
                             Flush to SSTable on disk
                                        │
                              Background compaction
                                        ▼
                         Merged, deduplicated SSTables
```
### Memtable
The memtable is a sorted, in-memory data structure — typically a red-black tree or skip list. Every write goes to the memtable first (after being appended to a write-ahead log for durability). When the memtable exceeds a size threshold (commonly 64 MB), it becomes immutable and is flushed to disk as a sorted string table (SSTable).
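The flow above can be made concrete with a toy memtable. This sketch is hypothetical (real engines use a skip list or red-black tree for concurrent access); it keeps keys ordered with Python's `bisect` and signals when it is large enough to flush:

```python
import bisect

class Memtable:
    """Minimal sorted in-memory write buffer (illustrative sketch)."""

    FLUSH_THRESHOLD = 64 * 1024 * 1024  # 64 MB, a common default

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> latest value
        self._size = 0     # approximate bytes buffered

    def put(self, key: str, value: str) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert keeps order
        self._values[key] = value           # newer write wins
        self._size += len(key) + len(value)

    def get(self, key: str):
        return self._values.get(key)

    def should_flush(self) -> bool:
        return self._size >= self.FLUSH_THRESHOLD

    def items_sorted(self):
        """Yield entries in key order, ready to write as an SSTable."""
        for k in self._keys:
            yield k, self._values[k]
```

Because entries come out already sorted, flushing to an SSTable is a single sequential pass.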
### SSTables
An SSTable is an immutable, sorted file on disk. Each SSTable contains a sequence of key-value pairs in sorted order, along with an index block for binary search. Because SSTables are written sequentially and never modified in place, they are extremely efficient to produce and cache-friendly to read.
### Write-Ahead Log
Before touching the memtable, every mutation is appended to a write-ahead log (WAL). If the process crashes before the memtable is flushed, the WAL replays uncommitted writes on restart. Once an SSTable is safely on disk, its corresponding WAL segment is deleted.
## Compaction Strategies
Over time, multiple SSTables accumulate on disk. Many of them contain stale versions of the same key. Compaction merges SSTables, discards obsolete entries, and reclaims space. The compaction strategy has a profound effect on read amplification, write amplification, and space amplification.
### Size-Tiered Compaction
SSTables of similar sizes are grouped into tiers. When a tier accumulates enough files (e.g., four), they are merged into a single larger SSTable that moves to the next tier.
- Pros: High write throughput, simple logic.
- Cons: Temporary space amplification (old and new files coexist during merge). Reads may scan many files.
Cassandra uses size-tiered compaction by default.
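The merge at the heart of any compaction is a k-way merge over sorted runs. In this sketch each SSTable is modeled as a sorted list of pairs, ordered newest-first, and the first version seen for a key wins:

```python
import heapq

def compact(sstables):
    """Merge sorted runs into one, keeping the newest version per key.
    `sstables` is ordered newest-first; each run is a sorted list of
    (key, value) pairs (illustrative sketch)."""
    # Tag entries with their run's age so equal keys sort newest-first.
    runs = [[(key, age, value) for key, value in run]
            for age, run in enumerate(sstables)]
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*runs):
        if key != last_key:  # first occurrence is the newest version
            merged.append((key, value))
            last_key = key
    return merged
```

Because every input run is sorted, the merge is a single sequential pass, which is what makes compaction disk-friendly.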
### Leveled Compaction
SSTables are organized into levels (L0, L1, L2, ...). Each level has a size limit that grows by a fixed multiplier (typically 10x). When a level exceeds its limit, one SSTable is picked and merged into the overlapping range of the next level.
- Pros: Bounded space amplification. Reads touch fewer files because key ranges do not overlap within a level.
- Cons: Higher write amplification — a single key may be rewritten across many levels.
RocksDB and LevelDB default to leveled compaction.
### FIFO Compaction
The oldest SSTables are simply deleted once total size exceeds a threshold. FIFO compaction suits time-series or TTL-based workloads where old data naturally expires.
## Write Amplification
Write amplification measures how many bytes are written to disk for each byte of user data. In leveled compaction, a single key may be rewritten every time its SSTable participates in a merge, potentially once per level. With a 10x size ratio, each level transition can cost roughly 10x, so a five-level tree has a worst-case write amplification near 50x.
Tuning levers include:
- Increasing the size ratio — fewer levels, less rewriting, but larger individual merges.
- Using size-tiered compaction — lower write amplification at the cost of higher space amplification.
- Combining strategies — RocksDB supports universal compaction, a hybrid approach.
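The arithmetic behind these levers is simple enough to sketch. The function below is a rough model (not any engine's actual formula): each byte may be rewritten about `size_ratio` times per level it descends through:

```python
def worst_case_write_amp(levels: int, size_ratio: int) -> int:
    """Rough worst-case write amplification for leveled compaction:
    roughly size_ratio rewrites per level transition (model only)."""
    return levels * size_ratio

# Five levels with a 10x ratio: each user byte may cost ~50 disk bytes.
assert worst_case_write_amp(5, 10) == 50
```

Raising the size ratio to 20 with three levels gives ~60, which shows why the ratio and level count must be tuned together rather than in isolation.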
## Bloom Filters for Reads
Reading from an LSM tree is inherently more expensive than from a B-tree because the key might reside in any SSTable. A point lookup in the worst case checks the memtable, then every SSTable from newest to oldest.
Bloom filters dramatically reduce this cost. Each SSTable carries a bloom filter — a probabilistic data structure that answers "is this key possibly in this file?" in O(1) time. A negative answer is definitive: skip the file. A positive answer may be a false positive (tunable, commonly under 1%), so the SSTable is checked.
With bloom filters, a point lookup typically touches only one or two SSTables instead of dozens.
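A minimal bloom filter sketch, hashing each key `k` times with salted SHA-256 (an illustrative choice; real engines use faster non-cryptographic hashes such as xxHash):

```python
import hashlib

class BloomFilter:
    """Minimal bloom filter: k hash probes into an m-bit array.
    A clear bit at any probe means the key is definitely absent."""

    def __init__(self, m_bits: int = 8192, k_hashes: int = 7):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _probes(self, key: str):
        # Derive k independent positions by salting the hash input.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key: str) -> None:
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # True may be a false positive; False is always definitive.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._probes(key))
```

In an LSM engine, one such filter is built per SSTable at flush time and consulted before the file's index is ever touched.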
## Read Path in Detail
```
Point read for key K
─────────────────────────────────────────────────
1. Check active memtable
2. Check immutable memtable (if flush in progress)
3. For each SSTable (newest → oldest):
   a. Consult bloom filter → skip if negative
   b. Binary search index block
   c. Read data block → return if found
4. Key does not exist
```
Range scans merge iterators across all SSTables using a priority queue (merge sort), returning keys in order.
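The numbered steps condense into a short lookup routine. In this sketch, plain Python sets stand in for bloom filters and dicts for SSTable contents (both are modeling assumptions):

```python
def point_read(key, memtable, sstables):
    """Point-lookup sketch: memtable first, then SSTables
    newest-to-oldest, consulting each file's bloom filter before
    touching it. `sstables` is a list of (bloom, table) pairs."""
    if key in memtable:
        return memtable[key]
    for bloom, table in sstables:  # newest first
        if key not in bloom:       # negative answer is definitive
            continue
        if key in table:           # positive may be a false positive
            return table[key]
    return None                    # key does not exist
```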
## LSM Tree vs B-Tree
| Dimension | LSM Tree | B-Tree |
|---|---|---|
| Write throughput | High (sequential I/O) | Moderate (random I/O) |
| Read latency | Higher (multiple files) | Lower (single tree) |
| Write amplification | Can be high (compaction) | Moderate (page splits) |
| Space amplification | Varies by compaction | Low (in-place updates) |
| Concurrency | Lock-free memtable writes | Page-level locking |
| Use case sweet spot | Write-heavy, time-series | Read-heavy, OLTP |
In practice, many systems let you choose. MySQL with MyRocks uses LSM trees; MySQL with InnoDB uses B-trees. The workload dictates the winner.
## Tools and Implementations
### RocksDB
Developed at Facebook (now Meta), RocksDB is the most widely deployed LSM engine. It powers Meta's internal MySQL (MyRocks) and TiKV (the storage engine behind TiDB), and was CockroachDB's original storage layer before being replaced by Pebble, a Go reimplementation. RocksDB offers leveled, universal, and FIFO compaction, column families, rate limiters, and extensive tuning knobs.
### LevelDB
Google's original LSM implementation. LevelDB is simpler and more constrained than RocksDB — single-threaded compaction, no column families — but remains a solid reference implementation and powers Chrome's IndexedDB backend.
### Apache Cassandra
Cassandra uses an LSM storage engine with pluggable compaction strategies: SizeTieredCompactionStrategy, LeveledCompactionStrategy, TimeWindowCompactionStrategy (ideal for time-series), and UnifiedCompactionStrategy in newer versions.
### ScyllaDB
A C++ rewrite of Cassandra that pushes LSM performance further with shard-per-core architecture and incremental compaction to avoid latency spikes during large merges.
## Practical Tuning Tips
- Size your memtable to balance write latency and flush frequency. Larger memtables absorb more writes but increase recovery time on crash.
- Tune bloom filter bits per key — 10 bits gives roughly 1% false positive rate. Increasing to 14 bits drops it to 0.1%.
- Monitor compaction debt — if compaction falls behind, reads degrade because more SSTables accumulate. RocksDB exposes `compaction_pending` metrics.
- Use compression wisely — Snappy or LZ4 for speed, Zstd for higher compression on cold levels.
- Separate WAL and data directories onto different disks to avoid I/O contention between logging and compaction.
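The bits-per-key rule of thumb above follows from standard bloom filter math: with the optimal hash count k = b·ln 2, the false positive rate is (1/2)^k, which works out to roughly 0.6185 per bit per key:

```python
import math

def bloom_fp_rate(bits_per_key: float) -> float:
    """False positive rate at the optimal hash count k = b * ln(2):
    p = (1/2) ** k, i.e. about 0.6185 ** bits_per_key."""
    k = bits_per_key * math.log(2)
    return 0.5 ** k

# 10 bits/key ≈ 0.8% false positives; 14 bits/key ≈ 0.12%.
assert round(bloom_fp_rate(10), 3) == 0.008
```

Each extra bit per key multiplies the false positive rate by ~0.62, so the memory cost of tighter filters grows only logarithmically in the target rate.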
## When to Choose an LSM Tree
Choose an LSM-based engine when your workload is write-heavy or append-heavy: event logs, time-series telemetry, message queues, and large-scale key-value stores. If your workload is read-dominated with complex queries (joins, aggregations), a B-tree-based engine typically wins.
Understanding LSM trees is essential for system design interviews and for anyone building or operating distributed databases. The trade-offs between write amplification, read amplification, and space amplification sit at the heart of every storage engine decision.
This is article #275 on Codelit.