Write-Ahead Log — How Databases Guarantee Durability and Crash Recovery
What is a write-ahead log?#
A write-ahead log (WAL) is a sequential, append-only file where a database records every change before applying it to the actual data files. If the system crashes mid-write, the database replays the log on startup to recover a consistent state.
The rule is simple: write the log first, then modify the data. No exceptions.
Why WAL exists#
Without WAL, a crash during a page write can leave data half-written — a torn page. The database would have no way to know which writes completed and which did not.
WAL solves this by guaranteeing that:
- Every modification is recorded in a durable log before it touches disk
- The log is sequential (fast on any storage medium)
- Recovery replays committed transactions and discards incomplete ones
How WAL works step by step#
- Transaction begins — the database assigns a log sequence number (LSN)
- Log record written — the change (old value, new value, page ID) is appended to the WAL
- WAL flushed to disk — `fsync()` ensures the record is durable
- Acknowledge commit — the client gets a success response
- Background flush — dirty pages are written to data files asynchronously
The key insight: the data files can lag behind. As long as the WAL is intact, every committed transaction can be reconstructed.
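The commit path above can be sketched in a few lines of Python. This is a minimal illustration of the ordering guarantee (log record, then `fsync`, then acknowledge), not any real database's record format:

```python
import os

class WriteAheadLog:
    """Minimal WAL sketch: append a record, fsync, then acknowledge."""

    def __init__(self, path):
        self.path = path
        self.lsn = 0  # log sequence number

    def commit(self, page_id, old_value, new_value):
        self.lsn += 1
        record = f"{self.lsn}|{page_id}|{old_value}|{new_value}\n"
        # 1. Append the log record (a cheap sequential write)
        with open(self.path, "a") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # 2. Force the record to stable storage
        # 3. Only now is it safe to acknowledge the commit; the data page
        #    itself can be flushed later, in the background.
        return self.lsn
```

Note that the data page is never touched here: durability comes entirely from the fsynced log record.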
Crash recovery#
On startup after a crash, the database:
- Reads the WAL from the last checkpoint
- Redo phase — replays all committed transactions whose data pages may not have been flushed
- Undo phase — rolls back any transactions that were in progress but never committed
- Resumes normal operation
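The redo and undo phases can be sketched over an in-memory list of log records. The record format here (`op`, `txid`, `old`, `new`) is purely illustrative, not any real on-disk layout:

```python
def recover(records):
    """ARIES-style recovery sketch over an ordered list of WAL records.

    Each record is a dict such as {"op": "write", "txid": 1, "page": "A",
    "old": None, "new": "x"} or {"op": "commit", "txid": 1}.
    Returns the reconstructed page state.
    """
    committed = {r["txid"] for r in records if r["op"] == "commit"}
    pages = {}
    # Redo phase: replay every logged write in log order, committed or not
    for r in records:
        if r["op"] == "write":
            pages[r["page"]] = r["new"]
    # Undo phase: walk the log backwards, restoring the before-image of
    # every write made by a transaction that never committed
    for r in reversed(records):
        if r["op"] == "write" and r["txid"] not in committed:
            pages[r["page"]] = r["old"]
    return pages
```

For example, if transaction 1 commits a write of `"x"` to page A and transaction 2 then overwrites it with `"y"` but crashes before committing, recovery restores A to `"x"`.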
This redo/undo structure follows ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), the classic recovery protocol that most relational databases implement or adapt.
Checkpointing#
If the database replayed the entire WAL on every restart, recovery would take forever. Checkpoints solve this.
A checkpoint:
- Flushes all dirty pages to disk
- Records the current LSN as the checkpoint position
- Allows the database to truncate or recycle WAL segments before that point
Checkpoint strategies:
| Strategy | How it works | Trade-off |
|---|---|---|
| Time-based | Checkpoint every N minutes | Simple but unpredictable I/O |
| Size-based | Checkpoint when WAL reaches N MB | Bounds recovery time |
| Fuzzy checkpoint | Flush pages incrementally | Less I/O spike, more complex |
PostgreSQL uses a combination of size-based and time-based checkpointing, controlled by `checkpoint_timeout` and `max_wal_size`.
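As a sketch, the relevant `postgresql.conf` settings look like this (the values shown are illustrative, not recommendations; tune them for your workload):

```ini
# postgresql.conf — illustrative values
checkpoint_timeout = 15min            # time-based trigger
max_wal_size = 2GB                    # size-based trigger
checkpoint_completion_target = 0.9    # spread checkpoint I/O across the interval
```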
Log-structured storage and LSM trees#
WAL principles extend beyond crash recovery. Log-structured merge trees (LSM trees) use a write-ahead approach as the primary storage model:
- Writes go to an in-memory buffer (memtable)
- The memtable is backed by a WAL for durability
- When the memtable fills, it is flushed to disk as a sorted SSTable
- Background compaction merges SSTables to reclaim space
Databases like RocksDB, LevelDB, Cassandra, and HBase all use this pattern. The WAL here is not just for recovery — it is the write path.
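A toy version of this write path fits in a short class. This is a deliberately simplified sketch (no bloom filters, no real compaction, linear SSTable scans) meant only to show the WAL-then-memtable-then-flush flow:

```python
import os

class MiniLSM:
    """Toy LSM write path: WAL append -> memtable -> flush to sorted table."""

    def __init__(self, wal_path, flush_threshold=3):
        self.wal_path = wal_path
        self.memtable = {}
        self.sstables = []          # each flushed table is a sorted list
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Durability first: append to the WAL before touching the memtable
        with open(self.wal_path, "a") as f:
            f.write(f"{key}={value}\n")
            os.fsync(f.fileno())
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # The memtable becomes an immutable, sorted SSTable;
        # its WAL is no longer needed and can be recycled
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        open(self.wal_path, "w").close()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None
```

Reads check the memtable first, then SSTables from newest to oldest, which is why the latest write always wins.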
WAL in PostgreSQL#
PostgreSQL writes WAL records to segment files (16 MB each by default) in `pg_wal/`. Key details:
- Full-page writes: after a checkpoint, the first modification to a page writes the entire page image to WAL, preventing torn-page corruption
- WAL levels: `minimal`, `replica` (enables streaming replication), `logical` (enables logical decoding for CDC)
- pg_rewind: uses WAL to resynchronize a diverged standby without a full copy
- Archiving: `archive_command` ships completed WAL segments to backup storage for point-in-time recovery
WAL in MySQL (InnoDB)#
InnoDB calls its WAL the redo log. It works as a fixed-size circular buffer:
- Two or more redo log files (`ib_logfile0`, `ib_logfile1`)
- Writes wrap around; the oldest entries are overwritten after checkpointing
- `innodb_flush_log_at_trx_commit` controls durability: `1` = flush on every commit (safest), `2` = flush to OS cache on commit, `0` = flush roughly every second
InnoDB also has an undo log for MVCC and rollback, stored in the system tablespace.
WAL in SQLite#
SQLite offers two journal modes:
- Rollback journal — copies original pages before modification; on crash, the journal restores the original state
- WAL mode — appends changes to a separate WAL file; readers see the last checkpoint while writers append new changes
WAL mode in SQLite allows concurrent readers alongside a single writer without blocking, which is why most mobile and embedded applications enable it.
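You can switch a SQLite database into WAL mode with a single pragma. A minimal sketch using Python's built-in `sqlite3` module (the `demo.db` filename is arbitrary; WAL mode requires an on-disk database, not `:memory:`):

```python
import sqlite3

conn = sqlite3.connect("demo.db")
# Switch from the default rollback journal to WAL mode;
# the pragma returns the mode actually in effect
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal"

conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT OR REPLACE INTO kv VALUES ('a', '1')")
conn.commit()   # the change is appended to demo.db-wal, not the main file
conn.close()
```

After the commit, a `demo.db-wal` file sits next to the database; a later checkpoint folds its contents back into the main file.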
Log compaction#
In systems that use the log as the source of truth (Kafka, event-sourced systems), logs grow indefinitely. Compaction reclaims space:
- Segment deletion: remove segments older than a retention period
- Key-based compaction: keep only the latest value for each key (Kafka log compaction)
- Merge compaction: combine smaller segments into larger ones, discarding obsolete entries (LSM tree compaction)
Kafka's log compaction is particularly useful for maintaining a changelog: consumers that read from the beginning get the latest state for every key without replaying the full history.
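Key-based compaction reduces to "keep the last value seen per key". A small sketch, using `None` as a stand-in for a delete marker (mirroring Kafka's tombstones, though the data format here is invented for illustration):

```python
def compact(log):
    """Kafka-style key-based compaction: keep only the newest record per key.

    `log` is an ordered list of (key, value) pairs; a value of None acts
    as a tombstone that removes the key from the compacted log.
    """
    latest = {}
    for key, value in log:        # later entries overwrite earlier ones
        latest[key] = value
    # Drop tombstoned keys; what remains is a compacted, replayable log
    return [(k, v) for k, v in latest.items() if v is not None]
```

Replaying the compacted log yields the same final state as replaying the full history, which is exactly the property changelog consumers rely on.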
Change data capture from WAL#
The WAL is not just for recovery — it is a real-time stream of every change in the database. Change data capture (CDC) tools tap into this stream:
Debezium reads the transaction log of PostgreSQL (WAL), MySQL (binlog), MongoDB (oplog), and others to publish change events to Kafka:
- No polling, no triggers, no application-level changes
- Captures inserts, updates, and deletes with before/after values
- Maintains a consistent snapshot of the source database
Common CDC pipeline:
Database WAL → Debezium → Kafka → downstream consumers (search index, cache, analytics)
This pattern eliminates dual-write problems: instead of writing to both the database and a message queue, you write to the database and let CDC propagate the change.
WAL performance considerations#
- fsync frequency: flushing on every commit is safest but slowest. Batching commits (`commit_delay` in PostgreSQL) amortizes the cost
- Disk throughput: WAL writes are sequential, so even HDDs handle them well; SSDs and NVMe offer a smaller advantage here than they do for random I/O workloads
- WAL size: oversized WAL increases recovery time; undersized WAL causes too-frequent checkpoints
- Compression: some systems (RocksDB) compress WAL entries to reduce I/O
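The batching idea behind group commit can be sketched with a queue and a single flusher thread: many writers wait on one shared `fsync`. This is a simplified illustration of the technique, not a production design:

```python
import os
import queue
import threading

class GroupCommitLog:
    """Group commit sketch: many concurrent commits share one fsync."""

    def __init__(self, path):
        self.path = path
        self.q = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, record):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                         # returns once the batch is on disk

    def _flusher(self):
        with open(self.path, "a") as f:
            while True:
                batch = [self.q.get()]      # block for the first record
                while not self.q.empty():   # then drain whatever else arrived
                    batch.append(self.q.get_nowait())
                for record, _ in batch:
                    f.write(record + "\n")
                f.flush()
                os.fsync(f.fileno())        # one fsync covers the whole batch
                for _, done in batch:
                    done.set()
```

Each individual commit still waits for durability, but under load the cost of a single `fsync` is split across every commit that arrived in the meantime.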
When WAL is not enough#
WAL guarantees durability on a single node. For multi-node durability, you need:
- Synchronous replication — ship WAL to a standby before acknowledging commit
- Consensus protocols — Raft or Paxos to agree on log order across nodes
- Distributed WAL — systems like CockroachDB use a Raft-replicated WAL per range
Visualize WAL in your architecture#
On Codelit, generate a PostgreSQL replication setup or a Kafka CDC pipeline to see how WAL flows through the system. Click on the WAL component to explore checkpointing, replication, and recovery paths.
This is article #228 in the Codelit engineering blog series.
Build and explore database architectures visually at codelit.io.