Write-Ahead Log — How Databases Guarantee Durability and Crash Recovery
What is a write-ahead log?#
A write-ahead log (WAL) is a sequential, append-only file where a database records every change before applying it to the actual data files. If the system crashes mid-write, the database replays the log on startup to recover a consistent state.
The rule is simple: write the log first, then modify the data. No exceptions.
Why WAL exists#
Without WAL, a crash during a page write can leave data half-written — a torn page. The database would have no way to know which writes completed and which did not.
WAL solves this by guaranteeing that:
- Every modification is recorded in a durable log before it touches disk
- The log is sequential (fast on any storage medium)
- Recovery replays committed transactions and discards incomplete ones
How WAL works step by step#
- Transaction begins — the database assigns a log sequence number (LSN)
- Log record written — the change (old value, new value, page ID) is appended to the WAL
- WAL flushed to disk — `fsync()` ensures the record is durable
- Acknowledge commit — the client gets a success response
- Background flush — dirty pages are written to data files asynchronously
The key insight: the data files can lag behind. As long as the WAL is intact, every committed transaction can be reconstructed.
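The commit path above can be sketched in a few lines of Python. This is a minimal illustration of the ordering guarantee (log record, then `fsync`, then acknowledge), not any real database's record format:

```python
import os

class WriteAheadLog:
    """Minimal WAL sketch: append a record, fsync, then acknowledge."""

    def __init__(self, path):
        self.path = path
        self.lsn = 0  # log sequence number

    def commit(self, page_id, old_value, new_value):
        self.lsn += 1
        record = f"{self.lsn}|{page_id}|{old_value}|{new_value}\n"
        # 1. Append the log record (a cheap sequential write)
        with open(self.path, "a") as f:
            f.write(record)
            f.flush()
            os.fsync(f.fileno())  # 2. Force the record to stable storage
        # 3. Only now is it safe to acknowledge the commit; the data page
        #    itself can be flushed later, in the background.
        return self.lsn
```

Note that the data page is never touched here: durability comes entirely from the fsynced log record.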
Crash recovery#
On startup after a crash, the database:
- Reads the WAL from the last checkpoint
- Redo phase — replays all committed transactions whose data pages may not have been flushed
- Undo phase — rolls back any transactions that were in progress but never committed
- Resumes normal operation
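The redo and undo phases can be sketched over an in-memory list of log records. The record format here (`op`, `txid`, `old`, `new`) is purely illustrative, not any real on-disk layout:

```python
def recover(records):
    """ARIES-style recovery sketch over an ordered list of WAL records.

    Each record is a dict such as {"op": "write", "txid": 1, "page": "A",
    "old": None, "new": "x"} or {"op": "commit", "txid": 1}.
    Returns the reconstructed page state.
    """
    committed = {r["txid"] for r in records if r["op"] == "commit"}
    pages = {}
    # Redo phase: replay every logged write in log order, committed or not
    for r in records:
        if r["op"] == "write":
            pages[r["page"]] = r["new"]
    # Undo phase: walk the log backwards, restoring the before-image of
    # every write made by a transaction that never committed
    for r in reversed(records):
        if r["op"] == "write" and r["txid"] not in committed:
            pages[r["page"]] = r["old"]
    return pages
```

For example, if transaction 1 commits a write of `"x"` to page A and transaction 2 then overwrites it with `"y"` but crashes before committing, recovery restores A to `"x"`.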
This redo/undo structure follows ARIES (Algorithms for Recovery and Isolation Exploiting Semantics), the classic recovery protocol that most relational databases implement or adapt.
Checkpointing#
If the database replayed the entire WAL on every restart, recovery would take forever. Checkpoints solve this.
A checkpoint:
- Flushes all dirty pages to disk
- Records the current LSN as the checkpoint position
- Allows the database to truncate or recycle WAL segments before that point
Checkpoint strategies:
| Strategy | How it works | Trade-off |
|---|---|---|
| Time-based | Checkpoint every N minutes | Simple but unpredictable I/O |
| Size-based | Checkpoint when WAL reaches N MB | Bounds recovery time |
| Fuzzy checkpoint | Flush pages incrementally | Less I/O spike, more complex |
PostgreSQL uses a combination of size-based and time-based checkpointing, controlled by `checkpoint_timeout` and `max_wal_size`.
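As a sketch, the relevant `postgresql.conf` settings look like this (the values shown are illustrative, not recommendations; tune them for your workload):

```ini
# postgresql.conf — illustrative values
checkpoint_timeout = 15min            # time-based trigger
max_wal_size = 2GB                    # size-based trigger
checkpoint_completion_target = 0.9    # spread checkpoint I/O across the interval
```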
Log-structured storage and LSM trees#
WAL principles extend beyond crash recovery. Log-structured merge trees (LSM trees) use a write-ahead approach as the primary storage model:
- Writes go to an in-memory buffer (memtable)
- The memtable is backed by a WAL for durability
- When the memtable fills, it is flushed to disk as a sorted SSTable
- Background compaction merges SSTables to reclaim space
Databases like RocksDB, LevelDB, Cassandra, and HBase all use this pattern. The WAL here is not just for recovery — it is the write path.
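A toy version of this write path fits in a short class. This is a deliberately simplified sketch (no bloom filters, no real compaction, linear SSTable scans) meant only to show the WAL-then-memtable-then-flush flow:

```python
import os

class MiniLSM:
    """Toy LSM write path: WAL append -> memtable -> flush to sorted table."""

    def __init__(self, wal_path, flush_threshold=3):
        self.wal_path = wal_path
        self.memtable = {}
        self.sstables = []          # each flushed table is a sorted list
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        # Durability first: append to the WAL before touching the memtable
        with open(self.wal_path, "a") as f:
            f.write(f"{key}={value}\n")
            os.fsync(f.fileno())
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # The memtable becomes an immutable, sorted SSTable;
        # its WAL is no longer needed and can be recycled
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable.clear()
        open(self.wal_path, "w").close()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):   # newest SSTable first
            for k, v in table:
                if k == key:
                    return v
        return None
```

Reads check the memtable first, then SSTables from newest to oldest, which is why the latest write always wins.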
WAL in PostgreSQL#
PostgreSQL writes WAL records to segment files (16 MB each by default) in `pg_wal/`. Key details:
- Full-page writes: after a checkpoint, the first modification to a page writes the entire page image to WAL, preventing torn-page corruption
- WAL levels: `minimal`, `replica` (enables streaming replication), `logical` (enables logical decoding for CDC)
- pg_rewind: uses WAL to resynchronize a diverged standby without a full copy
- Archiving: `archive_command` ships completed WAL segments to backup storage for point-in-time recovery
WAL in MySQL (InnoDB)#
InnoDB calls its WAL the redo log. It works as a fixed-size circular buffer:
- Two or more redo log files (`ib_logfile0`, `ib_logfile1`)
- Writes wrap around; the oldest entries are overwritten after checkpointing
- `innodb_flush_log_at_trx_commit` controls durability: `1` = flush on every commit (safest), `2` = flush to OS cache on commit, `0` = flush roughly every second
InnoDB also has an undo log for MVCC and rollback, stored in the system tablespace.
WAL in SQLite#
SQLite offers two journal modes:
- Rollback journal — copies original pages before modification; on crash, the journal restores the original state
- WAL mode — appends changes to a separate WAL file; readers see the last checkpoint while writers append new changes
WAL mode in SQLite allows concurrent readers alongside a single writer without blocking, which is why most mobile and embedded applications enable it.
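You can switch a SQLite database into WAL mode with a single pragma. A minimal sketch using Python's built-in `sqlite3` module (the `demo.db` filename is arbitrary; WAL mode requires an on-disk database, not `:memory:`):

```python
import sqlite3

conn = sqlite3.connect("demo.db")
# Switch from the default rollback journal to WAL mode;
# the pragma returns the mode actually in effect
mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
print(mode)  # "wal"

conn.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
conn.execute("INSERT OR REPLACE INTO kv VALUES ('a', '1')")
conn.commit()   # the change is appended to demo.db-wal, not the main file
conn.close()
```

After the commit, a `demo.db-wal` file sits next to the database; a later checkpoint folds its contents back into the main file.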
Log compaction#
In systems that use the log as the source of truth (Kafka, event-sourced systems), logs grow indefinitely. Compaction reclaims space:
- Segment deletion: remove segments older than a retention period
- Key-based compaction: keep only the latest value for each key (Kafka log compaction)
- Merge compaction: combine smaller segments into larger ones, discarding obsolete entries (LSM tree compaction)
Kafka's log compaction is particularly useful for maintaining a changelog: consumers that read from the beginning get the latest state for every key without replaying the full history.
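Key-based compaction reduces to "keep the last value seen per key". A small sketch, using `None` as a stand-in for a delete marker (mirroring Kafka's tombstones, though the data format here is invented for illustration):

```python
def compact(log):
    """Kafka-style key-based compaction: keep only the newest record per key.

    `log` is an ordered list of (key, value) pairs; a value of None acts
    as a tombstone that removes the key from the compacted log.
    """
    latest = {}
    for key, value in log:        # later entries overwrite earlier ones
        latest[key] = value
    # Drop tombstoned keys; what remains is a compacted, replayable log
    return [(k, v) for k, v in latest.items() if v is not None]
```

Replaying the compacted log yields the same final state as replaying the full history, which is exactly the property changelog consumers rely on.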
Change data capture from WAL#
The WAL is not just for recovery — it is a real-time stream of every change in the database. Change data capture (CDC) tools tap into this stream:
Debezium reads the transaction log of PostgreSQL (WAL), MySQL (binlog), MongoDB (oplog), and others to publish change events to Kafka:
- No polling, no triggers, no application-level changes
- Captures inserts, updates, and deletes with before/after values
- Maintains a consistent snapshot of the source database
Common CDC pipeline:
Database WAL → Debezium → Kafka → downstream consumers (search index, cache, analytics)
This pattern eliminates dual-write problems: instead of writing to both the database and a message queue, you write to the database and let CDC propagate the change.
WAL performance considerations#
- fsync frequency: flushing on every commit is safest but slowest. Batching commits (`commit_delay` in PostgreSQL) amortizes the cost
- Disk throughput: WAL writes are sequential, so even HDDs handle them well; SSDs and NVMe offer a smaller advantage here than they do for random I/O workloads
- WAL size: oversized WAL increases recovery time; undersized WAL causes too-frequent checkpoints
- Compression: some systems (RocksDB) compress WAL entries to reduce I/O
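The batching idea behind group commit can be sketched with a queue and a single flusher thread: many writers wait on one shared `fsync`. This is a simplified illustration of the technique, not a production design:

```python
import os
import queue
import threading

class GroupCommitLog:
    """Group commit sketch: many concurrent commits share one fsync."""

    def __init__(self, path):
        self.path = path
        self.q = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, record):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                         # returns once the batch is on disk

    def _flusher(self):
        with open(self.path, "a") as f:
            while True:
                batch = [self.q.get()]      # block for the first record
                while not self.q.empty():   # then drain whatever else arrived
                    batch.append(self.q.get_nowait())
                for record, _ in batch:
                    f.write(record + "\n")
                f.flush()
                os.fsync(f.fileno())        # one fsync covers the whole batch
                for _, done in batch:
                    done.set()
```

Each individual commit still waits for durability, but under load the cost of a single `fsync` is split across every commit that arrived in the meantime.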
When WAL is not enough#
WAL guarantees durability on a single node. For multi-node durability, you need:
- Synchronous replication — ship WAL to a standby before acknowledging commit
- Consensus protocols — Raft or Paxos to agree on log order across nodes
- Distributed WAL — systems like CockroachDB use a Raft-replicated WAL per range
Visualize WAL in your architecture#
On Codelit, generate a PostgreSQL replication setup or a Kafka CDC pipeline to see how WAL flows through the system. Click on the WAL component to explore checkpointing, replication, and recovery paths.
This is article #228 in the Codelit engineering blog series.
Build and explore database architectures visually at codelit.io.