Write-Heavy System Design: Patterns for High-Throughput Ingestion
Some systems are defined by their write volume. Logging pipelines, IoT telemetry, financial transaction ledgers, analytics collectors, and chat platforms all share a common challenge: they must ingest massive volumes of data without dropping writes or degrading latency. This guide covers the core patterns that make write-heavy systems work at scale.
The Write Bottleneck
Traditional relational databases use B-tree indexes. Every write must update the tree in place, which involves random I/O — seeking to the correct page, modifying it, and flushing to disk. At high write rates, random I/O becomes the bottleneck. The strategies below all share a common theme: convert random writes into sequential writes.
Write-Ahead Log (WAL)
The write-ahead log is the foundation of durable writes in nearly every database. Before modifying data pages, the database appends the change to a sequential log on disk.
Client write → Append to WAL (sequential) → Acknowledge client
│
Background: apply to data pages
Why it matters:
- Sequential writes are 10-100x faster than random writes on both SSDs and spinning disks.
- The WAL guarantees durability — if the process crashes, it replays the log on startup.
- PostgreSQL, MySQL, SQLite, and virtually every other major database use a WAL internally.
In write-heavy systems, you can tune WAL behavior: group commits batch multiple transactions into a single fsync, reducing the per-write cost of durability.
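The append-then-acknowledge flow can be sketched in a few lines of Python. This is a toy illustration, not a real database WAL: the record format, class name, and replay logic are invented for the example, but the durability point (fsync before acknowledging) is the part that matters.

```python
import os

class WriteAheadLog:
    """Minimal append-only WAL sketch: append, fsync, then acknowledge."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "ab")

    def append(self, record: bytes) -> None:
        self.f.write(record + b"\n")   # sequential append, no seeking
        self.f.flush()
        os.fsync(self.f.fileno())      # durability point: safe to ack the client

    def replay(self):
        """On restart, re-read the log to recover unapplied changes."""
        with open(self.path, "rb") as f:
            return [line.rstrip(b"\n") for line in f]
```

Group commit would batch several `append` calls behind a single `fsync`, trading a few milliseconds of latency for far fewer disk syncs.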
LSM Trees (Log-Structured Merge Trees)
LSM trees take the WAL concept further and build the entire storage engine around sequential writes.
Write → MemTable (in-memory sorted buffer)
│
▼ flush when full
SSTable (immutable sorted file on disk)
│
▼ compaction merges SSTables
Larger SSTable (sorted, deduplicated)
How it works:
- Writes go to an in-memory buffer (memtable). This is extremely fast.
- When the memtable fills, it flushes to disk as an immutable sorted file (SSTable).
- Background compaction merges SSTables, removing duplicates and tombstones.
Databases using LSM trees: Cassandra, RocksDB, LevelDB, HBase, ScyllaDB.
Trade-offs:
- Writes are very fast — always sequential, always append-only.
- Reads may be slower because data can be spread across multiple SSTables (mitigated by bloom filters and block indexes).
- Compaction consumes CPU and I/O in the background — tuning compaction strategy (size-tiered vs. leveled) is critical.
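The memtable-and-SSTable flow above can be sketched as a toy Python class. This is a deliberately simplified model (in-memory lists stand in for on-disk files, and there is no compaction or bloom filter), but it shows the core read and write paths: writes hit the memtable, flushes produce immutable sorted segments, and reads check the newest data first.

```python
class TinyLSM:
    """Toy LSM sketch: in-memory memtable, flushed to immutable sorted
    segments ("SSTables"); reads check newest data first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.sstables = []            # list of sorted, immutable segments
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # fast in-memory write
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # A real engine writes this segment to disk as one sequential file.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # newest data wins
            return self.memtable[key]
        for segment in reversed(self.sstables):
            for k, v in segment:
                if k == key:
                    return v
        return None
```

Note how `get` may touch several segments: this is exactly the read amplification that bloom filters and compaction exist to mitigate.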
Batch Writes
Individual writes carry per-operation overhead: network round trips, transaction logging, index updates. Batching amortizes this cost across many records.
-- Instead of 1000 individual INSERTs:
INSERT INTO events (id, data) VALUES (1, '...');
INSERT INTO events (id, data) VALUES (2, '...');
...
-- Use a single batch:
INSERT INTO events (id, data) VALUES
  (1, '...'), (2, '...'), ..., (1000, '...');
Batch strategies:
- Size-based — Flush when the buffer reaches N records (e.g., 1000).
- Time-based — Flush every T milliseconds (e.g., 100ms).
- Hybrid — Flush on whichever threshold is hit first.
Kafka producers, database bulk loaders, and analytics SDKs all use this pattern. The trade-off is a small increase in latency (buffering delay) for a large increase in throughput.
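A hybrid size-or-time batcher can be sketched in Python. The `flush_fn` callback is a placeholder for whatever actually performs the bulk write (a multi-row INSERT, a Kafka produce call, etc.); the thresholds mirror the examples above.

```python
import time

class HybridBatcher:
    """Flush when the buffer reaches max_size records OR max_ms elapse,
    whichever comes first. flush_fn is a stand-in for the real bulk write."""

    def __init__(self, flush_fn, max_size=1000, max_ms=100):
        self.flush_fn = flush_fn
        self.max_size = max_size
        self.max_s = max_ms / 1000.0
        self.buffer = []
        self.deadline = time.monotonic() + self.max_s

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.max_size or time.monotonic() >= self.deadline:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)   # e.g. one multi-row INSERT
        self.buffer = []
        self.deadline = time.monotonic() + self.max_s
```

A production version would run the time-based flush from a background timer rather than checking the deadline only on `add`, so a quiet buffer still drains.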
Async Processing
Not every write needs to be processed synchronously. Async processing decouples the ingestion path from the processing path.
Client → API Server → Message Queue → Workers → Database
│
└─ Acknowledge immediately
The API server accepts the write, pushes it to a queue (Kafka, SQS, RabbitMQ), and acknowledges the client. Workers consume from the queue at their own pace. This provides:
- Back-pressure handling — The queue absorbs traffic spikes.
- Retry semantics — Failed writes are retried automatically.
- Decoupled scaling — Ingestion and processing scale independently.
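The accept-enqueue-acknowledge pattern can be shown with Python's standard-library queue standing in for Kafka or SQS. The handler and worker names are invented for the sketch; the key ideas are the bounded queue (back-pressure) and the worker draining at its own pace.

```python
import queue
import threading

q = queue.Queue(maxsize=10_000)     # bounded: full queue = back-pressure

def api_handler(event):
    """Accept the write, enqueue it, and acknowledge immediately."""
    q.put(event)                    # blocks only when the queue is full
    return "202 Accepted"

def worker(results):
    """Consume at its own pace; a real worker would write to the database."""
    while True:
        event = q.get()
        if event is None:           # sentinel: shut down
            break
        results.append(event)       # stand-in for the database write

results = []
t = threading.Thread(target=worker, args=(results,))
t.start()
for i in range(5):
    api_handler({"id": i})
q.put(None)
t.join()
```

With a real broker, the queue also survives process restarts and gives you redelivery on worker failure, which an in-process `queue.Queue` does not.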
Write Buffering
Write buffering extends the batching concept by holding writes in a fast intermediate store before flushing to the final destination.
Writes → Redis (buffer) → Periodic flush → PostgreSQL
Example: a rate-limiting service increments counters in Redis (sub-millisecond writes) and flushes aggregated counts to PostgreSQL every 10 seconds. The database sees 1 write per 10-second window instead of thousands of individual increments.
This pattern is common in analytics (buffer page views), gaming (buffer score updates), and social platforms (buffer like counts).
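The counter-buffering example can be sketched with an in-memory `Counter` standing in for Redis, and a `flush_fn` callback standing in for the periodic SQL upsert. Both names are invented for the illustration.

```python
from collections import Counter

class CounterBuffer:
    """Buffer high-frequency increments in memory and periodically flush
    aggregated totals to a slower store (flush_fn stands in for the DB write)."""

    def __init__(self, flush_fn):
        self.counts = Counter()
        self.flush_fn = flush_fn

    def incr(self, key, n=1):
        self.counts[key] += n       # fast in-memory write

    def flush(self):
        """Call periodically (e.g. every 10 seconds) from a timer."""
        if self.counts:
            self.flush_fn(dict(self.counts))   # one aggregated write per window
            self.counts.clear()
```

Thousands of `incr` calls collapse into a single flush per window, which is exactly the reduction the Redis-to-PostgreSQL example describes.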
Sharding for Writes
When a single node cannot handle the write volume, you distribute writes across multiple nodes by sharding — partitioning data by a shard key.
┌──────────┐ hash(user_id) % 4
│ Client │─────────────────────────┐
└──────────┘ │
┌──────────┬──────────┬────────┴───────┐
▼ ▼ ▼ ▼
Shard 0 Shard 1 Shard 2 Shard 3
Shard key selection matters:
- High cardinality — The key should have many distinct values to distribute evenly.
- Write distribution — Avoid hot shards. A timestamp-based key sends all recent writes to one shard. Prefer user ID, device ID, or a composite key.
- Query patterns — Choose a key that keeps related data together for efficient reads.
Cassandra and DynamoDB are designed around this model. For relational databases, tools like Vitess (MySQL) and Citus (PostgreSQL) add sharding.
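The `hash(user_id) % 4` routing from the diagram can be made concrete. One caveat worth encoding: Python's built-in `hash()` is randomized per process, so a stable hash (MD5 here, used only for distribution, not security) is needed to keep routing consistent across servers.

```python
import hashlib

def shard_for(user_id: str, num_shards: int = 4) -> int:
    """Route a write to a shard by hashing the shard key.
    A stable hash keeps routing consistent across processes and restarts."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Note that plain modulo hashing reshuffles most keys when `num_shards` changes; systems like Cassandra and DynamoDB use consistent hashing to limit that movement.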
Event Sourcing
Event sourcing stores every state change as an immutable event rather than overwriting current state.
Traditional: UPDATE account SET balance = 150 WHERE id = 42;
Event sourced:
Event 1: AccountCreated { id: 42, balance: 0 }
Event 2: Deposited { id: 42, amount: 200 }
Event 3: Withdrawn { id: 42, amount: 50 }
Current state: replay events → balance = 150
Benefits for write-heavy systems:
- Append-only — No updates, no locks, no contention. Every write is a sequential append.
- Complete audit trail — Every change is preserved. Critical for financial and compliance systems.
- Temporal queries — You can reconstruct state at any point in time.
- Event-driven architecture — Events naturally feed downstream consumers (notifications, analytics, search indexing).
Challenges: The event log grows indefinitely. Snapshots (periodic materialized state) prevent replaying the entire history on every read.
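The replay step above is just a fold over the event log. A minimal sketch, using dicts with the same event shapes as the example (the field names are illustrative):

```python
def replay(events):
    """Fold the event log into current state. A snapshot would simply seed
    `balance` from a saved point instead of replaying from zero."""
    balance = 0
    for event in events:
        if event["type"] == "AccountCreated":
            balance = event["balance"]
        elif event["type"] == "Deposited":
            balance += event["amount"]
        elif event["type"] == "Withdrawn":
            balance -= event["amount"]
    return balance

log = [
    {"type": "AccountCreated", "id": 42, "balance": 0},
    {"type": "Deposited", "id": 42, "amount": 200},
    {"type": "Withdrawn", "id": 42, "amount": 50},
]
print(replay(log))  # 150
```

Replaying a prefix of the log gives you the account's state at any past point in time, which is where the temporal-query benefit comes from.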
Append-Only Storage
Append-only storage takes the event sourcing idea to the storage engine level. Data is never updated in place — new versions are appended, and old versions are garbage-collected or archived.
Systems built on append-only storage:
- Kafka — Topics are append-only logs. Consumers track their offset.
- Cassandra — Uses LSM trees internally; writes are always appends.
- Immutable databases — Datomic, XTDB store all facts with timestamps, never deleting.
The append-only model eliminates write contention entirely. Multiple writers can append concurrently without locking.
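A Kafka-style append-only log with consumer offsets can be sketched as follows. This is a toy in-memory model: real brokers persist segments to disk and handle many partitions, and the small lock here serializes only the tail append, not readers.

```python
import threading

class AppendOnlyLog:
    """Sketch of a Kafka-style log: records are only ever appended,
    and each consumer tracks its own read offset."""

    def __init__(self):
        self.records = []
        self.lock = threading.Lock()   # serializes only the tail append

    def append(self, record) -> int:
        with self.lock:
            self.records.append(record)
            return len(self.records) - 1   # the record's offset

    def read_from(self, offset):
        """Consumers poll from their own offset; reads never block writers."""
        return self.records[offset:]
```

Because records are immutable once written, consumers at different offsets never conflict with each other or with producers.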
Time-Series Writes
Time-series data (metrics, logs, sensor readings) has unique write patterns: high volume, timestamped, and rarely updated after ingestion.
Optimizations specific to time-series:
- Time-based partitioning — Partition tables by hour or day. Recent writes always go to the current partition, and old partitions can be compressed or archived.
- Column-oriented storage — TimescaleDB, ClickHouse, and InfluxDB store columns contiguously, enabling high compression ratios (10-20x) on repetitive metric data.
- Out-of-order handling — Late-arriving data is buffered and merged during compaction rather than triggering random I/O.
- Downsampling — Aggregate old data into lower-resolution rollups (1-minute averages become 1-hour averages) to bound storage growth.
Raw data: 1-second resolution → 86,400 points/day
Rollup 1: 1-minute resolution → 1,440 points/day
Rollup 2: 1-hour resolution → 24 points/day
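The rollup step can be sketched as a small function that buckets `(timestamp, value)` points and averages each bucket; real time-series databases do this incrementally, but the arithmetic is the same.

```python
def downsample(points, bucket_seconds=60):
    """Roll up (unix_timestamp, value) points into per-bucket averages,
    e.g. 1-second samples into 1-minute averages."""
    buckets = {}
    for ts, value in points:
        bucket = ts - (ts % bucket_seconds)        # start of the bucket
        buckets.setdefault(bucket, []).append(value)
    return [(b, sum(vs) / len(vs)) for b, vs in sorted(buckets.items())]
```

Applying this with `bucket_seconds=60` and then `3600` reproduces the 86,400 → 1,440 → 24 points/day reduction shown above.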
Putting It All Together
A production write-heavy pipeline typically combines several patterns:
Producers → Message Queue (Kafka)
│
▼
Stream Processor (batch + async)
│
▼
Write Buffer (in-memory)
│
▼ flush
Sharded Storage (LSM-based: Cassandra / ScyllaDB)
│
▼
Cold Storage (S3 / HDFS for archival)
Each layer absorbs complexity: the queue handles back-pressure, batching amortizes overhead, and LSM-based storage converts everything into sequential I/O.
Key Takeaways
- Write-ahead logs convert random I/O into sequential appends for durability.
- LSM trees build entire storage engines around sequential write patterns.
- Batch writes amortize per-operation overhead across many records.
- Async processing decouples ingestion from processing and absorbs spikes.
- Write buffering aggregates high-frequency writes before flushing to slower stores.
- Sharding distributes write volume horizontally when a single node is not enough.
- Event sourcing and append-only storage eliminate update contention entirely.
- Time-series optimizations — partitioning, columnar storage, and downsampling — tame the highest write volumes.