# Change Data Capture (CDC): Stream Database Changes in Real Time
Every database mutation tells a story. Change Data Capture turns those mutations into a real-time event stream — enabling downstream systems to react to changes the instant they happen.
## Why CDC Matters
Traditional data integration relies on batch ETL jobs that run hourly or nightly. CDC flips this model:

```
Batch ETL:  Source DB → (wait hours)   → ETL job → Target
CDC:        Source DB → (milliseconds) → Stream  → Target
```
Use cases that demand CDC:
- Real-time analytics — dashboards that reflect the last second, not last hour
- Cache invalidation — update Redis the moment a row changes
- Search index sync — keep Elasticsearch in lockstep with your database
- Microservice data propagation — share state without coupling services
- Audit logs — capture every mutation for compliance
## CDC Patterns

### 1. Log-Based CDC
Databases already record every change in a write-ahead log (WAL) or binary log. Log-based CDC taps into this stream directly.
```
PostgreSQL WAL / MySQL binlog / MongoDB oplog
                    │
                    ▼
          CDC Connector (Debezium)
                    │
                    ▼
           Kafka / Event Stream
                    │
                    ▼
    Consumers (analytics, cache, search)
```
Advantages:
- Minimal impact on source database performance (changes are read from the log, not from the tables)
- Captures every change (no missed updates between polls)
- Preserves operation type (INSERT, UPDATE, DELETE)
- Includes before-and-after row images (in PostgreSQL, full before-images require `REPLICA IDENTITY FULL`)
Debezium example configuration (the connector also needs credentials; `database.password` is shown with a placeholder):

```json
{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.com",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "<secret>",
    "database.dbname": "inventory",
    "table.include.list": "public.orders,public.products",
    "topic.prefix": "inventory",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot"
  }
}
```
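The events this connector emits carry the operation type plus before/after row images. The shape below is a simplified sketch of Debezium's change-event envelope (field values are illustrative; `op` is `c` for create, `u` for update, `d` for delete, `r` for snapshot read):

```json
{
  "before": { "id": 1001, "status": "PENDING" },
  "after":  { "id": 1001, "status": "SHIPPED" },
  "source": { "table": "orders", "lsn": 24023128 },
  "op": "u",
  "ts_ms": 1700000000123
}
```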
### 2. Trigger-Based CDC
Database triggers fire on INSERT, UPDATE, or DELETE and write change records to a shadow table.
```sql
CREATE TRIGGER orders_cdc_trigger
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
EXECUTE FUNCTION capture_change();
```
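The `capture_change()` function is not shown above; a minimal PL/pgSQL sketch, assuming a shadow table named `orders_audit` (both names are placeholders), might look like:

```sql
CREATE TABLE orders_audit (
  id         BIGSERIAL PRIMARY KEY,
  op         TEXT NOT NULL,                     -- 'INSERT' | 'UPDATE' | 'DELETE'
  changed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  row_data   JSONB NOT NULL                     -- new row (or old row on delete)
);

CREATE FUNCTION capture_change() RETURNS trigger AS $$
BEGIN
  -- TG_OP is the firing operation; NEW is null on DELETE, OLD on INSERT
  INSERT INTO orders_audit (op, row_data)
  VALUES (TG_OP, to_jsonb(COALESCE(NEW, OLD)));
  RETURN COALESCE(NEW, OLD);
END;
$$ LANGUAGE plpgsql;
```

A separate process then drains `orders_audit` into the event stream.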
Trade-offs:
- Works on any database that supports triggers
- Adds write overhead to every mutation
- Requires schema changes (shadow tables)
- Can miss changes from operations that bypass row-level triggers (e.g. TRUNCATE) and from schema migrations
### 3. Polling (Timestamp-Based)
A process periodically queries for rows with updated_at greater than the last checkpoint.
```sql
SELECT * FROM orders
WHERE updated_at > :last_checkpoint
ORDER BY updated_at ASC
LIMIT 1000;
```
Trade-offs:
- Simple to implement — no special database permissions needed
- Misses deletes entirely (no row to query)
- Collapses rapid successive updates to the same row (intermediate states are lost)
- Polling interval creates inherent latency
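The loop itself is easy; the subtle part is advancing the checkpoint correctly. A minimal JavaScript sketch, where `fetchPage` is an assumed data-access function running the query above and the row shape is hypothetical:

```javascript
// Advance the checkpoint to the newest updated_at seen in this page.
// An empty page leaves the checkpoint unchanged.
function advanceCheckpoint(rows, lastCheckpoint) {
  return rows.reduce(
    (max, row) => (row.updatedAt > max ? row.updatedAt : max),
    lastCheckpoint
  );
}

// One polling cycle: fetch a page of changed rows, hand them downstream,
// return the new checkpoint to persist for the next cycle.
async function pollOnce(fetchPage, lastCheckpoint) {
  const rows = await fetchPage(lastCheckpoint);
  for (const row of rows) {
    // hand each row to downstream consumers here
  }
  return advanceCheckpoint(rows, lastCheckpoint);
}
```

Persist the checkpoint durably between cycles; restarting from an older checkpoint re-emits rows, so downstream consumers must tolerate duplicates.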
### 4. Comparison of Patterns

| Pattern       | Latency      | DB Load | Delete-Aware |
|---------------|--------------|---------|--------------|
| Log-based     | Milliseconds | Minimal | Yes          |
| Trigger-based | Seconds      | Medium  | Yes          |
| Polling       | Minutes      | High    | No           |
## The Dual-Write Problem
The most dangerous anti-pattern in distributed systems: writing to two systems and assuming both succeed.
```javascript
// DANGEROUS — dual write
await database.save(order);
await kafka.publish("order.created", order);
// What if the Kafka publish fails? The two systems are now inconsistent.
```
CDC eliminates dual writes by making the database the single source of truth. Downstream systems read from the CDC stream — you only write to one place.
```javascript
// SAFE — single write + CDC
await database.save(order);
// Debezium captures the INSERT from the WAL
// Kafka consumers receive the event automatically
```
## Outbox Pattern
When you need guaranteed event publishing alongside a database write, use the transactional outbox:
```sql
BEGIN TRANSACTION;
INSERT INTO orders (id, ...) VALUES (...);
INSERT INTO outbox (aggregate_id, event_type, payload)
VALUES (order_id, 'OrderCreated', '{"..."}');
COMMIT;
```
Debezium reads the outbox table via CDC and publishes events to Kafka. Both the order and the event are committed atomically.
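Debezium ships an outbox event router transform (SMT) for exactly this setup. A sketch of the relevant connector settings, assuming the outbox table also carries an `aggregate_type` column that the router uses to pick the destination topic (column names here must match your schema):

```json
{
  "transforms": "outbox",
  "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
  "transforms.outbox.table.field.event.key": "aggregate_id",
  "transforms.outbox.table.field.event.payload": "payload",
  "transforms.outbox.route.by.field": "aggregate_type"
}
```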
## Event Sourcing via CDC
CDC enables a pragmatic path to event sourcing without rewriting your application:
```
Traditional DB (CRUD)
          │
          ▼  CDC stream
Event Store / Kafka topic (append-only log)
          │
          ▼
Materialized views, projections, read models
```
This gives you event sourcing benefits — full audit trail, temporal queries, replay — while keeping your existing CRUD application.
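The read models on the last line are just folds over the change stream. A minimal in-memory sketch in JavaScript, assuming Debezium-style events with `op`, `before`, and `after` fields (the row shape is illustrative):

```javascript
// Fold a stream of CDC events into an in-memory read model keyed by row id.
// Event shape assumed: { op: "c"|"u"|"d"|"r", before, after }
function project(events) {
  const view = new Map();
  for (const ev of events) {
    if (ev.op === "d") {
      view.delete(ev.before.id);       // delete: drop the row
    } else {
      view.set(ev.after.id, ev.after); // create/update/snapshot-read: upsert
    }
  }
  return view;
}
```

Replaying the topic from offset 0 rebuilds the view from scratch, which is what makes temporal queries and projection changes cheap.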
## CDC Tools Comparison

### Debezium
- Open-source, Kafka Connect-based
- Supports PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Cassandra
- At-least-once delivery by default; exactly-once is possible for some connectors via Kafka Connect's transactional support
- Most mature and widely adopted
### Maxwell
- Lightweight MySQL-only CDC
- Reads MySQL binlog, outputs JSON to Kafka, Kinesis, or stdout
- Simpler setup than Debezium for MySQL-only environments
### DynamoDB Streams
- Native CDC for AWS DynamoDB
- 24-hour retention window
- Integrates with Lambda for serverless processing
- Guaranteed ordering per partition key
### Additional Tools
- AWS Database Migration Service (DMS) — managed CDC for AWS databases
- Google Datastream — serverless CDC for BigQuery and Cloud SQL
- Striim — enterprise CDC with built-in transformations
- Airbyte — open-source ELT with CDC connectors
## Real-Time Sync Architecture
A production CDC pipeline for keeping search and cache in sync:
```
PostgreSQL
    │
    ▼  Debezium (WAL reader)
    │
    ▼  Kafka (orders.cdc topic)
    │
    ├──▶ Elasticsearch Sink Connector → search index
    ├──▶ Redis Sink Connector        → cache layer
    ├──▶ Analytics Consumer          → data warehouse
    └──▶ Notification Service       → user alerts
```
### Key operational concerns
- Schema evolution — use a schema registry (Confluent or Apicurio) to manage Avro/Protobuf schemas as your tables evolve.
- Snapshotting — when a CDC connector first starts, it performs an initial snapshot of existing data before streaming changes.
- Exactly-once delivery — combine Kafka transactions with idempotent consumers to prevent duplicate processing.
- Monitoring — track replication lag, connector status, and consumer group offsets; alert when lag exceeds your SLA.
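Idempotent consumers are the part of exactly-once delivery you control. One common approach is to key every event by a stable ID and skip IDs already processed; a minimal in-memory JavaScript sketch (a production version would persist the seen-set transactionally alongside the side effect):

```javascript
// Wrap a handler so that re-delivered events (same id) are processed once.
function makeIdempotentConsumer(handle) {
  const seen = new Set(); // processed event IDs; persist this in production
  return function consume(event) {
    if (seen.has(event.id)) return false; // duplicate: skip
    handle(event);
    seen.add(event.id);
    return true; // processed
  };
}
```

This pairs naturally with at-least-once delivery from the CDC connector: duplicates arrive, but their effects are applied once.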
## When NOT to Use CDC
- Simple CRUD apps with a single database and no downstream consumers
- Batch analytics where hourly freshness is acceptable (cheaper to run scheduled queries)
- Databases without WAL access — some managed databases restrict replication slot access
## Quick Start Checklist
- Enable logical replication on your database (PostgreSQL: `wal_level = logical`)
- Create a dedicated CDC user with replication permissions
- Deploy Debezium via Kafka Connect or Debezium Server
- Configure table filters to capture only what you need
- Set up a schema registry for schema evolution
- Build idempotent consumers that handle duplicates gracefully
- Monitor replication lag and connector health
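On PostgreSQL, the first two checklist items look roughly like this (user name, password, and table names are placeholders; changing `wal_level` requires a restart):

```sql
-- Requires a server restart to take effect
ALTER SYSTEM SET wal_level = 'logical';

-- Dedicated CDC user with replication rights
CREATE ROLE cdc_user WITH LOGIN REPLICATION PASSWORD 'change-me';
GRANT SELECT ON public.orders, public.products TO cdc_user;
```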
Change Data Capture transforms your database into a real-time event source — solving the dual-write problem, enabling event-driven architectures, and keeping every downstream system in sync. Start with log-based CDC via Debezium and expand from there.