CDC Streaming — Real-Time Data Pipelines with Debezium & Kafka
What is Change Data Capture?
Change Data Capture (CDC) is the process of identifying and capturing changes made to data in a database, then delivering those changes in real time to downstream consumers. Instead of polling a table every N seconds, CDC reads the database transaction log and streams every INSERT, UPDATE, and DELETE as an event.
Why it matters: batch ETL jobs introduce hours of latency. CDC pipelines deliver changes in seconds.
Log-based CDC vs. other approaches
There are three common CDC strategies:
Trigger-based CDC — database triggers write changes to a shadow table. Simple but adds write overhead to every transaction and couples your schema to the capture logic.
Query-based CDC — a poller runs SELECT * WHERE updated_at > ? on an interval. Misses deletes, can miss rapid updates between polls, and hammers the database with queries.
Log-based CDC — reads the database write-ahead log (WAL in Postgres, binlog in MySQL, oplog in MongoDB). Zero impact on application queries, captures every change including deletes, and preserves transaction ordering.
Log-based CDC is the gold standard for production pipelines.
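The delete-blindness of query-based polling is easy to demonstrate with a toy sketch. Here a plain dict stands in for the table and a function simulates the `updated_at` poll (illustrative pseudocode in Python, not real SQL):

```python
# Toy illustration of why query-based CDC is lossy: a deleted row
# leaves nothing behind for the poller to find.

def poll(table, last_seen):
    """Simulates: SELECT * FROM orders WHERE updated_at > last_seen."""
    changes = [row for row in table.values() if row["updated_at"] > last_seen]
    new_checkpoint = max((r["updated_at"] for r in changes), default=last_seen)
    return changes, new_checkpoint

table = {
    1: {"id": 1, "status": "pending", "updated_at": 10},
    2: {"id": 2, "status": "pending", "updated_at": 11},
}

changes, checkpoint = poll(table, last_seen=0)
assert len(changes) == 2          # initial rows are picked up

del table[2]                      # a DELETE removes the row entirely
changes, checkpoint = poll(table, checkpoint)
assert changes == []              # the delete is silently lost
```

A log-based reader would instead see an explicit delete record in the WAL, which is exactly the gap Debezium closes.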
Debezium: the open-source CDC engine
Debezium is a distributed platform built on top of Kafka Connect that turns database logs into event streams. It supports:
- PostgreSQL — logical replication slots
- MySQL / MariaDB — binlog reading
- MongoDB — oplog tailing
- SQL Server — CDC tables
- Oracle — LogMiner or XStream
- Cassandra — commit log
Each connector runs as a Kafka Connect source connector, producing one Kafka topic per table.
Architecture of a CDC pipeline
A typical CDC pipeline has four layers:
Source database — the operational database where your application writes data. You configure a replication slot (Postgres) or enable binlog (MySQL).
Debezium connector — runs inside Kafka Connect. Reads the transaction log, converts each change into a structured event envelope, and publishes to Kafka.
Kafka topics — one topic per table (e.g., dbserver1.public.orders). Events are keyed by the row's primary key, so updates to the same row land in the same partition.
Consumers — downstream services, materialized views, search indexes, data warehouses, or analytics engines that subscribe to topics and process change events.
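The per-row ordering guarantee follows from deterministic key hashing. A simplified partitioner sketch (illustrative only: Kafka's default partitioner actually uses murmur2, but crc32 shows the same idea):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic hash of the record key -> partition index.
    # Kafka's default partitioner uses murmur2; crc32 is a stand-in.
    return zlib.crc32(key) % num_partitions

# Every change event for order id 42 hashes to the same partition,
# so a consumer sees that row's changes in order.
p1 = partition_for(b'{"id": 42}', 6)
p2 = partition_for(b'{"id": 42}', 6)
assert p1 == p2

# A different key may land on another partition: ordering holds
# per row, not across rows.
other = partition_for(b'{"id": 43}', 6)
```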
The Debezium event envelope
Every change event contains:
{
  "before": { "id": 42, "status": "pending" },
  "after": { "id": 42, "status": "shipped" },
  "source": {
    "connector": "postgresql",
    "db": "orders_db",
    "table": "orders",
    "lsn": 234881104,
    "txId": 98712
  },
  "op": "u",
  "ts_ms": 1711670400000
}
The op field indicates the operation: c (create), u (update), d (delete), or r (snapshot read). The before and after fields give you the full row state.
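A minimal consumer can reduce this envelope to a local materialized view. The sketch below (plain dicts standing in for a real store; the event shapes are trimmed to the fields used) handles each op code and uses the source LSN as a high-water mark so redelivered events are ignored:

```python
view = {}          # row state keyed by primary key
last_lsn = -1      # high-water mark for idempotent replay

def apply_change(event):
    """Apply one Debezium-style change event to the in-memory view."""
    global last_lsn
    lsn = event["source"]["lsn"]
    if lsn <= last_lsn:              # duplicate delivery: already applied
        return
    op = event["op"]
    if op in ("c", "u", "r"):        # create, update, snapshot read
        row = event["after"]
        view[row["id"]] = row
    elif op == "d":                  # delete: only 'before' carries the row
        view.pop(event["before"]["id"], None)
    last_lsn = lsn

apply_change({"op": "c", "before": None,
              "after": {"id": 42, "status": "pending"},
              "source": {"lsn": 100}})
apply_change({"op": "u", "before": {"id": 42, "status": "pending"},
              "after": {"id": 42, "status": "shipped"},
              "source": {"lsn": 101}})
apply_change({"op": "u", "before": {"id": 42, "status": "pending"},
              "after": {"id": 42, "status": "shipped"},
              "source": {"lsn": 101}})   # redelivery: ignored
assert view[42]["status"] == "shipped"
```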
Setting up Debezium with Kafka Connect
Step 1 — enable logical replication in Postgres:
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
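Note that `ALTER SYSTEM` only writes the setting; changing `wal_level` takes effect after a Postgres restart. Verify before creating the connector:

```sql
SHOW wal_level;              -- must report 'logical'
SHOW max_replication_slots;  -- must cover one slot per Debezium connector
```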
Step 2 — create a connector configuration:
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "orders_db",
    "topic.prefix": "dbserver1",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot"
  }
}
Step 3 — deploy via the Kafka Connect REST API:
curl -X POST http://connect:8083/connectors \
-H "Content-Type: application/json" \
-d @orders-connector.json
Real-time materialized views
One of the most powerful CDC patterns is maintaining materialized views in a different data store. Example: your orders live in Postgres, but you need a denormalized view in Elasticsearch for full-text search.
A Kafka Streams or Flink job consumes the CDC topics, joins orders with customers, and writes the denormalized document to Elasticsearch. The view updates within seconds of a change in Postgres.
This replaces fragile dual-write patterns where the application writes to both Postgres and Elasticsearch.
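Under the hood, such a job keeps per-topic state and re-emits the joined document whenever either side changes. A stripped-down sketch of the idea, with plain dicts standing in for Kafka Streams state stores and for the Elasticsearch sink (all names here are illustrative):

```python
orders, customers, search_index = {}, {}, {}

def emit(order_id):
    """Rebuild the denormalized document when either side changes."""
    order = orders.get(order_id)
    if order is None:
        search_index.pop(order_id, None)   # order deleted: drop the doc
        return
    customer = customers.get(order["customer_id"], {})
    search_index[order_id] = {**order, "customer_name": customer.get("name")}

def on_order_event(event):
    if event["op"] == "d":
        oid = event["before"]["id"]
        orders.pop(oid, None)
        emit(oid)
    else:
        row = event["after"]
        orders[row["id"]] = row
        emit(row["id"])

def on_customer_event(event):
    row = event["after"]
    customers[row["id"]] = row
    for oid, order in orders.items():      # re-join affected orders
        if order["customer_id"] == row["id"]:
            emit(oid)

on_customer_event({"op": "c", "after": {"id": 7, "name": "Ada"}})
on_order_event({"op": "c",
                "after": {"id": 1, "customer_id": 7, "status": "pending"}})
assert search_index[1]["customer_name"] == "Ada"
```

A real Kafka Streams or Flink job adds fault-tolerant state, repartitioning by join key, and batched writes to the sink, but the data flow is the same.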
Event-driven sync between databases
CDC enables event-driven data sync without application-level coordination:
- Postgres to data warehouse — stream changes to Snowflake, BigQuery, or Redshift via a Kafka sink connector
- Postgres to cache — invalidate or update Redis keys when rows change
- Cross-service sync — the Orders service writes to its database; the Shipping service reacts to order events without direct API coupling
- Audit logging — every change event is an immutable audit record
Handling schema evolution
Schemas change. Debezium integrates with the Confluent Schema Registry (or Apicurio) to manage Avro or Protobuf schemas. When a column is added, the schema evolves automatically. Consumers using schema-aware deserializers handle new fields gracefully.
Critical rule: never rename or drop a column without coordinating with downstream consumers. Use a two-phase migration — add the new column, migrate consumers, then drop the old column.
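A sketch of the two-phase rename in SQL (hypothetical column names):

```sql
-- Phase 1: add the new column and backfill; old and new coexist.
ALTER TABLE orders ADD COLUMN shipped_at timestamptz;
UPDATE orders SET shipped_at = ship_date;

-- Phase 2: run only after every consumer reads shipped_at.
ALTER TABLE orders DROP COLUMN ship_date;
```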
Exactly-once semantics
CDC pipelines can achieve exactly-once delivery with:
- Kafka transactions — Debezium 2.x supports exactly-once source connectors with Kafka transactions
- Idempotent consumers — consumers use the event's LSN or transaction ID to deduplicate
- Outbox pattern — write events to an outbox table inside the same transaction as your data change, then CDC picks up the outbox table
The outbox pattern is especially useful for microservices that need to publish domain events reliably.
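The core of the outbox pattern is a single transaction that covers both the state change and the event row; Debezium then captures only the outbox table. Table and column names below are illustrative (Debezium also ships an outbox event router transform that expects a similar layout):

```sql
BEGIN;

UPDATE orders SET status = 'shipped' WHERE id = 42;

INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('order', '42', 'OrderShipped',
        '{"order_id": 42, "status": "shipped"}'::jsonb);

COMMIT;
```

Because both writes commit or roll back together, the event stream can never drift from the database state, which is exactly what dual writes fail to guarantee.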
Monitoring and operations
Key metrics to monitor:
- Replication slot lag — if the slot falls behind, you risk WAL bloat in Postgres
- Consumer lag — Kafka consumer group lag per partition
- Connector status — Kafka Connect exposes connector health via REST API
- Throughput — events per second through each connector
Set alerts on replication slot lag exceeding a threshold (e.g., 100 MB). A stuck connector can cause Postgres to retain WAL indefinitely, filling your disk.
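Replication slot lag can be checked directly in Postgres (PostgreSQL 10+):

```sql
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

An inactive slot with steadily growing `retained_wal` is the signature of a stuck or deleted connector.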
Common pitfalls
WAL disk bloat — a paused or failed connector holds a replication slot open. Postgres retains WAL segments until the slot catches up. Monitor pg_replication_slots and drop orphaned slots.
Large transactions — a single transaction that updates 1 million rows produces 1 million events. This can spike memory in both the connector and consumers. Batch large migrations separately from CDC.
Snapshot on restart — Debezium snapshots the entire table on first start. For large tables, this can take hours and produce massive initial load. Use snapshot.mode=schema_only if you only need changes going forward.
Schema drift — if the source schema changes without the registry being updated, consumers may fail to deserialize events.
Alternatives to Debezium
- AWS DMS — managed CDC for AWS databases, simpler but less flexible
- Google Datastream — managed CDC to BigQuery and Cloud Storage
- Striim — commercial CDC platform with a visual pipeline builder
- Materialize — streams CDC directly into incrementally maintained SQL views
- Estuary Flow — real-time CDC with a declarative pipeline model
Visualize your CDC architecture
Map your CDC pipeline end-to-end — from source database through Kafka to consumers — with Codelit. Generate an interactive architecture diagram in seconds.
Key takeaways
- Log-based CDC reads the WAL/binlog for zero-impact, complete change capture
- Debezium + Kafka Connect is the open-source standard for CDC pipelines
- One topic per table, keyed by primary key, preserves ordering
- Materialized views in Elasticsearch, Redis, or warehouses replace fragile dual writes
- Monitor replication slot lag — a stuck slot can fill your Postgres disk
- The outbox pattern ensures reliable event publishing within transactions
- This is article #381 of our ongoing system design series