CDC Streaming — Real-Time Data Pipelines with Debezium & Kafka
What is Change Data Capture?
Change Data Capture (CDC) is the process of identifying and capturing changes made to data in a database, then delivering those changes in real time to downstream consumers. Instead of polling a table every N seconds, CDC reads the database transaction log and streams every INSERT, UPDATE, and DELETE as an event.
Why it matters: batch ETL jobs introduce hours of latency. CDC pipelines deliver changes in seconds.
Log-based CDC vs. other approaches
There are three common CDC strategies:
Trigger-based CDC — database triggers write changes to a shadow table. Simple but adds write overhead to every transaction and couples your schema to the capture logic.
Query-based CDC — a poller runs SELECT * WHERE updated_at > ? on an interval. Misses deletes, can miss rapid updates between polls, and hammers the database with queries.
Log-based CDC — reads the database write-ahead log (WAL in Postgres, binlog in MySQL, oplog in MongoDB). Zero impact on application queries, captures every change including deletes, and preserves transaction ordering.
Log-based CDC is the gold standard for production pipelines.
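The delete-blindness of query-based polling is easy to demonstrate with a toy sketch. Here a plain dict stands in for the table and a function simulates the `updated_at` poll (illustrative pseudocode in Python, not real SQL):

```python
# Toy illustration of why query-based CDC is lossy: a deleted row
# leaves nothing behind for the poller to find.

def poll(table, last_seen):
    """Simulates: SELECT * FROM orders WHERE updated_at > last_seen."""
    changes = [row for row in table.values() if row["updated_at"] > last_seen]
    new_checkpoint = max((r["updated_at"] for r in changes), default=last_seen)
    return changes, new_checkpoint

table = {
    1: {"id": 1, "status": "pending", "updated_at": 10},
    2: {"id": 2, "status": "pending", "updated_at": 11},
}

changes, checkpoint = poll(table, last_seen=0)
assert len(changes) == 2          # initial rows are picked up

del table[2]                      # a DELETE removes the row entirely
changes, checkpoint = poll(table, checkpoint)
assert changes == []              # the delete is silently lost
```

A log-based reader would instead see an explicit delete record in the WAL, which is exactly the gap Debezium closes.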
Debezium: the open-source CDC engine
Debezium is a distributed platform built on top of Kafka Connect that turns database logs into event streams. It supports:
- PostgreSQL — logical replication slots
- MySQL / MariaDB — binlog reading
- MongoDB — oplog tailing
- SQL Server — CDC tables
- Oracle — LogMiner or XStream
- Cassandra — commit log
Each connector runs as a Kafka Connect source connector, producing one Kafka topic per table.
Architecture of a CDC pipeline
A typical CDC pipeline has four layers:
Source database — the operational database where your application writes data. You configure a replication slot (Postgres) or enable binlog (MySQL).
Debezium connector — runs inside Kafka Connect. Reads the transaction log, converts each change into a structured event envelope, and publishes to Kafka.
Kafka topics — one topic per table (e.g., dbserver1.public.orders). Events are keyed by the row's primary key, so updates to the same row land in the same partition.
Consumers — downstream services, materialized views, search indexes, data warehouses, or analytics engines that subscribe to topics and process change events.
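The per-row ordering guarantee follows from deterministic key hashing. A simplified partitioner sketch (illustrative only: Kafka's default partitioner actually uses murmur2, but crc32 shows the same idea):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic hash of the record key -> partition index.
    # Kafka's default partitioner uses murmur2; crc32 is a stand-in.
    return zlib.crc32(key) % num_partitions

# Every change event for order id 42 hashes to the same partition,
# so a consumer sees that row's changes in order.
p1 = partition_for(b'{"id": 42}', 6)
p2 = partition_for(b'{"id": 42}', 6)
assert p1 == p2

# A different key may land on another partition: ordering holds
# per row, not across rows.
other = partition_for(b'{"id": 43}', 6)
```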
The Debezium event envelope
Every change event contains:
{
  "before": { "id": 42, "status": "pending" },
  "after": { "id": 42, "status": "shipped" },
  "source": {
    "connector": "postgresql",
    "db": "orders_db",
    "table": "orders",
    "lsn": 234881104,
    "txId": 98712
  },
  "op": "u",
  "ts_ms": 1711670400000
}
The op field indicates the operation: c (create), u (update), d (delete), or r (snapshot read). The before and after fields give you the full row state.
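A minimal consumer can reduce this envelope to a local materialized view. The sketch below (plain dicts standing in for a real store; the event shapes are trimmed to the fields used) handles each op code and uses the source LSN as a high-water mark so redelivered events are ignored:

```python
view = {}          # row state keyed by primary key
last_lsn = -1      # high-water mark for idempotent replay

def apply_change(event):
    """Apply one Debezium-style change event to the in-memory view."""
    global last_lsn
    lsn = event["source"]["lsn"]
    if lsn <= last_lsn:              # duplicate delivery: already applied
        return
    op = event["op"]
    if op in ("c", "u", "r"):        # create, update, snapshot read
        row = event["after"]
        view[row["id"]] = row
    elif op == "d":                  # delete: only 'before' carries the row
        view.pop(event["before"]["id"], None)
    last_lsn = lsn

apply_change({"op": "c", "before": None,
              "after": {"id": 42, "status": "pending"},
              "source": {"lsn": 100}})
apply_change({"op": "u", "before": {"id": 42, "status": "pending"},
              "after": {"id": 42, "status": "shipped"},
              "source": {"lsn": 101}})
apply_change({"op": "u", "before": {"id": 42, "status": "pending"},
              "after": {"id": 42, "status": "shipped"},
              "source": {"lsn": 101}})   # redelivery: ignored
assert view[42]["status"] == "shipped"
```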
Setting up Debezium with Kafka Connect
Step 1 — enable logical replication in Postgres:
ALTER SYSTEM SET wal_level = 'logical';
ALTER SYSTEM SET max_replication_slots = 4;
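Note that `ALTER SYSTEM` only writes the setting; changing `wal_level` takes effect after a Postgres restart. Verify before creating the connector:

```sql
SHOW wal_level;              -- must report 'logical'
SHOW max_replication_slots;  -- must cover one slot per Debezium connector
```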
Step 2 — create a connector configuration:
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.internal",
    "database.port": "5432",
    "database.user": "cdc_user",
    "database.password": "********",
    "database.dbname": "orders_db",
    "topic.prefix": "dbserver1",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot"
  }
}
Step 3 — deploy via the Kafka Connect REST API:
curl -X POST http://connect:8083/connectors \
-H "Content-Type: application/json" \
-d @orders-connector.json
Real-time materialized views
One of the most powerful CDC patterns is maintaining materialized views in a different data store. Example: your orders live in Postgres, but you need a denormalized view in Elasticsearch for full-text search.
A Kafka Streams or Flink job consumes the CDC topics, joins orders with customers, and writes the denormalized document to Elasticsearch. The view updates within seconds of a change in Postgres.
This replaces fragile dual-write patterns where the application writes to both Postgres and Elasticsearch.
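Under the hood, such a job keeps per-topic state and re-emits the joined document whenever either side changes. A stripped-down sketch of the idea, with plain dicts standing in for Kafka Streams state stores and for the Elasticsearch sink (all names here are illustrative):

```python
orders, customers, search_index = {}, {}, {}

def emit(order_id):
    """Rebuild the denormalized document when either side changes."""
    order = orders.get(order_id)
    if order is None:
        search_index.pop(order_id, None)   # order deleted: drop the doc
        return
    customer = customers.get(order["customer_id"], {})
    search_index[order_id] = {**order, "customer_name": customer.get("name")}

def on_order_event(event):
    if event["op"] == "d":
        oid = event["before"]["id"]
        orders.pop(oid, None)
        emit(oid)
    else:
        row = event["after"]
        orders[row["id"]] = row
        emit(row["id"])

def on_customer_event(event):
    row = event["after"]
    customers[row["id"]] = row
    for oid, order in orders.items():      # re-join affected orders
        if order["customer_id"] == row["id"]:
            emit(oid)

on_customer_event({"op": "c", "after": {"id": 7, "name": "Ada"}})
on_order_event({"op": "c",
                "after": {"id": 1, "customer_id": 7, "status": "pending"}})
assert search_index[1]["customer_name"] == "Ada"
```

A real Kafka Streams or Flink job adds fault-tolerant state, repartitioning by join key, and batched writes to the sink, but the data flow is the same.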
Event-driven sync between databases
CDC enables event-driven data sync without application-level coordination:
- Postgres to data warehouse — stream changes to Snowflake, BigQuery, or Redshift via a Kafka sink connector
- Postgres to cache — invalidate or update Redis keys when rows change
- Cross-service sync — the Orders service writes to its database; the Shipping service reacts to order events without direct API coupling
- Audit logging — every change event is an immutable audit record
Handling schema evolution
Schemas change. Debezium integrates with the Confluent Schema Registry (or Apicurio) to manage Avro or Protobuf schemas. When a column is added, the schema evolves automatically. Consumers using schema-aware deserializers handle new fields gracefully.
Critical rule: never rename or drop a column without coordinating with downstream consumers. Use a two-phase migration — add the new column, migrate consumers, then drop the old column.
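A sketch of the two-phase rename in SQL (hypothetical column names):

```sql
-- Phase 1: add the new column and backfill; old and new coexist.
ALTER TABLE orders ADD COLUMN shipped_at timestamptz;
UPDATE orders SET shipped_at = ship_date;

-- Phase 2: run only after every consumer reads shipped_at.
ALTER TABLE orders DROP COLUMN ship_date;
```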
Exactly-once semantics
CDC pipelines can achieve exactly-once delivery with:
- Kafka transactions — Debezium 2.x supports exactly-once source connectors with Kafka transactions
- Idempotent consumers — consumers use the event's LSN or transaction ID to deduplicate
- Outbox pattern — write events to an outbox table inside the same transaction as your data change, then CDC picks up the outbox table
The outbox pattern is especially useful for microservices that need to publish domain events reliably.
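The core of the outbox pattern is a single transaction that covers both the state change and the event row; Debezium then captures only the outbox table. Table and column names below are illustrative (Debezium also ships an outbox event router transform that expects a similar layout):

```sql
BEGIN;

UPDATE orders SET status = 'shipped' WHERE id = 42;

INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('order', '42', 'OrderShipped',
        '{"order_id": 42, "status": "shipped"}'::jsonb);

COMMIT;
```

Because both writes commit or roll back together, the event stream can never drift from the database state, which is exactly what dual writes fail to guarantee.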
Monitoring and operations
Key metrics to monitor:
- Replication slot lag — if the slot falls behind, you risk WAL bloat in Postgres
- Consumer lag — Kafka consumer group lag per partition
- Connector status — Kafka Connect exposes connector health via REST API
- Throughput — events per second through each connector
Set alerts on replication slot lag exceeding a threshold (e.g., 100 MB). A stuck connector can cause Postgres to retain WAL indefinitely, filling your disk.
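Replication slot lag can be checked directly in Postgres (PostgreSQL 10+):

```sql
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
```

An inactive slot with steadily growing `retained_wal` is the signature of a stuck or deleted connector.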
Common pitfalls
WAL disk bloat — a paused or failed connector holds a replication slot open. Postgres retains WAL segments until the slot catches up. Monitor pg_replication_slots and drop orphaned slots.
Large transactions — a single transaction that updates 1 million rows produces 1 million events. This can spike memory in both the connector and consumers. Batch large migrations separately from CDC.
Snapshot on restart — Debezium snapshots the entire table on first start. For large tables, this can take hours and produce massive initial load. Use snapshot.mode=schema_only if you only need changes going forward.
Schema drift — if the source schema changes without the registry being updated, consumers may fail to deserialize events.
Alternatives to Debezium
- AWS DMS — managed CDC for AWS databases, simpler but less flexible
- Google Datastream — managed CDC to BigQuery and Cloud Storage
- Striim — commercial CDC platform with a visual pipeline builder
- Materialize — streams CDC directly into incrementally maintained SQL views
- Estuary Flow — real-time CDC with a declarative pipeline model
Visualize your CDC architecture
Map your CDC pipeline end-to-end — from source database through Kafka to consumers — with Codelit. Generate an interactive architecture diagram in seconds.
Key takeaways
- Log-based CDC reads the WAL/binlog for zero-impact, complete change capture
- Debezium + Kafka Connect is the open-source standard for CDC pipelines
- One topic per table, keyed by primary key, preserves ordering
- Materialized views in Elasticsearch, Redis, or warehouses replace fragile dual writes
- Monitor replication slot lag — a stuck slot can fill your Postgres disk
- The outbox pattern ensures reliable event publishing within transactions
- This is article #381 of our ongoing system design series