# Data Pipeline Architecture: Batch, Streaming, and Modern ETL Patterns
Every company is a data company. The architecture that moves data from where it's created to where it's useful determines how fast you can make decisions, train models, and serve users.
This guide covers the patterns, tools, and trade-offs for building data pipelines at any scale.
## What Is a Data Pipeline?
```
Data Sources  →  Ingestion  →  Processing   →  Storage     →  Consumption
(databases,      (collect)     (transform)     (warehouse)    (dashboards,
 APIs, logs,                                                   ML models,
 events)                                                       reports)
```
## Batch vs Streaming
| | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Processing | Process all data at once | Process each event as it arrives |
| Tools | Spark, Airflow, dbt | Kafka, Flink, Spark Streaming |
| Cost | Cheaper (process during off-peak) | More expensive (always running) |
| Complexity | Lower | Higher |
| Best for | Reports, ML training, nightly syncs | Alerts, real-time dashboards, fraud detection |
Most systems need both: batch for historical analysis, streaming for real-time.
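The difference in processing model is easiest to see in a toy sketch (framework-agnostic, all names illustrative): batch sees the whole dataset at once, while streaming folds each event into running state as it arrives.

```python
events = [{"user": "a", "amount": 10},
          {"user": "b", "amount": 5},
          {"user": "a", "amount": 7}]

# Batch: process all data in one pass, e.g. on a nightly schedule.
def batch_totals(all_events):
    totals = {}
    for e in all_events:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

# Streaming: update state one event at a time, as events arrive.
class StreamingTotals:
    def __init__(self):
        self.state = {}

    def on_event(self, e):
        self.state[e["user"]] = self.state.get(e["user"], 0) + e["amount"]

stream = StreamingTotals()
for e in events:  # in production these would arrive over hours, not a loop
    stream.on_event(e)

print(batch_totals(events))  # {'a': 17, 'b': 5}
print(stream.state)          # {'a': 17, 'b': 5}
```

Both arrive at the same answer; what differs is latency (streaming has it immediately after each event) and cost (streaming state must be kept alive continuously).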
## ETL vs ELT
### ETL (Extract, Transform, Load)
Transform data BEFORE loading into the warehouse:
```
Sources → Extract → Transform (clean, join, aggregate) → Load → Warehouse
```
When: Data needs cleaning before it's useful, limited warehouse compute.
### ELT (Extract, Load, Transform)
Load raw data first, transform IN the warehouse:
```
Sources → Extract → Load (raw) → Warehouse → Transform (SQL/dbt)
```
When: Modern cloud warehouses (Snowflake, BigQuery) have cheap compute. Transform with SQL.
ELT is winning. Modern warehouses handle transformation better than external tools. dbt is the standard for SQL-based transforms.
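A minimal sketch of the ELT pattern, with SQLite standing in for a cloud warehouse (table and column names are made up): raw rows are loaded untouched, then the transform runs as SQL inside the warehouse, which is exactly where a dbt model would live.

```python
import sqlite3

# SQLite as a stand-in "warehouse" for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT, amount REAL)")

# L: load raw, untransformed data straight from the source.
db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, "paid", 20.0), (2, "refunded", 15.0), (3, "paid", 30.0)])

# T: transform in-warehouse with SQL. This SELECT is the kind of statement
# that would live in a dbt model file (e.g. a hypothetical order_revenue.sql).
db.execute("""
    CREATE TABLE order_revenue AS
    SELECT status, SUM(amount) AS total, COUNT(*) AS n
    FROM raw_orders
    GROUP BY status
""")

rows = db.execute(
    "SELECT status, total FROM order_revenue ORDER BY status").fetchall()
print(rows)  # [('paid', 50.0), ('refunded', 15.0)]
```

Because the raw table is preserved, the transform can be rewritten and rerun at any time without re-extracting from the source.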
## Architecture Patterns
### 1. Simple Batch Pipeline
```
PostgreSQL → Airflow (daily schedule)
             → Extract (pg_dump or query)
             → Load to S3 (raw)
             → dbt transform (SQL)
             → BigQuery/Snowflake
             → Looker/Metabase dashboards
```
- Tools: Airflow + dbt + cloud warehouse
- Latency: Daily or hourly
- Best for: Analytics, reporting, ML feature engineering
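The core idea behind an orchestrator like Airflow is a DAG of tasks that run only after their upstream dependencies finish. A plain-Python sketch using the standard library's `graphlib` (task names mirror the pipeline above; the bodies are stubs, not real Airflow operators):

```python
from graphlib import TopologicalSorter

ran = []
tasks = {
    "extract":   lambda: ran.append("extract"),
    "load_s3":   lambda: ran.append("load_s3"),
    "dbt_run":   lambda: ran.append("dbt_run"),
    "dashboard": lambda: ran.append("dashboard"),
}

# Each key runs only after its listed upstream dependencies complete.
deps = {"load_s3": {"extract"},
        "dbt_run": {"load_s3"},
        "dashboard": {"dbt_run"}}

# static_order() yields tasks with all predecessors first.
for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(ran)  # ['extract', 'load_s3', 'dbt_run', 'dashboard']
```

A real orchestrator adds what this sketch omits: scheduling, retries, parallelism, and per-task logging.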
### 2. Real-Time Streaming
```
App Events → Kafka → Stream Processor (Flink)
                        ↓                ↓
                  Real-time DB       Data Lake
                  (ClickHouse)       (S3/Parquet)
                        ↓                ↓
                  Live Dashboard   Batch Analytics
```
- Tools: Kafka + Flink + ClickHouse
- Latency: Sub-second
- Best for: Real-time dashboards, fraud detection, recommendations
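One primitive a stream processor like Flink provides is the tumbling window: events are bucketed by fixed, non-overlapping intervals of their timestamp and aggregated per bucket. A toy version (timestamps in seconds, 60-second windows, all values illustrative):

```python
WINDOW = 60  # tumbling window size in seconds

def window_counts(events):
    """Count events per tumbling window of their timestamp."""
    counts = {}
    for ts, _payload in events:
        bucket = ts - (ts % WINDOW)  # start of the window this event falls in
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

events = [(5, "click"), (42, "click"), (61, "view"), (130, "click")]
print(window_counts(events))  # {0: 2, 60: 1, 120: 1}
```

The hard parts a real engine handles for you, and this sketch does not, are late-arriving events, watermarks, and exactly-once state checkpoints.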
### 3. Lambda Architecture (Batch + Streaming)
```
Events → Kafka → Speed Layer (Flink, real-time) → Serving Layer
               → Batch Layer (Spark, hourly)    → Serving Layer
                                                      ↓
                                                Merge results
                                                      ↓
                                                 Application
```
Speed layer gives approximate real-time results. Batch layer provides accurate historical results. Serving layer merges both.
Downside: Two codepaths to maintain. Complexity is high.
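The serving layer's merge can be sketched in a few lines (names and values are illustrative): the batch view is authoritative up to a checkpoint timestamp, the speed layer covers only events after it, and a query sums the two.

```python
BATCH_CHECKPOINT = 100  # batch layer has processed all events with ts <= 100

batch_view = {"clicks": 940}  # accurate counts up to the checkpoint
speed_view = {"clicks": 7}    # approximate counts for events after it

def serve(metric):
    """Merge the accurate batch view with the fresh speed-layer view."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(serve("clicks"))  # 947
```

When the next batch run completes, the checkpoint advances and the speed-layer counts it now covers are discarded, which is exactly the bookkeeping that makes Lambda operationally heavy.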
### 4. Kappa Architecture (Streaming Only)
```
Events → Kafka (immutable log) → Stream Processor → Serving Layer
                                                         ↓
                                                    Application
```
Everything is a stream. Replay the Kafka log to reprocess historical data. No separate batch layer.
Simpler than Lambda but requires your streaming framework to handle both real-time and batch workloads.
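Kappa in miniature, with the Kafka topic modeled as an immutable list: reprocessing is just re-running a stream job from offset 0, so changed logic produces a fresh, consistent view from the same log.

```python
log = [("a", 1), ("b", 2), ("a", 3)]  # immutable event log of (key, value)

def run_stream_job(log, fn):
    """Replay the whole log through a per-key fold, starting from state 0."""
    state = {}
    for key, value in log:
        state[key] = fn(state.get(key, 0), value)
    return state

v1 = run_stream_job(log, lambda acc, v: acc + v)      # original job: sum
v2 = run_stream_job(log, lambda acc, v: max(acc, v))  # new logic: same log,
                                                      # replayed into a new view
print(v1)  # {'a': 4, 'b': 2}
print(v2)  # {'a': 3, 'b': 2}
```

This only works because the log is append-only and retained long enough to replay, which is why Kappa deployments lean on long Kafka retention or tiered storage.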
### 5. Change Data Capture (CDC)
Capture database changes as events:
```
PostgreSQL → Debezium (CDC) → Kafka → Stream Processor → Analytics DB
                                    → Search Index (Elasticsearch)
                                    → Cache (Redis)
```
No dual-write problem. Database is the source of truth. Downstream systems stay in sync via CDC events.
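A consumer keeps a downstream copy in sync by applying change events in order. The event shape below loosely mirrors a Debezium change record, where `op` is `c`/`u`/`d` for create/update/delete; a real payload carries `before`/`after` images and more metadata.

```python
replica = {}  # downstream copy, keyed by primary key

def apply_change(event):
    """Apply one simplified CDC event to the replica."""
    key, op = event["id"], event["op"]
    if op in ("c", "u"):           # create/update: take the new row image
        replica[key] = event["after"]
    elif op == "d":                # delete: drop the row if present
        replica.pop(key, None)

changes = [
    {"op": "c", "id": 1, "after": {"email": "a@x.com"}},
    {"op": "u", "id": 1, "after": {"email": "a@y.com"}},
    {"op": "c", "id": 2, "after": {"email": "b@x.com"}},
    {"op": "d", "id": 2},
]
for e in changes:
    apply_change(e)

print(replica)  # {1: {'email': 'a@y.com'}}
```

Because applying the same event twice leaves the replica unchanged, replaying a Kafka partition from an earlier offset is safe, which is what makes CDC consumers tolerant of redelivery.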
Tools: Debezium, AWS DMS, Airbyte, Fivetran
### 6. Modern Data Stack
```
Sources (SaaS APIs, DBs, events)
  → Ingestion (Fivetran / Airbyte)
  → Raw Layer (Snowflake / BigQuery)
  → Transform (dbt)
  → Semantic Layer (dbt metrics / Cube)
  → BI (Looker / Metabase / Preset)
  → Reverse ETL (Census / Hightouch) → back to SaaS tools
```
The modern standard for analytics teams. SQL-centric, version-controlled (dbt), modular.
## Tool Comparison
### Orchestration
| Tool | Type | Best For |
|---|---|---|
| Airflow | DAG scheduler | Complex batch workflows |
| Dagster | Data-aware orchestrator | Modern alternative to Airflow |
| Prefect | Cloud-native orchestrator | Simpler than Airflow |
| dbt Cloud | SQL transform scheduler | dbt model runs |
### Ingestion
| Tool | Type | Best For |
|---|---|---|
| Fivetran | Managed EL | SaaS connectors (300+), zero code |
| Airbyte | Open-source EL | Self-hosted, custom connectors |
| Debezium | CDC | Database change capture to Kafka |
| Kafka Connect | Connector framework | Kafka ecosystem integration |
### Processing
| Tool | Type | Best For |
|---|---|---|
| dbt | SQL transforms | ELT in warehouse |
| Spark | Distributed compute | Large-scale batch processing |
| Flink | Stream processing | Real-time, exactly-once |
| Kafka Streams | Stream library | JVM apps, simple streaming |
### Storage
| Tool | Type | Best For |
|---|---|---|
| Snowflake | Cloud warehouse | Analytics, separation of compute/storage |
| BigQuery | Serverless warehouse | GCP native, serverless pricing |
| Databricks | Lakehouse | ML + analytics unified |
| ClickHouse | OLAP database | Real-time analytics, fast aggregations |
## Data Quality
Bad data is worse than no data. Build quality checks into the pipeline:
```
Source → Ingest → Quality Check       → Transform → Quality Check → Serve
                  (schema validation,               (row counts, nulls,
                   freshness check)                  distribution checks)
```
Tools:
- Great Expectations — data quality assertions
- dbt tests — built-in schema and custom tests
- Monte Carlo — data observability platform
- Soda — data quality checks as code
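Under the hood these tools run assertions much like the hand-rolled check below: row counts, null rates, and domain rules evaluated between pipeline stages. The field names and thresholds here are made up for illustration.

```python
def check_batch(rows):
    """Return a list of quality-check failures for a batch of records."""
    failures = []
    if len(rows) == 0:
        failures.append("row count is zero")
    null_emails = sum(1 for r in rows if r.get("email") is None)
    if rows and null_emails / len(rows) > 0.05:  # more than 5% nulls fails
        failures.append("email null rate too high")
    if any(r["amount"] < 0 for r in rows):       # domain rule: no negatives
        failures.append("negative amount")
    return failures

good = [{"email": "a@x.com", "amount": 10.0}]
bad  = [{"email": None, "amount": -5.0}]
print(check_batch(good))  # []
print(check_batch(bad))   # ['email null rate too high', 'negative amount']
```

The point of the dedicated tools is not the checks themselves but the plumbing around them: declarative configs, scheduling, alerting, and history.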
## Best Practices
- Idempotent pipelines — rerunning produces the same result
- Schema evolution — handle new/removed/changed fields gracefully
- Backfill capability — ability to reprocess historical data
- Monitoring — track freshness, row counts, schema changes
- Incremental processing — don't reprocess everything every run
- Data contracts — producers and consumers agree on schema
- Lineage tracking — know where every column comes from
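Two of these practices fit in one sketch (all names illustrative): incremental processing reads only rows past the last high-water mark, and idempotency comes from upserting by key, so rerunning the same batch changes nothing.

```python
source = [
    {"id": 1, "updated_at": 10, "v": "a"},
    {"id": 2, "updated_at": 20, "v": "b"},
    {"id": 2, "updated_at": 30, "v": "b2"},  # later update to the same row
]
target, watermark = {}, 0

def run_incremental():
    """Process only rows newer than the watermark; upsert by primary key."""
    global watermark
    batch = [r for r in source if r["updated_at"] > watermark]
    for r in batch:
        target[r["id"]] = r            # upsert: keyed write, not append
    if batch:
        watermark = max(r["updated_at"] for r in batch)

run_incremental()
run_incremental()                      # rerun: empty batch, same result

print(target[2]["v"], watermark)  # b2 30
```

Resetting the watermark to an earlier value gives you the backfill capability from the list above for free: reprocessing is just a rerun from an older mark.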
## Scaling Decisions
| Data Volume | Recommended |
|---|---|
| < 1GB/day | PostgreSQL + dbt + cron |
| 1-100GB/day | Cloud warehouse + Airflow + dbt |
| 100GB-10TB/day | Spark/Databricks + Kafka + cloud warehouse |
| > 10TB/day | Custom: Flink + Kafka + data lake (Iceberg) |
## Summary
- ELT over ETL — let the warehouse do the heavy lifting
- Batch for analytics, streaming for real-time — most need both
- CDC for database sync — no dual writes, Debezium + Kafka
- dbt for transforms — SQL, version-controlled, tested
- Data quality is non-negotiable — test at every stage
- Start simple — PostgreSQL + dbt before Kafka + Flink
This is the 100th blog post on Codelit.io! Design your data pipeline architecture at codelit.io — 100 product specs, 100 blog posts, 90+ templates, 29 export formats.