# Data Pipeline Architecture: Batch, Streaming, and Modern ETL Patterns
Every company is a data company. The architecture that moves data from where it's created to where it's useful determines how fast you can make decisions, train models, and serve users.
This guide covers the patterns, tools, and trade-offs for building data pipelines at any scale.
## What Is a Data Pipeline?
```
Data Sources  →  Ingestion  →  Processing   →  Storage     →  Consumption
(databases,      (collect)     (transform)     (warehouse)    (dashboards,
 APIs, logs,                                                   ML models,
 events)                                                       reports)
```
## Batch vs Streaming
| | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Processing | Process all data at once | Process each event as it arrives |
| Tools | Spark, Airflow, dbt | Kafka, Flink, Spark Streaming |
| Cost | Cheaper (process during off-peak) | More expensive (always running) |
| Complexity | Lower | Higher |
| Best for | Reports, ML training, nightly syncs | Alerts, real-time dashboards, fraud detection |
Most systems need both: batch for historical analysis, streaming for real-time.
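The difference in processing model is easiest to see in a toy sketch (framework-agnostic, all names illustrative): batch sees the whole dataset at once, while streaming folds each event into running state as it arrives.

```python
events = [{"user": "a", "amount": 10},
          {"user": "b", "amount": 5},
          {"user": "a", "amount": 7}]

# Batch: process all data in one pass, e.g. on a nightly schedule.
def batch_totals(all_events):
    totals = {}
    for e in all_events:
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    return totals

# Streaming: update state one event at a time, as events arrive.
class StreamingTotals:
    def __init__(self):
        self.state = {}

    def on_event(self, e):
        self.state[e["user"]] = self.state.get(e["user"], 0) + e["amount"]

stream = StreamingTotals()
for e in events:  # in production these would arrive over hours, not a loop
    stream.on_event(e)

print(batch_totals(events))  # {'a': 17, 'b': 5}
print(stream.state)          # {'a': 17, 'b': 5}
```

Both arrive at the same answer; what differs is latency (streaming has it immediately after each event) and cost (streaming state must be kept alive continuously).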
## ETL vs ELT
### ETL (Extract, Transform, Load)
Transform data BEFORE loading into the warehouse:
```
Sources → Extract → Transform (clean, join, aggregate) → Load → Warehouse
```
When: Data needs cleaning before it's useful, limited warehouse compute.
### ELT (Extract, Load, Transform)
Load raw data first, transform IN the warehouse:
```
Sources → Extract → Load (raw) → Warehouse → Transform (SQL/dbt)
```
When: Modern cloud warehouses (Snowflake, BigQuery) have cheap compute. Transform with SQL.
ELT is winning. Modern warehouses handle transformation better than external tools. dbt is the standard for SQL-based transforms.
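A minimal sketch of the ELT pattern, with SQLite standing in for a cloud warehouse (table and column names are made up): raw rows are loaded untouched, then the transform runs as SQL inside the warehouse, which is exactly where a dbt model would live.

```python
import sqlite3

# SQLite as a stand-in "warehouse" for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT, amount REAL)")

# L: load raw, untransformed data straight from the source.
db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, "paid", 20.0), (2, "refunded", 15.0), (3, "paid", 30.0)])

# T: transform in-warehouse with SQL. This SELECT is the kind of statement
# that would live in a dbt model file (e.g. a hypothetical order_revenue.sql).
db.execute("""
    CREATE TABLE order_revenue AS
    SELECT status, SUM(amount) AS total, COUNT(*) AS n
    FROM raw_orders
    GROUP BY status
""")

rows = db.execute(
    "SELECT status, total FROM order_revenue ORDER BY status").fetchall()
print(rows)  # [('paid', 50.0), ('refunded', 15.0)]
```

Because the raw table is preserved, the transform can be rewritten and rerun at any time without re-extracting from the source.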
## Architecture Patterns
### 1. Simple Batch Pipeline
```
PostgreSQL → Airflow (daily schedule)
             → Extract (pg_dump or query)
             → Load to S3 (raw)
             → dbt transform (SQL)
             → BigQuery/Snowflake
             → Looker/Metabase dashboards
```
- Tools: Airflow + dbt + cloud warehouse
- Latency: Daily or hourly
- Best for: Analytics, reporting, ML feature engineering
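The core idea behind an orchestrator like Airflow is a DAG of tasks that run only after their upstream dependencies finish. A plain-Python sketch using the standard library's `graphlib` (task names mirror the pipeline above; the bodies are stubs, not real Airflow operators):

```python
from graphlib import TopologicalSorter

ran = []
tasks = {
    "extract":   lambda: ran.append("extract"),
    "load_s3":   lambda: ran.append("load_s3"),
    "dbt_run":   lambda: ran.append("dbt_run"),
    "dashboard": lambda: ran.append("dashboard"),
}

# Each key runs only after its listed upstream dependencies complete.
deps = {"load_s3": {"extract"},
        "dbt_run": {"load_s3"},
        "dashboard": {"dbt_run"}}

# static_order() yields tasks with all predecessors first.
for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(ran)  # ['extract', 'load_s3', 'dbt_run', 'dashboard']
```

A real orchestrator adds what this sketch omits: scheduling, retries, parallelism, and per-task logging.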
### 2. Real-Time Streaming
```
App Events → Kafka → Stream Processor (Flink)
                        ↓                ↓
                  Real-time DB       Data Lake
                  (ClickHouse)       (S3/Parquet)
                        ↓                ↓
                  Live Dashboard   Batch Analytics
```
- Tools: Kafka + Flink + ClickHouse
- Latency: Sub-second
- Best for: Real-time dashboards, fraud detection, recommendations
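One primitive a stream processor like Flink provides is the tumbling window: events are bucketed by fixed, non-overlapping intervals of their timestamp and aggregated per bucket. A toy version (timestamps in seconds, 60-second windows, all values illustrative):

```python
WINDOW = 60  # tumbling window size in seconds

def window_counts(events):
    """Count events per tumbling window of their timestamp."""
    counts = {}
    for ts, _payload in events:
        bucket = ts - (ts % WINDOW)  # start of the window this event falls in
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

events = [(5, "click"), (42, "click"), (61, "view"), (130, "click")]
print(window_counts(events))  # {0: 2, 60: 1, 120: 1}
```

The hard parts a real engine handles for you, and this sketch does not, are late-arriving events, watermarks, and exactly-once state checkpoints.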
### 3. Lambda Architecture (Batch + Streaming)
```
Events → Kafka → Speed Layer (Flink, real-time) → Serving Layer
               → Batch Layer (Spark, hourly)    → Serving Layer
                                                      ↓
                                                Merge results
                                                      ↓
                                                 Application
```
Speed layer gives approximate real-time results. Batch layer provides accurate historical results. Serving layer merges both.
Downside: Two codepaths to maintain. Complexity is high.
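The serving layer's merge can be sketched in a few lines (names and values are illustrative): the batch view is authoritative up to a checkpoint timestamp, the speed layer covers only events after it, and a query sums the two.

```python
BATCH_CHECKPOINT = 100  # batch layer has processed all events with ts <= 100

batch_view = {"clicks": 940}  # accurate counts up to the checkpoint
speed_view = {"clicks": 7}    # approximate counts for events after it

def serve(metric):
    """Merge the accurate batch view with the fresh speed-layer view."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(serve("clicks"))  # 947
```

When the next batch run completes, the checkpoint advances and the speed-layer counts it now covers are discarded, which is exactly the bookkeeping that makes Lambda operationally heavy.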
### 4. Kappa Architecture (Streaming Only)
```
Events → Kafka (immutable log) → Stream Processor → Serving Layer
                                                         ↓
                                                    Application
```
Everything is a stream. Replay the Kafka log to reprocess historical data. No separate batch layer.
Simpler than Lambda but requires your streaming framework to handle both real-time and batch workloads.
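Kappa in miniature, with the Kafka topic modeled as an immutable list: reprocessing is just re-running a stream job from offset 0, so changed logic produces a fresh, consistent view from the same log.

```python
log = [("a", 1), ("b", 2), ("a", 3)]  # immutable event log of (key, value)

def run_stream_job(log, fn):
    """Replay the whole log through a per-key fold, starting from state 0."""
    state = {}
    for key, value in log:
        state[key] = fn(state.get(key, 0), value)
    return state

v1 = run_stream_job(log, lambda acc, v: acc + v)      # original job: sum
v2 = run_stream_job(log, lambda acc, v: max(acc, v))  # new logic: same log,
                                                      # replayed into a new view
print(v1)  # {'a': 4, 'b': 2}
print(v2)  # {'a': 3, 'b': 2}
```

This only works because the log is append-only and retained long enough to replay, which is why Kappa deployments lean on long Kafka retention or tiered storage.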
### 5. Change Data Capture (CDC)
Capture database changes as events:
```
PostgreSQL → Debezium (CDC) → Kafka → Stream Processor → Analytics DB
                                    → Search Index (Elasticsearch)
                                    → Cache (Redis)
```
No dual-write problem. Database is the source of truth. Downstream systems stay in sync via CDC events.
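A consumer keeps a downstream copy in sync by applying change events in order. The event shape below loosely mirrors a Debezium change record, where `op` is `c`/`u`/`d` for create/update/delete; a real payload carries `before`/`after` images and more metadata.

```python
replica = {}  # downstream copy, keyed by primary key

def apply_change(event):
    """Apply one simplified CDC event to the replica."""
    key, op = event["id"], event["op"]
    if op in ("c", "u"):           # create/update: take the new row image
        replica[key] = event["after"]
    elif op == "d":                # delete: drop the row if present
        replica.pop(key, None)

changes = [
    {"op": "c", "id": 1, "after": {"email": "a@x.com"}},
    {"op": "u", "id": 1, "after": {"email": "a@y.com"}},
    {"op": "c", "id": 2, "after": {"email": "b@x.com"}},
    {"op": "d", "id": 2},
]
for e in changes:
    apply_change(e)

print(replica)  # {1: {'email': 'a@y.com'}}
```

Because applying the same event twice leaves the replica unchanged, replaying a Kafka partition from an earlier offset is safe, which is what makes CDC consumers tolerant of redelivery.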
Tools: Debezium, AWS DMS, Airbyte, Fivetran
### 6. Modern Data Stack
```
Sources (SaaS APIs, DBs, events)
  → Ingestion (Fivetran / Airbyte)
  → Raw Layer (Snowflake / BigQuery)
  → Transform (dbt)
  → Semantic Layer (dbt metrics / Cube)
  → BI (Looker / Metabase / Preset)
  → Reverse ETL (Census / Hightouch) → back to SaaS tools
```
The modern standard for analytics teams. SQL-centric, version-controlled (dbt), modular.
## Tool Comparison
### Orchestration
| Tool | Type | Best For |
|---|---|---|
| Airflow | DAG scheduler | Complex batch workflows |
| Dagster | Data-aware orchestrator | Modern alternative to Airflow |
| Prefect | Cloud-native orchestrator | Simpler than Airflow |
| dbt Cloud | SQL transform scheduler | dbt model runs |
### Ingestion
| Tool | Type | Best For |
|---|---|---|
| Fivetran | Managed EL | SaaS connectors (300+), zero code |
| Airbyte | Open-source EL | Self-hosted, custom connectors |
| Debezium | CDC | Database change capture to Kafka |
| Kafka Connect | Connector framework | Kafka ecosystem integration |
### Processing
| Tool | Type | Best For |
|---|---|---|
| dbt | SQL transforms | ELT in warehouse |
| Spark | Distributed compute | Large-scale batch processing |
| Flink | Stream processing | Real-time, exactly-once |
| Kafka Streams | Stream library | JVM apps, simple streaming |
### Storage
| Tool | Type | Best For |
|---|---|---|
| Snowflake | Cloud warehouse | Analytics, separation of compute/storage |
| BigQuery | Serverless warehouse | GCP native, serverless pricing |
| Databricks | Lakehouse | ML + analytics unified |
| ClickHouse | OLAP database | Real-time analytics, fast aggregations |
## Data Quality
Bad data is worse than no data. Build quality checks into the pipeline:
```
Source → Ingest → Quality Check       → Transform → Quality Check → Serve
                  (schema validation,               (row counts, nulls,
                   freshness check)                  distribution checks)
```
Tools:
- Great Expectations — data quality assertions
- dbt tests — built-in schema and custom tests
- Monte Carlo — data observability platform
- Soda — data quality checks as code
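Under the hood these tools run assertions much like the hand-rolled check below: row counts, null rates, and domain rules evaluated between pipeline stages. The field names and thresholds here are made up for illustration.

```python
def check_batch(rows):
    """Return a list of quality-check failures for a batch of records."""
    failures = []
    if len(rows) == 0:
        failures.append("row count is zero")
    null_emails = sum(1 for r in rows if r.get("email") is None)
    if rows and null_emails / len(rows) > 0.05:  # more than 5% nulls fails
        failures.append("email null rate too high")
    if any(r["amount"] < 0 for r in rows):       # domain rule: no negatives
        failures.append("negative amount")
    return failures

good = [{"email": "a@x.com", "amount": 10.0}]
bad  = [{"email": None, "amount": -5.0}]
print(check_batch(good))  # []
print(check_batch(bad))   # ['email null rate too high', 'negative amount']
```

The point of the dedicated tools is not the checks themselves but the plumbing around them: declarative configs, scheduling, alerting, and history.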
## Best Practices
- Idempotent pipelines — rerunning produces the same result
- Schema evolution — handle new/removed/changed fields gracefully
- Backfill capability — ability to reprocess historical data
- Monitoring — track freshness, row counts, schema changes
- Incremental processing — don't reprocess everything every run
- Data contracts — producers and consumers agree on schema
- Lineage tracking — know where every column comes from
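Two of these practices fit in one sketch (all names illustrative): incremental processing reads only rows past the last high-water mark, and idempotency comes from upserting by key, so rerunning the same batch changes nothing.

```python
source = [
    {"id": 1, "updated_at": 10, "v": "a"},
    {"id": 2, "updated_at": 20, "v": "b"},
    {"id": 2, "updated_at": 30, "v": "b2"},  # later update to the same row
]
target, watermark = {}, 0

def run_incremental():
    """Process only rows newer than the watermark; upsert by primary key."""
    global watermark
    batch = [r for r in source if r["updated_at"] > watermark]
    for r in batch:
        target[r["id"]] = r            # upsert: keyed write, not append
    if batch:
        watermark = max(r["updated_at"] for r in batch)

run_incremental()
run_incremental()                      # rerun: empty batch, same result

print(target[2]["v"], watermark)  # b2 30
```

Resetting the watermark to an earlier value gives you the backfill capability from the list above for free: reprocessing is just a rerun from an older mark.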
## Scaling Decisions
| Data Volume | Recommended |
|---|---|
| < 1GB/day | PostgreSQL + dbt + cron |
| 1-100GB/day | Cloud warehouse + Airflow + dbt |
| 100GB-10TB/day | Spark/Databricks + Kafka + cloud warehouse |
| > 10TB/day | Custom: Flink + Kafka + data lake (Iceberg) |
## Summary
- ELT over ETL — let the warehouse do the heavy lifting
- Batch for analytics, streaming for real-time — most need both
- CDC for database sync — no dual writes, Debezium + Kafka
- dbt for transforms — SQL, version-controlled, tested
- Data quality is non-negotiable — test at every stage
- Start simple — PostgreSQL + dbt before Kafka + Flink
This is the 100th blog post on Codelit.io! Design your data pipeline architecture at codelit.io — 100 product specs, 100 blog posts, 90+ templates, 29 export formats.