Data Pipeline Architecture: A Complete Guide to Building Scalable ETL and Streaming Systems
Every modern company runs on data. Whether you are powering a recommendation engine, generating compliance reports, or feeding a real-time dashboard, you need a data pipeline architecture that moves data reliably from source to destination while transforming it along the way.
This guide covers the core patterns, trade-offs, and tools you need to design production-grade pipelines.
What Is a Data Pipeline?
A data pipeline is an automated workflow that ingests raw data from one or more sources, transforms it into a useful shape, loads it into a target store, and serves it to downstream consumers.
┌──────────┐ ┌─────────────┐ ┌──────────┐ ┌──────────┐
│ Ingest │───▶│ Transform │───▶│ Load │───▶│ Serve │
│ (sources) │ │ (clean/join)│ │ (storage)│ │ (query) │
└──────────┘ └─────────────┘ └──────────┘ └──────────┘
Each stage can fail independently, so a well-designed pipeline treats observability, idempotency, and schema management as first-class concerns.
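The four stages can be sketched as composable functions. This is a toy illustration, not any specific framework's API; all names here are made up.

```python
# Minimal sketch of the four pipeline stages as composable functions.
# All names are illustrative, not from any specific framework.

def ingest(source):
    """Pull raw records from a source (here, an in-memory list)."""
    return list(source)

def transform(records):
    """Clean and reshape: drop malformed rows, normalize fields."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in records
        if "user" in r and "amount" in r
    ]

def load(records, store):
    """Write transformed records into a target store (a dict keyed by user)."""
    for r in records:
        store[r["user"]] = store.get(r["user"], 0.0) + r["amount"]
    return store

def serve(store, user):
    """Answer a downstream query from the loaded store."""
    return store.get(user, 0.0)

raw = [{"user": " Alice ", "amount": "9.50"},
       {"user": "bob", "amount": "3"},
       {"bad": "row"}]  # malformed record, dropped by transform
warehouse = load(transform(ingest(raw)), {})
print(serve(warehouse, "alice"))  # → 9.5
```

Note that each stage takes the previous stage's output as input, which is exactly what makes the stages independently testable and retryable.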
Batch Processing vs Stream Processing
Batch Processing
Batch pipelines process data in discrete chunks — hourly, daily, or on-demand. They are simpler to reason about and easier to retry.
| Trait | Batch |
|---|---|
| Latency | Minutes to hours |
| Complexity | Lower |
| Tools | Spark, Airflow, dbt |
| Use case | Reports, ML training, backfills |
Stream Processing
Stream pipelines process events continuously as they arrive. They power real-time analytics, fraud detection, and live dashboards.
| Trait | Stream |
|---|---|
| Latency | Milliseconds to seconds |
| Complexity | Higher |
| Tools | Kafka, Flink, Spark Streaming |
| Use case | Alerting, CDC, real-time features |
Hybrid (Lambda & Kappa)
Many production systems combine both. The Lambda architecture runs a batch layer for correctness alongside a speed layer for low latency. The Kappa architecture simplifies this by treating everything as a stream and replaying the log when corrections are needed.
Lambda Architecture
┌─────── Batch Layer (Spark/Airflow) ──────┐
Source ──▶│ │──▶ Serving Layer
└─────── Speed Layer (Kafka/Flink) ────────┘
Kappa Architecture
Source ──▶ Stream Layer (Kafka + Flink) ──▶ Serving Layer
▲ │
└── replay ──────┘
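Kappa's central idea — state is always derivable by reprocessing the append-only log from the beginning — can be shown with a toy in-memory log. The log here is a plain list standing in for a Kafka topic; all names are illustrative.

```python
# Illustrative sketch of Kappa-style replay: state is rebuilt at any time
# by reprocessing the append-only event log from a chosen offset.

log = []  # append-only event log (stand-in for a Kafka topic)

def append(event):
    log.append(event)

def replay(process, from_offset=0):
    """Rebuild state by reprocessing every event from a given offset."""
    state = {}
    for event in log[from_offset:]:
        process(state, event)
    return state

def count_by_type(state, event):
    state[event["type"]] = state.get(event["type"], 0) + 1

append({"type": "click"})
append({"type": "view"})
append({"type": "click"})

# Shipped a bug fix or a new transform? Replay the log with the new logic.
print(replay(count_by_type))  # → {'click': 2, 'view': 1}
```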
ETL vs ELT
ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. This was the standard when storage was expensive and compute happened on-prem.
ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse using SQL. Cloud warehouses like Snowflake and BigQuery make this cost-effective because compute scales independently from storage.
ETL: Source ──▶ Transform (Spark) ──▶ Load (Warehouse)
ELT: Source ──▶ Load (Warehouse) ──▶ Transform (dbt/SQL)
Modern stacks overwhelmingly favor ELT because it preserves raw data, simplifies ingestion, and lets analysts iterate on transformations without re-running the entire pipeline.
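The ELT flow — land raw data untouched, then transform with SQL inside the warehouse — can be sketched with sqlite3 standing in for Snowflake or BigQuery. The table and view names are invented for the example.

```python
import sqlite3

# ELT sketch: load raw data as-is, then transform with SQL inside the
# "warehouse" (sqlite3 stands in for Snowflake/BigQuery here).
con = sqlite3.connect(":memory:")

# Load: raw events land untouched, no pre-load transformation.
con.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
con.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("u1", "10.0"), ("u1", "5.5"), ("u2", "bad")])

# Transform: a dbt-style SQL model layered on top of the raw table.
# Because raw_events is preserved, the view can be redefined and
# re-run at any time without re-ingesting anything.
con.execute("""
    CREATE VIEW revenue_by_user AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_events
    WHERE amount GLOB '[0-9]*'   -- filter unparseable rows in SQL
    GROUP BY user_id
""")
print(con.execute("SELECT * FROM revenue_by_user").fetchall())  # → [('u1', 15.5)]
```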
Pipeline Stages in Detail
1. Ingest
Pull data from APIs, databases via change data capture (CDC), event streams, or file drops. Tools: Kafka, Debezium, Fivetran, Airbyte.
Key concern: delivery guarantees. Kafka can provide exactly-once semantics through idempotent producers and transactions, with consumers reading at the read_committed isolation level. Without such guarantees, retries introduce duplicates and failures can drop records.
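As a rough sketch of the producer side, using the confluent-kafka Python client (this assumes a running broker and is shown for illustration only; the topic name and transactional id are invented):

```python
# Sketch of exactly-once producer settings with confluent-kafka.
# Not runnable as-is: assumes a broker at localhost:9092.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,       # broker de-duplicates retried sends
    "acks": "all",                    # required for idempotence
    "transactional.id": "orders-etl"  # enables atomic, transactional writes
})

producer.init_transactions()
producer.begin_transaction()
producer.produce("orders", key="order-42", value=b'{"amount": 99.0}')
producer.commit_transaction()
```

Downstream consumers then set `isolation.level` to `read_committed` so that aborted transactions are never observed.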
2. Transform
Clean, deduplicate, join, and enrich. This is where business logic lives.
- Batch transforms: dbt models, Spark jobs, SQL scripts in Airflow
- Stream transforms: Flink SQL, Kafka Streams, Spark Structured Streaming
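A typical transform combines the operations above. Here is a small, framework-free sketch of deduplication plus an enrichment join against a lookup table; all field names are invented.

```python
# Sketch of a transform step: deduplicate by event id, then enrich
# each event by joining against a user lookup table.

events = [
    {"id": "e1", "user_id": "u1", "amount": 20.0},
    {"id": "e1", "user_id": "u1", "amount": 20.0},  # duplicate delivery
    {"id": "e2", "user_id": "u2", "amount": 5.0},
]
users = {"u1": {"country": "DE"}, "u2": {"country": "US"}}

def transform(events, users):
    seen, out = set(), []
    for e in events:
        if e["id"] in seen:          # deduplicate on the event id
            continue
        seen.add(e["id"])
        user = users.get(e["user_id"], {})
        out.append({**e, "country": user.get("country")})  # enrich (join)
    return out

print(transform(events, users))  # 2 rows: the duplicate is dropped
```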
3. Load
Write the transformed data to a target: a data warehouse (Snowflake, BigQuery, Redshift), a data lake (S3 + Iceberg/Delta), or an operational database.
4. Serve
Expose data through dashboards (Looker, Metabase), APIs, reverse ETL back into SaaS tools, or ML feature stores.
Data Lake vs Data Warehouse
| | Data Lake | Data Warehouse |
|---|---|---|
| Storage | Object store (S3, GCS) | Managed columnar store |
| Schema | Schema-on-read | Schema-on-write |
| Cost | Low (storage) | Higher (compute + storage) |
| Users | Data engineers, ML engineers | Analysts, BI |
| Formats | Parquet, Avro, ORC | Proprietary |
The data lakehouse pattern (Delta Lake, Apache Iceberg) merges both: cheap object storage with ACID transactions and schema enforcement.
Schema Evolution
Schemas change. Columns get added, types get widened, fields get deprecated. A production pipeline must handle this gracefully.
Strategies:
- Schema registry (Confluent Schema Registry) — validates schema compatibility before publishing
- Backward-compatible changes only — add columns with defaults, never remove or rename
- Versioned tables — events_v1, events_v2 with migration views
- Format support — Avro and Protobuf handle schema evolution natively; JSON does not
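A backward-compatible change in Avro might look like this illustrative schema: the optional `country` field is new and carries a default, so readers using the old schema and old records alike keep working.

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "country", "type": ["null", "string"], "default": null}
  ]
}
```

A schema registry configured for backward compatibility would accept this change but reject removing or renaming `id`.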
Exactly-Once Processing
The hardest guarantee in distributed systems. Three levels:
- At-most-once — fire and forget. Fast but lossy.
- At-least-once — retry on failure. Risk of duplicates.
- Exactly-once — each record processed once. Requires coordination.
Apache Kafka supports exactly-once semantics (EOS) via idempotent producers and transactions, with consumers reading at the read_committed isolation level. Flink achieves it through distributed snapshots (the Chandy-Lamport algorithm). In practice, many teams settle for at-least-once delivery plus idempotent writes (upserts keyed on a unique ID).
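The at-least-once-plus-idempotent-writes pattern is simple to demonstrate: because the write is an upsert keyed on a unique event id, replaying the same record is harmless. A sketch with sqlite3 (table and column names invented):

```python
import sqlite3

# At-least-once delivery + idempotent writes: duplicates are harmless
# because the write is an upsert keyed on a unique event id.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payments (event_id TEXT PRIMARY KEY, amount REAL)")

def write(event_id, amount):
    con.execute(
        "INSERT INTO payments VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET amount = excluded.amount",
        (event_id, amount),
    )

write("evt-1", 10.0)
write("evt-1", 10.0)  # duplicate delivery: no double-counting
write("evt-2", 4.5)

print(con.execute("SELECT COUNT(*), SUM(amount) FROM payments").fetchone())
# → (2, 14.5)
```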
Orchestration
Orchestrators schedule, retry, and monitor pipeline tasks.
Airflow DAG Example
ingest_api ──▶ stage_raw ──▶ transform_dbt ──▶ test_quality ──▶ publish
│ │
└── on_failure: alert ─────────┘
- Apache Airflow — the most widely adopted DAG-based orchestrator
- Dagster — asset-centric, strong typing, built-in lineage
- Prefect — Python-native, dynamic DAGs
- dbt Cloud — orchestrates SQL transformations specifically
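At their core, all of these tools resolve a dependency graph into an execution order and run each task with retries. A toy sketch of that idea, using the DAG from the diagram above and Python's standard-library graphlib (this is not how Airflow is implemented, just the underlying concept):

```python
# Toy sketch of what an orchestrator does: resolve a DAG into an
# execution order and run each task with a retry wrapper.
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on
dag = {
    "stage_raw": {"ingest_api"},
    "transform_dbt": {"stage_raw"},
    "test_quality": {"transform_dbt"},
    "publish": {"test_quality"},
}

def run(task, retries=2):
    for attempt in range(retries + 1):
        try:
            print(f"running {task}")
            return True        # real tasks would do work here
        except Exception:
            if attempt == retries:
                print(f"alert: {task} failed")  # on_failure hook
                return False

order = list(TopologicalSorter(dag).static_order())
for task in order:
    run(task)
print(order)  # upstream tasks always precede downstream ones
```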
Monitoring Data Quality
Bad data is worse than no data. Quality checks should run at every stage.
| Check | Description | Tool |
|---|---|---|
| Freshness | Data arrived on time | dbt source freshness, Monte Carlo |
| Volume | Row counts within expected range | Great Expectations, Soda |
| Schema | No unexpected columns or type changes | Schema registry |
| Uniqueness | Primary keys are not duplicated | dbt tests |
| Distribution | Values within statistical bounds | Monte Carlo, Anomalo |
| Referential integrity | Foreign keys resolve | SQL assertions |
Set up alerts on failures and circuit breakers that halt downstream jobs when upstream quality degrades.
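The checks in the table reduce to simple predicates over a batch, combined into a gate that blocks downstream jobs. A minimal sketch, with invented check names and thresholds:

```python
# Minimal sketch of stage-level quality checks with a circuit breaker:
# downstream jobs run only if every check on the new batch passes.

def check_volume(rows, lo=1, hi=1000):
    return lo <= len(rows) <= hi          # row count in expected range

def check_unique(rows, key):
    ids = [r[key] for r in rows]
    return len(ids) == len(set(ids))      # primary keys not duplicated

def check_schema(rows, expected):
    return all(set(r) == expected for r in rows)  # no surprise columns

def quality_gate(rows):
    checks = {
        "volume": check_volume(rows),
        "uniqueness": check_unique(rows, "id"),
        "schema": check_schema(rows, {"id", "amount"}),
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        print(f"alert: checks failed: {failed}")  # halt downstream jobs
        return False
    return True

batch = [{"id": "e1", "amount": 3.0}, {"id": "e2", "amount": 7.5}]
print(quality_gate(batch))  # → True
```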
Reference Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Data Pipeline Architecture │
│ │
│ Sources Ingestion Processing Serving │
│ ─────── ───────── ────────── ─────── │
│ APIs ──┐ │
│ Databases ──┼──▶ Kafka ──┬──▶ Flink (stream) ──▶ Dashboards │
│ Events ──┤ │ │
│ Files ──┘ Airbyte ──┼──▶ Spark (batch) ──▶ ML Models │
│ │ │
│ └──▶ dbt (SQL) ──▶ APIs │
│ │ │
│ ┌──────────────────┘ │
│ ▼ │
│ ┌───────────┐ ┌───────────┐ │
│ │ Data Lake │───▶│ Warehouse │ │
│ │ (Iceberg) │ │(Snowflake)│ │
│ └───────────┘ └───────────┘ │
│ │
│ Orchestration: Airflow Monitoring: Monte Carlo / dbt │
└─────────────────────────────────────────────────────────────────┘
Key Takeaways
- Start with ELT unless you have a strong reason for pre-load transformation.
- Use Kafka as the central nervous system for event-driven architectures.
- Design for idempotency — exactly-once is expensive; idempotent at-least-once is practical.
- Enforce schemas early with a registry to prevent garbage-in, garbage-out.
- Monitor data quality as aggressively as you monitor application uptime.
- Pick the right processing model — batch for throughput, stream for latency, hybrid for both.
Building data pipelines that scale? Codelit helps engineering teams design, review, and ship production-grade data infrastructure. Explore our system design guides and tools at codelit.io.