Data Pipeline Architecture: A Complete Guide to Building Scalable ETL and Streaming Systems
Every modern company runs on data. Whether you are powering a recommendation engine, generating compliance reports, or feeding a real-time dashboard, you need a data pipeline architecture that moves data reliably from source to destination while transforming it along the way.
This guide covers the core patterns, trade-offs, and tools you need to design production-grade pipelines.
What Is a Data Pipeline?
A data pipeline is an automated workflow that ingests raw data from one or more sources, transforms it into a useful shape, loads it into a target store, and serves it to downstream consumers.
┌──────────┐ ┌─────────────┐ ┌──────────┐ ┌──────────┐
│ Ingest │───▶│ Transform │───▶│ Load │───▶│ Serve │
│ (sources) │ │ (clean/join)│ │ (storage)│ │ (query) │
└──────────┘ └─────────────┘ └──────────┘ └──────────┘
Each stage can fail independently, so a well-designed pipeline treats observability, idempotency, and schema management as first-class concerns.
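The four stages can be sketched as composable functions. This is a toy illustration, not any specific framework's API; all names here are made up.

```python
# Minimal sketch of the four pipeline stages as composable functions.
# All names are illustrative, not from any specific framework.

def ingest(source):
    """Pull raw records from a source (here, an in-memory list)."""
    return list(source)

def transform(records):
    """Clean and reshape: drop malformed rows, normalize fields."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in records
        if "user" in r and "amount" in r
    ]

def load(records, store):
    """Write transformed records into a target store (a dict keyed by user)."""
    for r in records:
        store[r["user"]] = store.get(r["user"], 0.0) + r["amount"]
    return store

def serve(store, user):
    """Answer a downstream query from the loaded store."""
    return store.get(user, 0.0)

raw = [{"user": " Alice ", "amount": "9.50"},
       {"user": "bob", "amount": "3"},
       {"bad": "row"}]  # malformed record, dropped by transform
warehouse = load(transform(ingest(raw)), {})
print(serve(warehouse, "alice"))  # → 9.5
```

Note that each stage takes the previous stage's output as input, which is exactly what makes the stages independently testable and retryable.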
Batch Processing vs Stream Processing
Batch Processing
Batch pipelines process data in discrete chunks — hourly, daily, or on-demand. They are simpler to reason about and easier to retry.
| Trait | Batch |
|---|---|
| Latency | Minutes to hours |
| Complexity | Lower |
| Tools | Spark, Airflow, dbt |
| Use case | Reports, ML training, backfills |
Stream Processing
Stream pipelines process events continuously as they arrive. They power real-time analytics, fraud detection, and live dashboards.
| Trait | Stream |
|---|---|
| Latency | Milliseconds to seconds |
| Complexity | Higher |
| Tools | Kafka, Flink, Spark Streaming |
| Use case | Alerting, CDC, real-time features |
Hybrid (Lambda & Kappa)
Many production systems combine both. The Lambda architecture runs a batch layer for correctness alongside a speed layer for low latency. The Kappa architecture simplifies this by treating everything as a stream and replaying the log when corrections are needed.
Lambda Architecture
┌─────── Batch Layer (Spark/Airflow) ──────┐
Source ──▶│ │──▶ Serving Layer
└─────── Speed Layer (Kafka/Flink) ────────┘
Kappa Architecture
Source ──▶ Stream Layer (Kafka + Flink) ──▶ Serving Layer
▲ │
└── replay ──────┘
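Kappa's central idea — state is always derivable by reprocessing the append-only log from the beginning — can be shown with a toy in-memory log. The log here is a plain list standing in for a Kafka topic; all names are illustrative.

```python
# Illustrative sketch of Kappa-style replay: state is rebuilt at any time
# by reprocessing the append-only event log from a chosen offset.

log = []  # append-only event log (stand-in for a Kafka topic)

def append(event):
    log.append(event)

def replay(process, from_offset=0):
    """Rebuild state by reprocessing every event from a given offset."""
    state = {}
    for event in log[from_offset:]:
        process(state, event)
    return state

def count_by_type(state, event):
    state[event["type"]] = state.get(event["type"], 0) + 1

append({"type": "click"})
append({"type": "view"})
append({"type": "click"})

# Shipped a bug fix or a new transform? Replay the log with the new logic.
print(replay(count_by_type))  # → {'click': 2, 'view': 1}
```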
ETL vs ELT
ETL (Extract, Transform, Load) transforms data before loading it into the warehouse. This was the standard when storage was expensive and compute happened on-prem.
ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse using SQL. Cloud warehouses like Snowflake and BigQuery make this cost-effective because compute scales independently from storage.
ETL: Source ──▶ Transform (Spark) ──▶ Load (Warehouse)
ELT: Source ──▶ Load (Warehouse) ──▶ Transform (dbt/SQL)
Modern stacks overwhelmingly favor ELT because it preserves raw data, simplifies ingestion, and lets analysts iterate on transformations without re-running the entire pipeline.
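The ELT flow — land raw data untouched, then transform with SQL inside the warehouse — can be sketched with sqlite3 standing in for Snowflake or BigQuery. The table and view names are invented for the example.

```python
import sqlite3

# ELT sketch: load raw data as-is, then transform with SQL inside the
# "warehouse" (sqlite3 stands in for Snowflake/BigQuery here).
con = sqlite3.connect(":memory:")

# Load: raw events land untouched, no pre-load transformation.
con.execute("CREATE TABLE raw_events (user_id TEXT, amount TEXT)")
con.executemany("INSERT INTO raw_events VALUES (?, ?)",
                [("u1", "10.0"), ("u1", "5.5"), ("u2", "bad")])

# Transform: a dbt-style SQL model layered on top of the raw table.
# Because raw_events is preserved, the view can be redefined and
# re-run at any time without re-ingesting anything.
con.execute("""
    CREATE VIEW revenue_by_user AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS revenue
    FROM raw_events
    WHERE amount GLOB '[0-9]*'   -- filter unparseable rows in SQL
    GROUP BY user_id
""")
print(con.execute("SELECT * FROM revenue_by_user").fetchall())  # → [('u1', 15.5)]
```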
Pipeline Stages in Detail
1. Ingest
Pull data from APIs, databases via change data capture (CDC), event streams, or file drops. Tools: Kafka, Debezium, Fivetran, Airbyte.
Key concern: delivery guarantees. Kafka can provide exactly-once semantics through idempotent producers and transactions, with consumers reading at the read_committed isolation level. Without such guarantees, retries introduce duplicates and failures can drop records.
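As a rough sketch of the producer side, using the confluent-kafka Python client (this assumes a running broker and is shown for illustration only; the topic name and transactional id are invented):

```python
# Sketch of exactly-once producer settings with confluent-kafka.
# Not runnable as-is: assumes a broker at localhost:9092.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "enable.idempotence": True,       # broker de-duplicates retried sends
    "acks": "all",                    # required for idempotence
    "transactional.id": "orders-etl"  # enables atomic, transactional writes
})

producer.init_transactions()
producer.begin_transaction()
producer.produce("orders", key="order-42", value=b'{"amount": 99.0}')
producer.commit_transaction()
```

Downstream consumers then set `isolation.level` to `read_committed` so that aborted transactions are never observed.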
2. Transform
Clean, deduplicate, join, and enrich. This is where business logic lives.
- Batch transforms: dbt models, Spark jobs, SQL scripts in Airflow
- Stream transforms: Flink SQL, Kafka Streams, Spark Structured Streaming
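A typical transform combines the operations above. Here is a small, framework-free sketch of deduplication plus an enrichment join against a lookup table; all field names are invented.

```python
# Sketch of a transform step: deduplicate by event id, then enrich
# each event by joining against a user lookup table.

events = [
    {"id": "e1", "user_id": "u1", "amount": 20.0},
    {"id": "e1", "user_id": "u1", "amount": 20.0},  # duplicate delivery
    {"id": "e2", "user_id": "u2", "amount": 5.0},
]
users = {"u1": {"country": "DE"}, "u2": {"country": "US"}}

def transform(events, users):
    seen, out = set(), []
    for e in events:
        if e["id"] in seen:          # deduplicate on the event id
            continue
        seen.add(e["id"])
        user = users.get(e["user_id"], {})
        out.append({**e, "country": user.get("country")})  # enrich (join)
    return out

print(transform(events, users))  # 2 rows: the duplicate is dropped
```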
3. Load
Write the transformed data to a target: a data warehouse (Snowflake, BigQuery, Redshift), a data lake (S3 + Iceberg/Delta), or an operational database.
4. Serve
Expose data through dashboards (Looker, Metabase), APIs, reverse ETL back into SaaS tools, or ML feature stores.
Data Lake vs Data Warehouse
| | Data Lake | Data Warehouse |
|---|---|---|
| Storage | Object store (S3, GCS) | Managed columnar store |
| Schema | Schema-on-read | Schema-on-write |
| Cost | Low (storage) | Higher (compute + storage) |
| Users | Data engineers, ML engineers | Analysts, BI |
| Formats | Parquet, Avro, ORC | Proprietary |
The data lakehouse pattern (Delta Lake, Apache Iceberg) merges both: cheap object storage with ACID transactions and schema enforcement.
Schema Evolution
Schemas change. Columns get added, types get widened, fields get deprecated. A production pipeline must handle this gracefully.
Strategies:
- Schema registry (Confluent Schema Registry) — validates schema compatibility before publishing
- Backward-compatible changes only — add columns with defaults, never remove or rename
- Versioned tables — events_v1, events_v2 with migration views
- Format support — Avro and Protobuf handle schema evolution natively; JSON does not
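A backward-compatible change in Avro might look like this illustrative schema: the optional `country` field is new and carries a default, so readers using the old schema and old records alike keep working.

```json
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "user_id", "type": "string"},
    {"name": "country", "type": ["null", "string"], "default": null}
  ]
}
```

A schema registry configured for backward compatibility would accept this change but reject removing or renaming `id`.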
Exactly-Once Processing
The hardest guarantee in distributed systems. Three levels:
- At-most-once — fire and forget. Fast but lossy.
- At-least-once — retry on failure. Risk of duplicates.
- Exactly-once — each record processed once. Requires coordination.
Apache Kafka supports exactly-once semantics (EOS) via idempotent producers and transactions, with consumers reading at the read_committed isolation level. Flink achieves it through distributed snapshots (the Chandy-Lamport algorithm). In practice, many teams settle for at-least-once delivery plus idempotent writes (upserts keyed on a unique ID).
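The at-least-once-plus-idempotent-writes pattern is simple to demonstrate: because the write is an upsert keyed on a unique event id, replaying the same record is harmless. A sketch with sqlite3 (table and column names invented):

```python
import sqlite3

# At-least-once delivery + idempotent writes: duplicates are harmless
# because the write is an upsert keyed on a unique event id.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE payments (event_id TEXT PRIMARY KEY, amount REAL)")

def write(event_id, amount):
    con.execute(
        "INSERT INTO payments VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET amount = excluded.amount",
        (event_id, amount),
    )

write("evt-1", 10.0)
write("evt-1", 10.0)  # duplicate delivery: no double-counting
write("evt-2", 4.5)

print(con.execute("SELECT COUNT(*), SUM(amount) FROM payments").fetchone())
# → (2, 14.5)
```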
Orchestration
Orchestrators schedule, retry, and monitor pipeline tasks.
Airflow DAG Example
ingest_api ──▶ stage_raw ──▶ transform_dbt ──▶ test_quality ──▶ publish
│ │
└── on_failure: alert ─────────┘
- Apache Airflow — the most widely adopted DAG-based orchestrator
- Dagster — asset-centric, strong typing, built-in lineage
- Prefect — Python-native, dynamic DAGs
- dbt Cloud — orchestrates SQL transformations specifically
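At their core, all of these tools resolve a dependency graph into an execution order and run each task with retries. A toy sketch of that idea, using the DAG from the diagram above and Python's standard-library graphlib (this is not how Airflow is implemented, just the underlying concept):

```python
# Toy sketch of what an orchestrator does: resolve a DAG into an
# execution order and run each task with a retry wrapper.
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on
dag = {
    "stage_raw": {"ingest_api"},
    "transform_dbt": {"stage_raw"},
    "test_quality": {"transform_dbt"},
    "publish": {"test_quality"},
}

def run(task, retries=2):
    for attempt in range(retries + 1):
        try:
            print(f"running {task}")
            return True        # real tasks would do work here
        except Exception:
            if attempt == retries:
                print(f"alert: {task} failed")  # on_failure hook
                return False

order = list(TopologicalSorter(dag).static_order())
for task in order:
    run(task)
print(order)  # upstream tasks always precede downstream ones
```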
Monitoring Data Quality
Bad data is worse than no data. Quality checks should run at every stage.
| Check | Description | Tool |
|---|---|---|
| Freshness | Data arrived on time | dbt source freshness, Monte Carlo |
| Volume | Row counts within expected range | Great Expectations, Soda |
| Schema | No unexpected columns or type changes | Schema registry |
| Uniqueness | Primary keys are not duplicated | dbt tests |
| Distribution | Values within statistical bounds | Monte Carlo, Anomalo |
| Referential integrity | Foreign keys resolve | SQL assertions |
Set up alerts on failures and circuit breakers that halt downstream jobs when upstream quality degrades.
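The checks in the table reduce to simple predicates over a batch, combined into a gate that blocks downstream jobs. A minimal sketch, with invented check names and thresholds:

```python
# Minimal sketch of stage-level quality checks with a circuit breaker:
# downstream jobs run only if every check on the new batch passes.

def check_volume(rows, lo=1, hi=1000):
    return lo <= len(rows) <= hi          # row count in expected range

def check_unique(rows, key):
    ids = [r[key] for r in rows]
    return len(ids) == len(set(ids))      # primary keys not duplicated

def check_schema(rows, expected):
    return all(set(r) == expected for r in rows)  # no surprise columns

def quality_gate(rows):
    checks = {
        "volume": check_volume(rows),
        "uniqueness": check_unique(rows, "id"),
        "schema": check_schema(rows, {"id", "amount"}),
    }
    failed = [name for name, ok in checks.items() if not ok]
    if failed:
        print(f"alert: checks failed: {failed}")  # halt downstream jobs
        return False
    return True

batch = [{"id": "e1", "amount": 3.0}, {"id": "e2", "amount": 7.5}]
print(quality_gate(batch))  # → True
```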
Reference Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Data Pipeline Architecture │
│ │
│ Sources Ingestion Processing Serving │
│ ─────── ───────── ────────── ─────── │
│ APIs ──┐ │
│ Databases ──┼──▶ Kafka ──┬──▶ Flink (stream) ──▶ Dashboards │
│ Events ──┤ │ │
│ Files ──┘ Airbyte ──┼──▶ Spark (batch) ──▶ ML Models │
│ │ │
│ └──▶ dbt (SQL) ──▶ APIs │
│ │ │
│ ┌──────────────────┘ │
│ ▼ │
│ ┌───────────┐ ┌───────────┐ │
│ │ Data Lake │───▶│ Warehouse │ │
│ │ (Iceberg) │ │(Snowflake)│ │
│ └───────────┘ └───────────┘ │
│ │
│ Orchestration: Airflow Monitoring: Monte Carlo / dbt │
└─────────────────────────────────────────────────────────────────┘
Key Takeaways
- Start with ELT unless you have a strong reason for pre-load transformation.
- Use Kafka as the central nervous system for event-driven architectures.
- Design for idempotency — exactly-once is expensive; idempotent at-least-once is practical.
- Enforce schemas early with a registry to prevent garbage-in, garbage-out.
- Monitor data quality as aggressively as you monitor application uptime.
- Pick the right processing model — batch for throughput, stream for latency, hybrid for both.
Building data pipelines that scale? Codelit helps engineering teams design, review, and ship production-grade data infrastructure. Explore our system design guides and tools at codelit.io.