Data Pipeline Architecture — Batch, Streaming, and the Lambda Pattern
Data is useless in the wrong place
Your production database has the data. Your analytics team needs it in a warehouse. Your ML models need it as features. Your dashboards need it in real time.
A data pipeline moves data from source to destination, transforming it along the way.
The three processing models
Batch processing
Process large volumes of data on a schedule. Run every hour, day, or week.
Example: Every night at 2am, extract all new orders from PostgreSQL, transform them (calculate revenue, join with customer data), and load into the data warehouse.
Tools: Apache Spark, dbt, Airflow, AWS Glue.
When to use: Analytics, reporting, ML training data. Anywhere a few hours of delay is acceptable.
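The nightly job described above can be sketched in a few lines. This is a minimal illustration, not a production job: the in-memory lists stand in for the PostgreSQL source and warehouse, and the table shapes are invented for the example.

```python
from datetime import date

# Hypothetical stand-ins for the PostgreSQL source tables.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 40.0, "created": date(2024, 1, 2)},
    {"order_id": 2, "customer_id": 11, "amount": 25.0, "created": date(2024, 1, 1)},
]
customers = {10: {"name": "Ada"}, 11: {"name": "Grace"}}

def transform(orders, customers, since):
    """Extract new orders, join with customer data, compute revenue per customer."""
    revenue = {}
    for o in orders:
        if o["created"] >= since:  # "new orders" cutoff for the nightly run
            name = customers[o["customer_id"]]["name"]
            revenue[name] = revenue.get(name, 0.0) + o["amount"]
    return revenue

# Nightly run at 2am: only orders since the last run are processed,
# and the result would be loaded into the warehouse.
print(transform(orders, customers, date(2024, 1, 2)))  # {'Ada': 40.0}
```

The key batch property: the job processes a bounded set of records per run, so a failed run can simply be re-executed for the same window.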
Stream processing
Process data as it arrives, in real time or near real time.
Example: Every time a user clicks "buy," immediately update the recommendation model, fraud score, and inventory count.
Tools: Apache Kafka + Flink, Spark Streaming, AWS Kinesis.
When to use: Real-time dashboards, fraud detection, live notifications. Anywhere freshness matters more than throughput.
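The "buy" example above can be sketched as an event-at-a-time handler. The event shape and the one-line fraud rule are invented for illustration; in practice the events would come from a broker like Kafka and the fraud check would be a real model.

```python
inventory = {"sku-1": 5}
fraud_flags = []

def handle_purchase(event):
    """Process one event as it arrives: update inventory and fraud signals."""
    inventory[event["sku"]] -= event["qty"]
    if event["amount"] > 1000:  # toy rule standing in for a fraud model
        fraud_flags.append(event["user"])

# Events are processed one at a time, in arrival order, with no batching.
for e in [{"user": "u1", "sku": "sku-1", "qty": 2, "amount": 50},
          {"user": "u2", "sku": "sku-1", "qty": 1, "amount": 4200}]:
    handle_purchase(e)

print(inventory["sku-1"], fraud_flags)  # 2 ['u2']
```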
Lambda architecture
Run both batch and stream processing in parallel. The batch layer provides complete, accurate data; the stream layer provides fast, approximate data. Queries merge results from both layers.
When to use: When you need both real-time speed and batch accuracy. Common in analytics platforms.
The downside: Maintaining two codebases (batch + stream) that do similar things. The Kappa architecture simplifies this by using streaming for everything.
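The serving side of a Lambda architecture can be sketched as a merge of two views: a complete batch view computed up to the last run, plus an approximate speed view covering events since then. The page-count data here is invented for the example.

```python
# Batch view: complete, accurate counts up to the last batch run (e.g. midnight).
batch_view = {"page_a": 1000, "page_b": 400}

# Speed view: approximate counts for events that arrived since the batch cutoff.
speed_view = {"page_a": 12, "page_c": 3}

def query(page):
    """Serving layer: merge batch accuracy with streaming freshness."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # 1012
```

Each batch run replaces the batch view and resets the speed view, which is how streaming approximation errors get corrected over time.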
The ELT pattern
Modern data pipelines have shifted from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform):
ETL (old): Transform data before loading. Requires a dedicated ETL server.
ELT (modern): Load raw data first, transform inside the warehouse. The warehouse (Snowflake, BigQuery) has cheap compute for transformations.
Why ELT won: Warehouses got powerful enough to handle transformations. Loading raw data means you can always re-transform without re-extracting.
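The ELT flow can be sketched with SQLite standing in for the warehouse (the table names are invented; in practice the transform step is what a dbt model compiles to). Note the order: raw rows land first, the SQL transform runs afterwards, inside the warehouse.

```python
import sqlite3

# SQLite stands in for the warehouse (Snowflake, BigQuery) in this sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount REAL)")

# Load: raw rows land in the warehouse untransformed.
db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, "ada", 40.0), (2, "ada", 10.0), (3, "grace", 25.0)])

# Transform: runs inside the warehouse as SQL. Because raw_orders is kept,
# this table can be rebuilt at any time without re-extracting from the source.
db.execute("""
    CREATE TABLE revenue_by_customer AS
    SELECT customer, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY customer
""")
print(db.execute("SELECT * FROM revenue_by_customer ORDER BY customer").fetchall())
```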
Pipeline components
Sources
Where data comes from:
- Production databases (PostgreSQL, MySQL)
- SaaS APIs (Stripe, Salesforce, HubSpot)
- Event streams (Kafka, webhooks)
- Files (CSV, Parquet, JSON uploads)
Ingestion
Move data from source to destination:
- Fivetran/Airbyte: Managed connectors for 300+ sources
- Kafka Connect: Stream-based connectors
- Custom scripts: When nothing else fits
Transformation
Clean, model, and enrich the data:
- dbt: SQL-based transformations with testing and documentation
- Spark: For complex transformations on massive datasets
- Python/pandas: For data science workflows
Orchestration
Schedule and monitor the pipeline:
- Airflow: The standard for complex DAG scheduling
- Dagster: Modern alternative with better dev experience
- Prefect: Cloud-native orchestration
Common failure modes
Schema changes. The source table adds a column, your pipeline breaks. Fix: schema evolution handling and alerts on schema drift.
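A minimal schema-drift check might look like the following sketch: compare each incoming record's columns against the expected set and surface the difference instead of failing silently downstream. The column names are invented for the example.

```python
EXPECTED = {"order_id", "customer_id", "amount"}

def check_schema(row):
    """Report columns added or removed relative to the expected schema."""
    return {
        "added": sorted(set(row) - EXPECTED),    # new columns: alert, don't crash
        "missing": sorted(EXPECTED - set(row)),  # dropped columns: likely a breakage
    }

print(check_schema({"order_id": 1, "customer_id": 2, "amount": 3.0, "discount": 0.1}))
# {'added': ['discount'], 'missing': []}
```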
Late-arriving data. Events arrive out of order or delayed. Fix: watermarks and grace periods in stream processing.
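The watermark idea can be sketched as follows: a window accepts late events until the watermark (the highest arrival time seen) passes the window's end plus a grace period; after that, late events are dropped or routed to a side output. This is a simplified model of what Flink and similar engines do.

```python
GRACE = 5  # seconds of lateness tolerated after the window closes

def assign(events, window_end):
    """events: (arrival_time, event_time) pairs in arrival order.
    Accept events for the window until the watermark passes window_end + GRACE."""
    accepted, dropped = [], []
    watermark = 0
    for arrival, ts in events:
        watermark = max(watermark, arrival)  # watermark only moves forward
        if ts <= window_end and watermark <= window_end + GRACE:
            accepted.append(ts)
        elif ts <= window_end:
            dropped.append(ts)  # too late: the watermark has moved on
    return accepted, dropped

# An event with timestamp 9 arriving at t=12 still makes the [0, 10] window;
# one with timestamp 8 arriving at t=20 is past the grace period and dropped.
print(assign([(1, 1), (2, 2), (12, 9), (20, 8)], window_end=10))
# ([1, 2, 9], [8])
```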
Duplicate data. Network retries cause the same event to be processed twice. Fix: idempotent processing with deduplication keys.
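Idempotent processing can be sketched with a set of seen deduplication keys; the event shape is invented for the example, and a real pipeline would keep the seen-keys store in something durable like Redis or the warehouse itself.

```python
processed = set()
total = 0

def handle(event):
    """Idempotent handler: the same event_id is only applied once."""
    global total
    if event["event_id"] in processed:
        return False  # duplicate delivery (e.g. a network retry): skip
    processed.add(event["event_id"])
    total += event["amount"]
    return True

# e1 is delivered twice, but only counted once.
for e in [{"event_id": "e1", "amount": 10},
          {"event_id": "e2", "amount": 5},
          {"event_id": "e1", "amount": 10}]:
    handle(e)

print(total)  # 15
```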
Silent quality degradation. Data quality issues silently corrupt your warehouse. Fix: dbt tests, Great Expectations, or Monte Carlo for data observability.
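A plain-Python analogue of what dbt tests check might look like the sketch below: assert uniqueness, non-null, and value-range constraints on a table and report failures instead of letting bad rows flow downstream. The column names and rules are invented for the example.

```python
def run_checks(rows):
    """Data quality checks in the spirit of dbt tests: unique, not_null, range."""
    failures = []
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("order_id not unique")
    if any(r["amount"] is None for r in rows):
        failures.append("amount has nulls")
    if any(r["amount"] is not None and r["amount"] < 0 for r in rows):
        failures.append("amount out of range")
    return failures

rows = [{"order_id": 1, "amount": 40.0}, {"order_id": 1, "amount": -5.0}]
print(run_checks(rows))  # ['order_id not unique', 'amount out of range']
```

Running checks like these after every load turns silent corruption into a loud, actionable alert.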
See data pipelines in your architecture
On Codelit, search for "data warehouse" or "ML pipeline" in ⌘K to see complete data architectures — from source ingestion through transformation to serving layer.
Build your data pipeline: describe your data system on Codelit.io and see how data flows from sources to insights.