Data Pipeline Architecture — Batch, Streaming, and the Lambda Pattern
Data is useless in the wrong place
Your production database has the data. Your analytics team needs it in a warehouse. Your ML models need it as features. Your dashboards need it in real time.
A data pipeline moves data from source to destination, transforming it along the way.
The three processing models
Batch processing
Process large volumes of data on a schedule. Run every hour, day, or week.
Example: Every night at 2am, extract all new orders from PostgreSQL, transform them (calculate revenue, join with customer data), and load into the data warehouse.
Tools: Apache Spark, dbt, Airflow, AWS Glue.
When to use: Analytics, reporting, ML training data. Anywhere a few hours of delay is acceptable.
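The nightly job described above can be sketched in a few lines. This is a minimal illustration, not a production job: the in-memory lists stand in for the PostgreSQL source and warehouse, and the table shapes are invented for the example.

```python
from datetime import date

# Hypothetical stand-ins for the PostgreSQL source tables.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 40.0, "created": date(2024, 1, 2)},
    {"order_id": 2, "customer_id": 11, "amount": 25.0, "created": date(2024, 1, 1)},
]
customers = {10: {"name": "Ada"}, 11: {"name": "Grace"}}

def transform(orders, customers, since):
    """Extract new orders, join with customer data, compute revenue per customer."""
    revenue = {}
    for o in orders:
        if o["created"] >= since:  # "new orders" cutoff for the nightly run
            name = customers[o["customer_id"]]["name"]
            revenue[name] = revenue.get(name, 0.0) + o["amount"]
    return revenue

# Nightly run at 2am: only orders since the last run are processed,
# and the result would be loaded into the warehouse.
print(transform(orders, customers, date(2024, 1, 2)))  # {'Ada': 40.0}
```

The key batch property: the job processes a bounded set of records per run, so a failed run can simply be re-executed for the same window.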
Stream processing
Process data as it arrives, in real time or near real time.
Example: Every time a user clicks "buy," immediately update the recommendation model, fraud score, and inventory count.
Tools: Apache Kafka + Flink, Spark Streaming, AWS Kinesis.
When to use: Real-time dashboards, fraud detection, live notifications. Anywhere freshness matters more than throughput.
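The "buy" example above can be sketched as an event-at-a-time handler. The event shape and the one-line fraud rule are invented for illustration; in practice the events would come from a broker like Kafka and the fraud check would be a real model.

```python
inventory = {"sku-1": 5}
fraud_flags = []

def handle_purchase(event):
    """Process one event as it arrives: update inventory and fraud signals."""
    inventory[event["sku"]] -= event["qty"]
    if event["amount"] > 1000:  # toy rule standing in for a fraud model
        fraud_flags.append(event["user"])

# Events are processed one at a time, in arrival order, with no batching.
for e in [{"user": "u1", "sku": "sku-1", "qty": 2, "amount": 50},
          {"user": "u2", "sku": "sku-1", "qty": 1, "amount": 4200}]:
    handle_purchase(e)

print(inventory["sku-1"], fraud_flags)  # 2 ['u2']
```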
Lambda architecture
Run both batch and stream processing in parallel. The batch layer provides complete, accurate data; the stream layer provides fast, approximate data. Queries merge results from both layers.
When to use: When you need both real-time speed and batch accuracy. Common in analytics platforms.
The downside: Maintaining two codebases (batch + stream) that do similar things. The Kappa architecture simplifies this by using streaming for everything.
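The serving side of a Lambda architecture can be sketched as a merge of two views: a complete batch view computed up to the last run, plus an approximate speed view covering events since then. The page-count data here is invented for the example.

```python
# Batch view: complete, accurate counts up to the last batch run (e.g. midnight).
batch_view = {"page_a": 1000, "page_b": 400}

# Speed view: approximate counts for events that arrived since the batch cutoff.
speed_view = {"page_a": 12, "page_c": 3}

def query(page):
    """Serving layer: merge batch accuracy with streaming freshness."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query("page_a"))  # 1012
```

Each batch run replaces the batch view and resets the speed view, which is how streaming approximation errors get corrected over time.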
The ELT pattern
Modern data pipelines have shifted from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform):
ETL (old): Transform data before loading. Requires a dedicated ETL server.
ELT (modern): Load raw data first, transform inside the warehouse. The warehouse (Snowflake, BigQuery) has cheap compute for transformations.
Why ELT won: Warehouses got powerful enough to handle transformations. Loading raw data means you can always re-transform without re-extracting.
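The ELT flow can be sketched with SQLite standing in for the warehouse (the table names are invented; in practice the transform step is what a dbt model compiles to). Note the order: raw rows land first, the SQL transform runs afterwards, inside the warehouse.

```python
import sqlite3

# SQLite stands in for the warehouse (Snowflake, BigQuery) in this sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_orders (id INTEGER, customer TEXT, amount REAL)")

# Load: raw rows land in the warehouse untransformed.
db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, "ada", 40.0), (2, "ada", 10.0), (3, "grace", 25.0)])

# Transform: runs inside the warehouse as SQL. Because raw_orders is kept,
# this table can be rebuilt at any time without re-extracting from the source.
db.execute("""
    CREATE TABLE revenue_by_customer AS
    SELECT customer, SUM(amount) AS revenue
    FROM raw_orders
    GROUP BY customer
""")
print(db.execute("SELECT * FROM revenue_by_customer ORDER BY customer").fetchall())
```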
Pipeline components
Sources
Where data comes from:
- Production databases (PostgreSQL, MySQL)
- SaaS APIs (Stripe, Salesforce, HubSpot)
- Event streams (Kafka, webhooks)
- Files (CSV, Parquet, JSON uploads)
Ingestion
Move data from source to destination:
- Fivetran/Airbyte: Managed connectors for 300+ sources
- Kafka Connect: Stream-based connectors
- Custom scripts: When nothing else fits
Transformation
Clean, model, and enrich the data:
- dbt: SQL-based transformations with testing and documentation
- Spark: For complex transformations on massive datasets
- Python/pandas: For data science workflows
Orchestration
Schedule and monitor the pipeline:
- Airflow: The standard for complex DAG scheduling
- Dagster: Modern alternative with better dev experience
- Prefect: Cloud-native orchestration
Common failure modes
Schema changes. The source table adds a column, your pipeline breaks. Fix: schema evolution handling and alerts on schema drift.
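A minimal schema-drift check might look like the following sketch: compare each incoming record's columns against the expected set and surface the difference instead of failing silently downstream. The column names are invented for the example.

```python
EXPECTED = {"order_id", "customer_id", "amount"}

def check_schema(row):
    """Report columns added or removed relative to the expected schema."""
    return {
        "added": sorted(set(row) - EXPECTED),    # new columns: alert, don't crash
        "missing": sorted(EXPECTED - set(row)),  # dropped columns: likely a breakage
    }

print(check_schema({"order_id": 1, "customer_id": 2, "amount": 3.0, "discount": 0.1}))
# {'added': ['discount'], 'missing': []}
```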
Late-arriving data. Events arrive out of order or delayed. Fix: watermarks and grace periods in stream processing.
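The watermark idea can be sketched as follows: a window accepts late events until the watermark (the highest arrival time seen) passes the window's end plus a grace period; after that, late events are dropped or routed to a side output. This is a simplified model of what Flink and similar engines do.

```python
GRACE = 5  # seconds of lateness tolerated after the window closes

def assign(events, window_end):
    """events: (arrival_time, event_time) pairs in arrival order.
    Accept events for the window until the watermark passes window_end + GRACE."""
    accepted, dropped = [], []
    watermark = 0
    for arrival, ts in events:
        watermark = max(watermark, arrival)  # watermark only moves forward
        if ts <= window_end and watermark <= window_end + GRACE:
            accepted.append(ts)
        elif ts <= window_end:
            dropped.append(ts)  # too late: the watermark has moved on
    return accepted, dropped

# An event with timestamp 9 arriving at t=12 still makes the [0, 10] window;
# one with timestamp 8 arriving at t=20 is past the grace period and dropped.
print(assign([(1, 1), (2, 2), (12, 9), (20, 8)], window_end=10))
# ([1, 2, 9], [8])
```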
Duplicate data. Network retries cause the same event to be processed twice. Fix: idempotent processing with deduplication keys.
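Idempotent processing can be sketched with a set of seen deduplication keys; the event shape is invented for the example, and a real pipeline would keep the seen-keys store in something durable like Redis or the warehouse itself.

```python
processed = set()
total = 0

def handle(event):
    """Idempotent handler: the same event_id is only applied once."""
    global total
    if event["event_id"] in processed:
        return False  # duplicate delivery (e.g. a network retry): skip
    processed.add(event["event_id"])
    total += event["amount"]
    return True

# e1 is delivered twice, but only counted once.
for e in [{"event_id": "e1", "amount": 10},
          {"event_id": "e2", "amount": 5},
          {"event_id": "e1", "amount": 10}]:
    handle(e)

print(total)  # 15
```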
Silent quality degradation. Data quality issues silently corrupt your warehouse. Fix: dbt tests, Great Expectations, or Monte Carlo for data observability.
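A plain-Python analogue of what dbt tests check might look like the sketch below: assert uniqueness, non-null, and value-range constraints on a table and report failures instead of letting bad rows flow downstream. The column names and rules are invented for the example.

```python
def run_checks(rows):
    """Data quality checks in the spirit of dbt tests: unique, not_null, range."""
    failures = []
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("order_id not unique")
    if any(r["amount"] is None for r in rows):
        failures.append("amount has nulls")
    if any(r["amount"] is not None and r["amount"] < 0 for r in rows):
        failures.append("amount out of range")
    return failures

rows = [{"order_id": 1, "amount": 40.0}, {"order_id": 1, "amount": -5.0}]
print(run_checks(rows))  # ['order_id not unique', 'amount out of range']
```

Running checks like these after every load turns silent corruption into a loud, actionable alert.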
See data pipelines in your architecture
On Codelit, search for "data warehouse" or "ML pipeline" in ⌘K to see complete data architectures — from source ingestion through transformation to serving layer.
Build your data pipeline: describe your data system on Codelit.io and see how data flows from sources to insights.