Analytics Pipeline Architecture: From Event Collection to Real-Time Dashboards
Every modern product needs analytics. Click-streams, purchase events, API latency metrics — the volume grows daily, and the questions stakeholders ask grow more complex. An analytics pipeline architecture turns that firehose of raw events into queryable, trustworthy data that powers dashboards, ML models, and business decisions.
High-Level Anatomy
An analytics pipeline moves data through four stages:
```
Producers ──▶ Ingestion ──▶ Processing ──▶ Storage / Serving
(apps,        (Kafka,       (Spark,        (ClickHouse,
 SDKs,         Kinesis)      Flink)         BigQuery,
 APIs)                                      dashboards)
```
Each stage can be scaled, replaced, or extended independently. That modularity is the defining strength of a well-designed pipeline.
Stage 1 — Event Collection
Events originate from web and mobile clients, backend services, and infrastructure. Key design decisions include:
- Schema definition — Use a schema registry (Confluent Schema Registry, Protobuf definitions) so every producer and consumer agrees on event shape.
- Client-side SDKs — Buffer events locally and flush in batches to reduce network overhead. Segment, Rudderstack, and Snowplow all follow this pattern.
- Server-side enrichment — Attach server-known context (user plan, geo-IP, feature flags) before the event enters the pipeline.
- Identifier stitching — Map anonymous IDs to authenticated user IDs as early as possible.
A typical event envelope looks like this:
```json
{
  "event": "page_viewed",
  "timestamp": "2026-03-28T14:22:01Z",
  "user_id": "u_8a3f",
  "anonymous_id": "anon_c91b",
  "properties": {
    "page": "/pricing",
    "referrer": "https://google.com",
    "experiment_variant": "B"
  },
  "context": {
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0 ..."
  }
}
```
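The client-side batching pattern from the SDK bullet above can be sketched in a few lines of Python. This is an illustrative sketch, not any real SDK's API: `EventBuffer`, `transport`, and `flush_size` are hypothetical names, and a production SDK would also flush on a timer and on app shutdown.

```python
import json
import time

class EventBuffer:
    """Buffers events locally and flushes them in batches (illustrative sketch)."""

    def __init__(self, transport, flush_size=20):
        self.transport = transport      # callable that ships one serialized batch upstream
        self.flush_size = flush_size
        self.buffer = []

    def track(self, event, properties=None):
        self.buffer.append({
            "event": event,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "properties": properties or {},
        })
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.transport(json.dumps(self.buffer))  # one network call per batch
            self.buffer = []

# Usage: two tracked events become a single upstream call, not two
batches = []
buf = EventBuffer(transport=batches.append, flush_size=2)
buf.track("page_viewed", {"page": "/pricing"})
buf.track("button_clicked")   # reaching flush_size triggers a flush
```

The point of the pattern is amortization: network overhead is paid once per batch rather than once per event.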
Stage 2 — Ingestion
Ingestion decouples producers from consumers. The two dominant choices are Apache Kafka and Amazon Kinesis.
Apache Kafka
- Partitioned, replicated commit log.
- Consumers track their own offsets — replay is trivial.
- Ecosystem: Kafka Connect for CDC, Schema Registry for evolution, ksqlDB for lightweight stream processing.
- Operational complexity is real; managed offerings (Confluent Cloud, Amazon MSK) reduce it.
Amazon Kinesis Data Streams
- Fully managed, pay-per-shard.
- Tight integration with AWS Lambda, Firehose, and Redshift.
- Retention up to 365 days.
- Shard splitting and merging require planning under bursty workloads.
Choosing between them: If you are all-in on AWS and want minimal ops, Kinesis is pragmatic. If you need multi-cloud, replay flexibility, or a massive ecosystem, Kafka wins.
Both support at-least-once delivery. Design downstream consumers to be idempotent.
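Idempotency under at-least-once delivery usually means deduplicating on a stable event ID. A minimal sketch, assuming an in-memory set for illustration (a production consumer would back this with a persistent state store such as RocksDB or Redis):

```python
class IdempotentConsumer:
    """Skips events whose ID was already processed (at-least-once → effectively-once)."""

    def __init__(self):
        self.seen = set()       # production: persistent, state-store backed
        self.processed = []

    def handle(self, event):
        event_id = event["event_id"]
        if event_id in self.seen:
            return False        # duplicate delivery from the broker: ignore
        self.seen.add(event_id)
        self.processed.append(event)
        return True

consumer = IdempotentConsumer()
consumer.handle({"event_id": "e1", "event": "page_viewed"})
consumer.handle({"event_id": "e1", "event": "page_viewed"})   # redelivered duplicate
consumer.handle({"event_id": "e2", "event": "checkout_started"})
# consumer.processed now holds exactly two events
```

Note the dedup key must be assigned by the producer; a broker-assigned offset changes on redelivery and cannot serve this role.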
Stage 3 — Processing
Processing transforms raw events into analytics-ready datasets. The choice between batch and stream processing — or both — shapes the entire architecture.
Batch Processing (Apache Spark)
- Reads a bounded dataset (e.g., one hour of events), transforms it, and writes the result.
- Great for backfills, aggregations, and ML feature engineering.
- Latency: minutes to hours.
Stream Processing (Apache Flink)
- Processes events continuously with sub-second latency.
- Supports event-time windowing, watermarks, and exactly-once semantics via checkpointing.
- Ideal for real-time dashboards, alerting, and fraud detection.
Lambda vs Kappa Architecture
| Aspect | Lambda | Kappa |
|---|---|---|
| Paths | Batch + speed layer | Single stream layer |
| Complexity | Two codebases to maintain | One codebase, replay for reprocessing |
| Reprocessing | Re-run batch job | Replay from ingestion log |
| When to pick | Need heavy batch aggregations | Stream processing handles all workloads |
Many teams start with Kappa (Flink reading from Kafka) and add a batch path only when required for cost or complexity reasons.
Common Transformations
- Deduplication — Remove duplicate events using event ID and a state store.
- Sessionization — Group events into sessions using gap-based windowing (e.g., 30-minute inactivity timeout).
- Enrichment — Join events with dimension tables (user profiles, product catalog).
- Aggregation — Pre-compute metrics (DAU, conversion funnels, p99 latency) to speed up queries.
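Gap-based sessionization from the list above can be sketched without any framework: sort a user's events by timestamp and start a new session whenever the gap exceeds the inactivity timeout. Timestamps are plain epoch seconds here for brevity; a real Flink job would use event-time session windows instead.

```python
SESSION_GAP = 30 * 60  # 30-minute inactivity timeout, in seconds

def sessionize(timestamps, gap=SESSION_GAP):
    """Group one user's event timestamps into sessions separated by > gap seconds."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # within the gap: continue current session
        else:
            sessions.append([ts])     # gap exceeded (or first event): new session
    return sessions

# Three events close together, then a 45-minute gap, then one more event
events = [0, 60, 300, 300 + 45 * 60]
print(sessionize(events))  # → [[0, 60, 300], [3000]]
```

The same gap rule is what Flink's session windows implement, with the added machinery of watermarks to handle late and out-of-order events.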
Stage 4 — Storage and Serving
Processed data lands in an analytical store optimized for fast reads over large datasets.
ClickHouse
- Open-source, column-oriented OLAP database.
- Blazing fast for time-series and aggregation queries.
- MergeTree engine handles inserts and background compaction.
- Scales horizontally with sharding and replication.
Google BigQuery
- Serverless, pay-per-query.
- Separates storage and compute — scan terabytes without provisioning.
- Streaming inserts for near-real-time availability.
- Native ML (BigQuery ML) for in-warehouse model training.
Other options: Apache Druid for sub-second slice-and-dice, Apache Pinot for user-facing analytics, Snowflake for general-purpose warehousing.
Storage Tiers
```
Hot  ──▶ ClickHouse / Druid     (recent data, fast queries)
Warm ──▶ BigQuery / Snowflake   (months of data, moderate latency)
Cold ──▶ S3 / GCS in Parquet    (years of data, cheap storage)
```
Partition by date and use TTL policies to age data between tiers automatically.
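The tiering rule can be as simple as routing each date partition by its age. A minimal sketch; the 7-day and 90-day cut-offs below are made-up thresholds for illustration, not recommendations:

```python
from datetime import date

def tier_for(partition_date, today, hot_days=7, warm_days=90):
    """Route a date partition to a storage tier by age (thresholds are illustrative)."""
    age = (today - partition_date).days
    if age <= hot_days:
        return "hot"    # ClickHouse / Druid
    if age <= warm_days:
        return "warm"   # BigQuery / Snowflake
    return "cold"       # S3 / GCS in Parquet

today = date(2026, 3, 28)
print(tier_for(date(2026, 3, 25), today))   # → hot
print(tier_for(date(2026, 1, 10), today))   # → warm
print(tier_for(date(2024, 3, 28), today))   # → cold
```

In practice you rarely run this logic yourself: ClickHouse `TTL` clauses and warehouse partition-expiration settings perform the aging automatically once configured.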
Real-Time vs Batch Analytics
The distinction matters for SLA and cost:
- Real-time (seconds): Use Flink writing to ClickHouse or Druid. Supports live dashboards, anomaly detection, and operational monitoring.
- Near-real-time (minutes): Use Spark Structured Streaming or Flink with larger windows. Suitable for most product analytics.
- Batch (hours): Use Spark or dbt running on a schedule. Best for historical reports, cohort analysis, and cost-sensitive workloads.
Most organizations need a blend. Operational metrics demand real-time; marketing attribution can wait for a nightly batch.
Dashboards and Visualization
The serving layer feeds dashboards and ad-hoc exploration tools:
- Grafana — Open-source, strong ClickHouse and Prometheus integration. Ideal for operational dashboards.
- Metabase — Open-source BI with self-service exploration. Connects to BigQuery, Postgres, ClickHouse.
- Looker / Tableau — Enterprise BI with governed semantic layers.
- Custom dashboards — React + a charting library (Recharts, D3) querying the analytical store directly.
Design dashboards around questions, not data. A dashboard titled "Conversion Funnel" that answers "where do users drop off?" is more useful than a generic "Events" dashboard.
Data Quality
A pipeline is only as valuable as the trust people place in its output. Data quality requires active enforcement:
- Schema validation — Reject or quarantine events that do not match the schema registry.
- Freshness monitoring — Alert when a pipeline stage has not produced output within the expected window.
- Volume anomaly detection — Track event counts per type. A sudden 80% drop in `page_viewed` events signals a broken SDK, not a traffic decline.
- Semantic tests — Assert business invariants: `order_completed` count should never exceed `checkout_started` count.
- Data contracts — Producers commit to a schema and SLA; breaking changes require a version bump and migration plan.
Tools like Great Expectations, dbt tests, Monte Carlo, and Soda Core automate these checks.
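The volume-anomaly and semantic checks above reduce to plain assertions over daily counts. A minimal sketch (function names, thresholds, and the sample counts are invented for illustration); a dedicated tool like Great Expectations or dbt tests adds scheduling, history, and alerting on top of checks like these:

```python
def volume_anomaly(today_count, baseline_count, max_drop=0.5):
    """Flag an event type whose volume fell more than max_drop vs its baseline."""
    if baseline_count == 0:
        return False                    # no baseline to compare against
    drop = (baseline_count - today_count) / baseline_count
    return drop > max_drop

def funnel_invariant(counts):
    """Business invariant: completions can never exceed starts."""
    return counts["order_completed"] <= counts["checkout_started"]

counts = {"checkout_started": 1200, "order_completed": 180}
print(funnel_invariant(counts))                                # → True
print(volume_anomaly(today_count=2000, baseline_count=10000))  # 80% drop → True
print(volume_anomaly(today_count=9000, baseline_count=10000))  # 10% drop → False
```

Checks like these run after each pipeline stage; a failure should quarantine the batch and page an owner rather than silently publishing bad numbers to dashboards.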
Putting It All Together
A production-grade analytics pipeline architecture for a mid-size SaaS product might look like this:
```
Client SDKs ──▶ API Gateway ──▶ Kafka (3 brokers, Schema Registry)
                                  │
                     ┌────────────┴────────────┐
                     ▼                         ▼
            Flink (real-time)          Spark (hourly batch)
                     │                         │
                     ▼                         ▼
                ClickHouse              BigQuery (warehouse)
                     │                         │
                     ▼                         ▼
          Grafana dashboards            Metabase / Looker
```
Key takeaways:
- Decouple collection, ingestion, processing, and storage so each layer can evolve independently.
- Use a schema registry from day one — retrofitting schemas onto unstructured events is painful.
- Start with stream processing (Flink or Spark Structured Streaming) and add batch only when economics or complexity demand it.
- Invest in data quality early. A fast pipeline that produces wrong numbers erodes trust faster than a slow pipeline that produces correct ones.
Build and explore system design concepts hands-on at codelit.io.