Analytics Pipeline Architecture: From Event Collection to Real-Time Dashboards
Every modern product needs analytics. Click-streams, purchase events, API latency metrics — the volume grows daily, and the questions stakeholders ask grow more complex. An analytics pipeline architecture turns that firehose of raw events into queryable, trustworthy data that powers dashboards, ML models, and business decisions.
High-Level Anatomy
An analytics pipeline moves data through four stages:
```
Producers ──▶ Ingestion ──▶ Processing ──▶ Storage / Serving
(apps,        (Kafka,       (Spark,        (ClickHouse,
 SDKs,         Kinesis)      Flink)         BigQuery,
 APIs)                                      dashboards)
```
Each stage can be scaled, replaced, or extended independently. That modularity is the defining strength of a well-designed pipeline.
Stage 1 — Event Collection
Events originate from web and mobile clients, backend services, and infrastructure. Key design decisions include:
- Schema definition — Use a schema registry (Confluent Schema Registry, Protobuf definitions) so every producer and consumer agrees on event shape.
- Client-side SDKs — Buffer events locally and flush in batches to reduce network overhead. Segment, Rudderstack, and Snowplow all follow this pattern.
- Server-side enrichment — Attach server-known context (user plan, geo-IP, feature flags) before the event enters the pipeline.
- Identifier stitching — Map anonymous IDs to authenticated user IDs as early as possible.
A typical event envelope looks like this:
```json
{
  "event": "page_viewed",
  "timestamp": "2026-03-28T14:22:01Z",
  "user_id": "u_8a3f",
  "anonymous_id": "anon_c91b",
  "properties": {
    "page": "/pricing",
    "referrer": "https://google.com",
    "experiment_variant": "B"
  },
  "context": {
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0 ..."
  }
}
```
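The client-side batching pattern from the SDK bullet above can be sketched in a few lines of Python. This is an illustrative sketch, not any real SDK's API: `EventBuffer`, `transport`, and `flush_size` are hypothetical names, and a production SDK would also flush on a timer and on app shutdown.

```python
import json
import time

class EventBuffer:
    """Buffers events locally and flushes them in batches (illustrative sketch)."""

    def __init__(self, transport, flush_size=20):
        self.transport = transport      # callable that ships one serialized batch upstream
        self.flush_size = flush_size
        self.buffer = []

    def track(self, event, properties=None):
        self.buffer.append({
            "event": event,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "properties": properties or {},
        })
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.transport(json.dumps(self.buffer))  # one network call per batch
            self.buffer = []

# Usage: two tracked events become a single upstream call, not two
batches = []
buf = EventBuffer(transport=batches.append, flush_size=2)
buf.track("page_viewed", {"page": "/pricing"})
buf.track("button_clicked")   # reaching flush_size triggers a flush
```

The point of the pattern is amortization: network overhead is paid once per batch rather than once per event.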
Stage 2 — Ingestion
Ingestion decouples producers from consumers. The two dominant choices are Apache Kafka and Amazon Kinesis.
Apache Kafka
- Partitioned, replicated commit log.
- Consumers track their own offsets — replay is trivial.
- Ecosystem: Kafka Connect for CDC, Schema Registry for evolution, ksqlDB for lightweight stream processing.
- Operational complexity is real; managed offerings (Confluent Cloud, Amazon MSK) reduce it.
Amazon Kinesis Data Streams
- Fully managed, pay-per-shard.
- Tight integration with AWS Lambda, Firehose, and Redshift.
- Retention up to 365 days.
- Shard splitting and merging require planning under bursty workloads.
Choosing between them: If you are all-in on AWS and want minimal ops, Kinesis is pragmatic. If you need multi-cloud, replay flexibility, or a massive ecosystem, Kafka wins.
Both support at-least-once delivery. Design downstream consumers to be idempotent.
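Idempotency under at-least-once delivery usually means deduplicating on a stable event ID. A minimal sketch, assuming an in-memory set for illustration (a production consumer would back this with a persistent state store such as RocksDB or Redis):

```python
class IdempotentConsumer:
    """Skips events whose ID was already processed (at-least-once → effectively-once)."""

    def __init__(self):
        self.seen = set()       # production: persistent, state-store backed
        self.processed = []

    def handle(self, event):
        event_id = event["event_id"]
        if event_id in self.seen:
            return False        # duplicate delivery from the broker: ignore
        self.seen.add(event_id)
        self.processed.append(event)
        return True

consumer = IdempotentConsumer()
consumer.handle({"event_id": "e1", "event": "page_viewed"})
consumer.handle({"event_id": "e1", "event": "page_viewed"})   # redelivered duplicate
consumer.handle({"event_id": "e2", "event": "checkout_started"})
# consumer.processed now holds exactly two events
```

Note the dedup key must be assigned by the producer; a broker-assigned offset changes on redelivery and cannot serve this role.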
Stage 3 — Processing
Processing transforms raw events into analytics-ready datasets. The choice between batch and stream processing — or both — shapes the entire architecture.
Batch Processing (Apache Spark)
- Reads a bounded dataset (e.g., one hour of events), transforms it, and writes the result.
- Great for backfills, aggregations, and ML feature engineering.
- Latency: minutes to hours.
Stream Processing (Apache Flink)
- Processes events continuously with sub-second latency.
- Supports event-time windowing, watermarks, and exactly-once semantics via checkpointing.
- Ideal for real-time dashboards, alerting, and fraud detection.
Lambda vs Kappa Architecture
| Aspect | Lambda | Kappa |
|---|---|---|
| Paths | Batch + speed layer | Single stream layer |
| Complexity | Two codebases to maintain | One codebase, replay for reprocessing |
| Reprocessing | Re-run batch job | Replay from ingestion log |
| When to pick | Need heavy batch aggregations | Stream processing handles all workloads |
Many teams start with Kappa (Flink reading from Kafka) and add a batch path only when required for cost or complexity reasons.
Common Transformations
- Deduplication — Remove duplicate events using event ID and a state store.
- Sessionization — Group events into sessions using gap-based windowing (e.g., 30-minute inactivity timeout).
- Enrichment — Join events with dimension tables (user profiles, product catalog).
- Aggregation — Pre-compute metrics (DAU, conversion funnels, p99 latency) to speed up queries.
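Gap-based sessionization from the list above can be sketched without any framework: sort a user's events by timestamp and start a new session whenever the gap exceeds the inactivity timeout. Timestamps are plain epoch seconds here for brevity; a real Flink job would use event-time session windows instead.

```python
SESSION_GAP = 30 * 60  # 30-minute inactivity timeout, in seconds

def sessionize(timestamps, gap=SESSION_GAP):
    """Group one user's event timestamps into sessions separated by > gap seconds."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # within the gap: continue current session
        else:
            sessions.append([ts])     # gap exceeded (or first event): new session
    return sessions

# Three events close together, then a 45-minute gap, then one more event
events = [0, 60, 300, 300 + 45 * 60]
print(sessionize(events))  # → [[0, 60, 300], [3000]]
```

The same gap rule is what Flink's session windows implement, with the added machinery of watermarks to handle late and out-of-order events.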
Stage 4 — Storage and Serving
Processed data lands in an analytical store optimized for fast reads over large datasets.
ClickHouse
- Open-source, column-oriented OLAP database.
- Blazing fast for time-series and aggregation queries.
- MergeTree engine handles inserts and background compaction.
- Scales horizontally with sharding and replication.
Google BigQuery
- Serverless, pay-per-query.
- Separates storage and compute — scan terabytes without provisioning.
- Streaming inserts for near-real-time availability.
- Native ML (BigQuery ML) for in-warehouse model training.
Other options: Apache Druid for sub-second slice-and-dice, Apache Pinot for user-facing analytics, Snowflake for general-purpose warehousing.
Storage Tiers
```
Hot  ──▶ ClickHouse / Druid     (recent data, fast queries)
Warm ──▶ BigQuery / Snowflake   (months of data, moderate latency)
Cold ──▶ S3 / GCS in Parquet    (years of data, cheap storage)
```
Partition by date and use TTL policies to age data between tiers automatically.
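The tiering rule can be as simple as routing each date partition by its age. A minimal sketch; the 7-day and 90-day cut-offs below are made-up thresholds for illustration, not recommendations:

```python
from datetime import date

def tier_for(partition_date, today, hot_days=7, warm_days=90):
    """Route a date partition to a storage tier by age (thresholds are illustrative)."""
    age = (today - partition_date).days
    if age <= hot_days:
        return "hot"    # ClickHouse / Druid
    if age <= warm_days:
        return "warm"   # BigQuery / Snowflake
    return "cold"       # S3 / GCS in Parquet

today = date(2026, 3, 28)
print(tier_for(date(2026, 3, 25), today))   # → hot
print(tier_for(date(2026, 1, 10), today))   # → warm
print(tier_for(date(2024, 3, 28), today))   # → cold
```

In practice you rarely run this logic yourself: ClickHouse `TTL` clauses and warehouse partition-expiration settings perform the aging automatically once configured.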
Real-Time vs Batch Analytics
The distinction matters for SLA and cost:
- Real-time (seconds): Use Flink writing to ClickHouse or Druid. Supports live dashboards, anomaly detection, and operational monitoring.
- Near-real-time (minutes): Use Spark Structured Streaming or Flink with larger windows. Suitable for most product analytics.
- Batch (hours): Use Spark or dbt running on a schedule. Best for historical reports, cohort analysis, and cost-sensitive workloads.
Most organizations need a blend. Operational metrics demand real-time; marketing attribution can wait for a nightly batch.
Dashboards and Visualization
The serving layer feeds dashboards and ad-hoc exploration tools:
- Grafana — Open-source, strong ClickHouse and Prometheus integration. Ideal for operational dashboards.
- Metabase — Open-source BI with self-service exploration. Connects to BigQuery, Postgres, ClickHouse.
- Looker / Tableau — Enterprise BI with governed semantic layers.
- Custom dashboards — React + a charting library (Recharts, D3) querying the analytical store directly.
Design dashboards around questions, not data. A dashboard titled "Conversion Funnel" that answers "where do users drop off?" is more useful than a generic "Events" dashboard.
Data Quality
A pipeline is only as valuable as the trust people place in its output. Data quality requires active enforcement:
- Schema validation — Reject or quarantine events that do not match the schema registry.
- Freshness monitoring — Alert when a pipeline stage has not produced output within the expected window.
- Volume anomaly detection — Track event counts per type. A sudden 80% drop in `page_viewed` events signals a broken SDK, not a traffic decline.
- Semantic tests — Assert business invariants: `order_completed` count should never exceed `checkout_started` count.
- Data contracts — Producers commit to a schema and SLA; breaking changes require a version bump and migration plan.
Tools like Great Expectations, dbt tests, Monte Carlo, and Soda Core automate these checks.
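The volume-anomaly and semantic checks above reduce to plain assertions over daily counts. A minimal sketch (function names, thresholds, and the sample counts are invented for illustration); a dedicated tool like Great Expectations or dbt tests adds scheduling, history, and alerting on top of checks like these:

```python
def volume_anomaly(today_count, baseline_count, max_drop=0.5):
    """Flag an event type whose volume fell more than max_drop vs its baseline."""
    if baseline_count == 0:
        return False                    # no baseline to compare against
    drop = (baseline_count - today_count) / baseline_count
    return drop > max_drop

def funnel_invariant(counts):
    """Business invariant: completions can never exceed starts."""
    return counts["order_completed"] <= counts["checkout_started"]

counts = {"checkout_started": 1200, "order_completed": 180}
print(funnel_invariant(counts))                                # → True
print(volume_anomaly(today_count=2000, baseline_count=10000))  # 80% drop → True
print(volume_anomaly(today_count=9000, baseline_count=10000))  # 10% drop → False
```

Checks like these run after each pipeline stage; a failure should quarantine the batch and page an owner rather than silently publishing bad numbers to dashboards.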
Putting It All Together
A production-grade analytics pipeline architecture for a mid-size SaaS product might look like this:
```
Client SDKs ──▶ API Gateway ──▶ Kafka (3 brokers, Schema Registry)
                                  │
                     ┌────────────┴────────────┐
                     ▼                         ▼
            Flink (real-time)          Spark (hourly batch)
                     │                         │
                     ▼                         ▼
                ClickHouse              BigQuery (warehouse)
                     │                         │
                     ▼                         ▼
          Grafana dashboards            Metabase / Looker
```
Key takeaways:
- Decouple collection, ingestion, processing, and storage so each layer can evolve independently.
- Use a schema registry from day one — retrofitting schemas onto unstructured events is painful.
- Start with stream processing (Flink or Spark Structured Streaming) and add batch only when economics or complexity demand it.
- Invest in data quality early. A fast pipeline that produces wrong numbers erodes trust faster than a slow pipeline that produces correct ones.
Build and explore system design concepts hands-on at codelit.io.