Real-Time Analytics Architecture: From Events to Dashboards
Batch analytics with nightly ETL jobs served us well for decades. But modern products demand real-time insights — fraud detection in milliseconds, live dashboards updating every second, A/B test results as they happen. Building a real-time analytics architecture requires rethinking every layer of the data pipeline, from ingestion to visualization.
High-Level Architecture#
A real-time analytics system has four layers:
┌────────────────────────────────────────────────────────┐
│                      Data Sources                      │
│  (clickstream, transactions, IoT sensors, app events)  │
└───────────────────────────┬────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                    Stream Ingestion                    │
│           (Kafka, Kinesis, Pulsar, Redpanda)           │
└───────────────────────────┬────────────────────────────┘
                            │
                    ┌───────┴────────┐
                    ▼                ▼
       ┌───────────────────┐   ┌──────────────────────┐
       │ Stream Processing │   │   Direct Ingestion   │
       │   (Flink, Spark   │   │   (Kafka Connect,    │
       │    Streaming)     │   │  native connectors)  │
       └─────────┬─────────┘   └──────────┬───────────┘
                 │                        │
                 └────────────┬───────────┘
                              ▼
┌────────────────────────────────────────────────────────┐
│                 OLAP / Analytics Store                 │
│  (ClickHouse, Apache Druid, Apache Pinot, StarRocks)   │
└───────────────────────────┬────────────────────────────┘
                            │
                            ▼
┌────────────────────────────────────────────────────────┐
│                 Query / Visualization                  │
│    (Grafana, Superset, Metabase, custom dashboards)    │
└────────────────────────────────────────────────────────┘
Stream Processing Layer#
Stream processors transform, enrich, and aggregate events before they land in the analytics store.
Apache Flink is arguably the most capable open-source stream processor, offering event-time windowing, exactly-once state checkpointing, and a mature SQL layer:
-- Flink SQL: Compute real-time revenue per product category
-- over tumbling 1-minute windows
SELECT
    category,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS order_count,
    SUM(amount) AS revenue
FROM orders
GROUP BY
    category,
    TUMBLE(event_time, INTERVAL '1' MINUTE);
When to use stream processing vs. direct ingestion:
| Scenario | Approach |
|---|---|
| Raw event storage for ad-hoc queries | Direct ingestion |
| Pre-aggregation to reduce storage | Stream processing |
| Enrichment with dimension data | Stream processing |
| Filtering / PII redaction | Stream processing |
| Simple append-only logging | Direct ingestion |
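To make the stream-processing rows of the table concrete, here is a hypothetical enrichment-and-redaction step in Python. The event shape, field names, and dimension map are assumptions for illustration, not a real pipeline:

```python
# Hypothetical stream-processing step: enrich each event with dimension
# data and redact PII before it lands in the analytics store. The event
# shape, field names, and dimension map are illustrative assumptions.

PII_FIELDS = {"email", "ip_address"}

def enrich(event, categories_by_product):
    """Drop PII fields and join in the product's category."""
    enriched = {k: v for k, v in event.items() if k not in PII_FIELDS}
    enriched["category"] = categories_by_product.get(
        event.get("product_id"), "unknown"
    )
    return enriched

# One clickstream event flowing through the step:
dims = {"p1": "electronics"}
event = {"product_id": "p1", "amount": 19.99, "email": "a@b.com"}
print(enrich(event, dims))
# → {'product_id': 'p1', 'amount': 19.99, 'category': 'electronics'}
```

In production this logic would live in a Flink or Spark Streaming job; the point is only what per-event enrichment and PII redaction mean.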
Choosing an OLAP Engine#
The OLAP engine is the core of your analytics architecture. Here is how the major options compare:
ClickHouse#
ClickHouse is a column-oriented database optimized for analytical queries on large datasets.
Strengths: Fastest single-node performance, excellent compression, rich SQL dialect, materialized views, wide ecosystem.
Architecture: Shared-nothing, each node stores and queries its own data. Replication via ZooKeeper or ClickHouse Keeper.
-- ClickHouse: Create a table optimized for time-series analytics
CREATE TABLE events (
    event_id    UUID,
    user_id     UInt64,
    event_type  LowCardinality(String),
    properties  Map(String, String),
    amount      Decimal64(2),
    event_time  DateTime64(3),
    date        Date MATERIALIZED toDate(event_time)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_time)
ORDER BY (event_type, user_id, event_time)
TTL event_time + INTERVAL 90 DAY;
Apache Druid#
Druid is designed for sub-second OLAP queries on streaming data with high concurrency.
Strengths: Native Kafka ingestion, automatic data tiering (hot/warm/cold), high query concurrency, segment-level indexing.
Architecture: Separated ingestion, query, and storage nodes. Segments are immutable and stored in deep storage (S3/HDFS).
Apache Pinot#
Pinot was built at LinkedIn for user-facing real-time analytics at massive scale.
Strengths: Upsert support, star-tree index for pre-aggregation, native Kafka consumer, designed for high-QPS user-facing queries.
Comparison#
| Feature | ClickHouse | Druid | Pinot |
|---|---|---|---|
| Query latency (p99) | 10–500 ms | 50–500 ms | 10–200 ms |
| Ingestion latency | Seconds | Seconds | Seconds |
| SQL support | Full SQL | Druid SQL (limited) | Multi-stage SQL |
| Upserts | ReplacingMergeTree | Limited | Native |
| Joins | Yes | Limited | Multi-stage |
| Managed offering | ClickHouse Cloud | Imply | StarTree |
| Best for | Ad-hoc analytics | High-concurrency | User-facing analytics |
Materialized Views#
Materialized views pre-compute aggregations at ingestion time, trading storage for query speed:
-- ClickHouse: Materialized view for real-time dashboard
CREATE MATERIALIZED VIEW revenue_by_minute
ENGINE = SummingMergeTree()
ORDER BY (category, minute)
AS SELECT
    category,
    toStartOfMinute(event_time) AS minute,
    count() AS order_count,
    sum(amount) AS revenue
FROM events
WHERE event_type = 'purchase'
GROUP BY category, minute;

-- Query the materialized view — instant response
SELECT
    category,
    sum(order_count) AS orders,
    sum(revenue) AS total_revenue
FROM revenue_by_minute
WHERE minute >= now() - INTERVAL 1 HOUR
GROUP BY category
ORDER BY total_revenue DESC;
Materialized view design principles:
- Pre-aggregate the most common dashboard queries — Identify the top 5–10 queries and create views for them.
- Use appropriate merge engines — SummingMergeTree for additive metrics, AggregatingMergeTree for complex aggregations like uniques.
- Partition by time — Enables efficient TTL-based cleanup and partition pruning.
- Keep raw data — Materialized views complement raw tables; they do not replace them. Ad-hoc exploration needs raw granularity.
Approximate Algorithms#
When exact counts on billions of rows are too expensive, approximate algorithms trade precision for speed:
HyperLogLog — Count Distinct#
Estimates the number of unique values in a few kilobytes of fixed memory, regardless of cardinality:
-- ClickHouse: Approximate unique users per day
SELECT
    toDate(event_time) AS day,
    uniqHLL12(user_id) AS approx_unique_users
FROM events
GROUP BY day
ORDER BY day DESC
LIMIT 30;
Typical error rate: 0.5–2%. For dashboards showing "1.2M unique users today," this is perfectly acceptable.
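To show the mechanics behind functions like uniqHLL12, here is a toy HyperLogLog in Python. It is a sketch for illustration, not ClickHouse's implementation; p = 12 mirrors the 2^12 registers the function name refers to:

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog sketch with 2^p registers.

    Illustrative only; ClickHouse's uniqHLL12 is the production
    version (its name implies the same p = 12 precision).
    """

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p                  # number of registers (4096 at p=12)
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, m >= 128

    def add(self, value):
        # 64-bit hash: the first p bits pick a register, the rest give a rank
        h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:
            # Linear counting is more accurate at low cardinality
            est = self.m * math.log(self.m / zeros)
        return est
```

Each register keeps only the maximum observed rank, which is why duplicates never change the estimate and memory stays fixed no matter how many events stream through.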
Count-Min Sketch — Frequency Estimation#
Estimates how often a specific item appears in a stream. Useful for top-N queries:
-- Approximate top 10 most viewed pages (last hour).
-- ClickHouse's topK uses Filtered Space-Saving, a sketch in the
-- same family as Count-Min.
SELECT topK(10)(page_url) AS top_pages
FROM events
WHERE event_type = 'pageview'
  AND event_time >= now() - INTERVAL 1 HOUR;
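A minimal Count-Min Sketch in Python shows the data structure itself; this is a toy, engine-independent illustration:

```python
import hashlib

class CountMinSketch:
    """Toy Count-Min Sketch: may overestimate counts, never underestimates."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hashes(self, item):
        # One independent hash per row, derived by salting with the row index
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in enumerate(self._hashes(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # The true count is <= every cell; collisions only inflate cells,
        # so the minimum across rows is the tightest estimate
        return min(self.table[row][col]
                   for row, col in enumerate(self._hashes(item)))
```

Because the table size is fixed, memory stays constant even for streams with millions of distinct keys; accuracy degrades gracefully as collisions accumulate.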
T-Digest — Approximate Percentiles#
Computes approximate quantiles (p50, p95, p99) with high accuracy at the tails:
SELECT
    quantilesTDigest(0.5, 0.95, 0.99)(response_time_ms) AS latency_percentiles
FROM api_events
WHERE event_time >= now() - INTERVAL 5 MINUTE;
Dashboard Architecture#
The visualization layer connects users to the analytics store. Key design decisions:
Query Caching#
Dashboards with many viewers should not each fire independent queries:
┌───────────┐      ┌─────────────┐      ┌────────────┐
│ Dashboard │─────▶│ Query Cache │─────▶│ ClickHouse │
│ (browser) │      │  (Redis /   │      │            │
│           │◀─────│ in-memory)  │◀─────│            │
└───────────┘      └─────────────┘      └────────────┘

Cache key = hash(query + time_bucket)
TTL       = refresh interval (e.g., 10 seconds)
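The cache-key scheme can be sketched in a few lines of Python. A plain dict stands in for Redis, and run_query is a placeholder for whatever client executes SQL against the store:

```python
import hashlib
import time

class DashboardQueryCache:
    """Share one backend query among all viewers of a dashboard.

    The key includes a time bucket, so every request inside one
    refresh interval maps to the same cache entry. A plain dict
    stands in for Redis here; run_query is a placeholder for the
    real SQL client.
    """

    def __init__(self, run_query, refresh_interval_s=10):
        self.run_query = run_query      # callable: sql -> rows
        self.interval = refresh_interval_s
        self.store = {}                 # cache_key -> rows

    def _key(self, sql):
        bucket = int(time.time() // self.interval)
        return hashlib.sha256(f"{sql}|{bucket}".encode()).hexdigest()

    def get(self, sql):
        key = self._key(sql)
        if key not in self.store:
            self.store[key] = self.run_query(sql)  # only the first viewer pays
        return self.store[key]
```

With a 10-second interval, a dashboard open in 500 browsers costs one backend query per widget per 10 seconds instead of 500.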
Push vs. Poll#
| Approach | Mechanism | Best for |
|---|---|---|
| Polling | Dashboard queries every N sec | Simple, tolerates N-sec lag |
| WebSocket | Server pushes on data change | Sub-second updates |
| SSE | Server-sent events | One-directional push |
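For the SSE row, the wire format is simple enough to show directly. This helper formats one frame per the Server-Sent Events wire format; the event name "metrics" is an arbitrary example:

```python
import json

def sse_event(data, event="metrics"):
    """Format one Server-Sent Events frame (the standard SSE wire format)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"

# A dashboard client subscribed via EventSource would receive frames like:
print(sse_event({"revenue": 1250.0, "orders": 42}), end="")
```

SSE rides on a plain long-lived HTTP response, which makes it a lighter choice than WebSockets when updates only flow server-to-client.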
Multi-Tenant Isolation#
For SaaS analytics dashboards serving many customers:
- Row-level security — Filter queries by tenant_id at the query layer.
- Separate tables/databases — Stronger isolation but higher operational cost.
- Query quotas — Rate-limit queries per tenant to prevent noisy neighbors.
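The query-quota idea can be sketched as a per-tenant token bucket. The rate and burst numbers are illustrative, and the explicit now parameter just makes the refill logic easy to exercise:

```python
import time

class TenantRateLimiter:
    """Per-tenant token bucket for query quotas (illustrative numbers)."""

    def __init__(self, rate_per_s=5.0, burst=10):
        self.rate = rate_per_s          # tokens refilled per second
        self.burst = burst              # bucket capacity
        self.state = {}                 # tenant_id -> (tokens, last_timestamp)

    def allow(self, tenant_id, now=None):
        """Return True if this tenant may run one more query right now."""
        now = time.monotonic() if now is None else now
        tokens, last = self.state.get(tenant_id, (self.burst, now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[tenant_id] = (tokens - 1, now)
            return True
        self.state[tenant_id] = (tokens, now)
        return False
```

Each tenant draws from its own bucket, so one customer's runaway dashboard cannot starve the others.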
Ingestion Patterns#
Exactly-Once Semantics#
Exactly-once results from Kafka to ClickHouse are typically built from at-least-once delivery plus deduplication:
Kafka Consumer Group
          │
          ▼
┌──────────────────────┐
│  Batch events (10s   │
│     or 10K rows)     │
├──────────────────────┤
│  Insert batch into   │
│      ClickHouse      │
├──────────────────────┤
│ Commit Kafka offset  │
│ AFTER insert success │
└──────────────────────┘
If the consumer crashes after insert but before commit, the batch is re-inserted. Use ReplacingMergeTree with a deduplication key to absorb the duplicates (note that ReplacingMergeTree deduplicates at merge time, so read with FINAL or an aggregating query when exactness matters before merges run):
CREATE TABLE events (
    event_id UUID,
    -- ... other columns
) ENGINE = ReplacingMergeTree()
ORDER BY event_id;
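The insert-then-commit loop can be sketched with stand-in clients. FakeConsumer and FakeClickHouse below are hypothetical stubs (real code would use clients such as confluent-kafka and clickhouse-connect); the point is the ordering: write the batch first, commit the offset second:

```python
class FakeConsumer:
    """Stand-in for a Kafka consumer (real code: e.g. confluent-kafka)."""

    def __init__(self, events):
        self.events = list(events)
        self.offset = 0    # committed offset
        self.polled = 0    # rows handed out since the last commit

    def poll_batch(self, max_rows):
        batch = self.events[self.offset:self.offset + max_rows]
        self.polled = len(batch)
        return batch

    def commit(self):
        self.offset += self.polled

class FakeClickHouse:
    """Stand-in for a ClickHouse client (real code: e.g. clickhouse-connect)."""

    def __init__(self):
        self.rows = []

    def insert(self, table, rows):
        self.rows.extend(rows)

def ingest_batch(consumer, clickhouse, max_rows=10_000):
    """One turn of the loop: insert the batch, THEN commit the offset."""
    batch = consumer.poll_batch(max_rows)
    if not batch:
        return 0
    clickhouse.insert("events", batch)  # 1. write first
    consumer.commit()                   # 2. commit only after a successful insert
    return len(batch)
```

A crash between the insert and the commit re-delivers the whole batch on restart, which is exactly the duplicate case the ReplacingMergeTree key absorbs.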
Backfill Strategy#
Real-time systems need a way to reprocess historical data:
- Dual-write — Write to both the real-time path and a batch path (S3/data lake).
- Replay — Re-read from Kafka (if retention allows) or from the data lake.
- Swap — Load backfilled data into a shadow table, then swap with an atomic EXCHANGE TABLES (ClickHouse) or segment replacement (Druid).
Scaling Considerations#
- Ingestion throughput — ClickHouse handles 1–5 million rows/second per node. Scale horizontally with sharding.
- Query concurrency — Druid and Pinot are designed for 1000+ QPS. ClickHouse handles fewer concurrent queries but each is faster.
- Storage — Column compression typically achieves 5–10x reduction. Plan capacity on compressed size.
- Retention — Use TTL to automatically drop old data. Archive to object storage for compliance.
Real-time analytics is no longer a luxury — users expect instant feedback, and businesses need immediate visibility. The combination of stream processing, modern OLAP engines, and smart pre-aggregation makes it achievable at reasonable cost.
That is article #389 on Codelit. Browse all articles or explore the platform to level up your engineering skills.