Data Warehouse Architecture: Star Schema, ETL & Modern MPP Engines
A data warehouse is a system optimized for analytical queries over large volumes of historical data. Unlike operational databases that serve transactional workloads, a data warehouse is designed to answer questions like "What were total sales by region last quarter?" in seconds across billions of rows.
OLTP vs OLAP#
Understanding the distinction between these two workload types is the starting point for warehouse design.
OLTP (Online Transaction Processing) systems handle high volumes of short, atomic transactions — inserts, updates, deletes. They are optimized for row-level access, normalized schemas, and low-latency writes. Examples: PostgreSQL, MySQL, DynamoDB.
OLAP (Online Analytical Processing) systems handle complex queries that scan large volumes of data, aggregate across dimensions, and join multiple tables. They are optimized for read-heavy, column-oriented access patterns. Examples: Redshift, BigQuery, Snowflake.
| Property | OLTP | OLAP |
|---|---|---|
| Query pattern | Point lookups, small updates | Full scans, aggregations |
| Schema | Normalized (3NF) | Denormalized (star/snowflake) |
| Optimization | Write throughput, latency | Read throughput, compression |
| Data volume | Current state | Historical accumulation |
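The contrast in the table can be made concrete with two queries against the same data. The sketch below uses Python's built-in sqlite3 with an invented `orders` table: the first query is an OLTP-style point lookup by primary key; the second is an OLAP-style scan that aggregates across a dimension.

```python
import sqlite3

# Hypothetical orders table; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "EU", 80.0), (3, "US", 200.0), (4, "US", 50.0)],
)

# OLTP-style access: a point lookup touching a single row by primary key.
row = conn.execute("SELECT amount FROM orders WHERE order_id = 3").fetchone()
print(row[0])  # 200.0

# OLAP-style access: a full scan that aggregates across a dimension.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
):
    print(region, total)
```

A real warehouse runs the second shape of query over billions of rows; the point is the access pattern, not the engine.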
Fact and Dimension Tables#
Data warehouses organize data into fact tables and dimension tables.
Fact tables store measurable events — sales transactions, page views, shipments. Each row represents one event and contains:
- Foreign keys to dimension tables
- Numeric measures (revenue, quantity, duration)
Dimension tables store descriptive context — who, what, where, when. They contain:
- A surrogate key (integer primary key)
- Descriptive attributes (customer name, product category, store city)
```
fact_sales
  sale_id | date_key | product_key | store_key | quantity | revenue

dim_product
  product_key | product_name | category | brand | unit_price

dim_store
  store_key | store_name | city | state | region

dim_date
  date_key | full_date | day_of_week | month | quarter | year
```
Fact tables are typically narrow (few columns, many rows) and dimension tables are wide (many columns, fewer rows).
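A typical analytical query joins the fact table to its dimensions, filters on dimension attributes, and aggregates the measures. Here is a minimal sketch using sqlite3, with a cut-down version of the schema above and made-up rows:

```python
import sqlite3

# Minimal star schema matching the sketch above; data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE fact_sales (date_key INTEGER, store_key INTEGER, quantity INTEGER, revenue REAL);
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)",
                 [(1, "Downtown", "West"), (2, "Uptown", "East")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)", [(20250101, "2025-Q1")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(20250101, 1, 3, 30.0), (20250101, 2, 5, 50.0), (20250101, 1, 2, 20.0)])

# "Total sales by region last quarter": join the fact to its dimensions,
# filter on a dimension attribute, aggregate the measure.
query = """
SELECT s.region, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_store s ON f.store_key = s.store_key
JOIN dim_date  d ON f.date_key  = d.date_key
WHERE d.quarter = '2025-Q1'
GROUP BY s.region
ORDER BY s.region
"""
for region, total in conn.execute(query):
    print(region, total)  # East 50.0, then West 50.0
```

Every join in this pattern goes from the fact table directly to one dimension, which is exactly the shape the star schema is designed for.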
Star Schema vs Snowflake Schema#
Star schema places the fact table at the center with dimension tables radiating outward. Dimensions are fully denormalized — all attributes live in a single table per dimension.
```
              dim_date
                 |
dim_store -- fact_sales -- dim_product
                 |
           dim_customer
```
Snowflake schema normalizes dimensions into sub-dimensions. A dim_product table might reference a separate dim_category table, which references a dim_department table.
Trade-offs:
| Aspect | Star | Snowflake |
|---|---|---|
| Query simplicity | Fewer joins | More joins |
| Storage | More redundancy | Less redundancy |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Maintenance | Easier | Harder to manage hierarchies |
In practice, star schema is preferred for most data warehouses. The storage cost of denormalization is negligible compared to the query performance and simplicity gains.
ETL Pipelines#
ETL (Extract, Transform, Load) is the process of moving data from source systems into the warehouse.
Extract — Pull data from operational databases, APIs, log files, and third-party services. Use change data capture (CDC) for incremental extraction rather than full dumps.
Transform — Clean, deduplicate, conform, and reshape data. Apply business rules, standardize formats, and compute derived fields. This is where you map source schemas to the warehouse star schema.
Load — Write transformed data into the warehouse. Use bulk loading for efficiency. Append new facts; update dimensions using slowly changing dimension (SCD) strategies.
Modern warehouses increasingly use ELT — load raw data first, then transform inside the warehouse using SQL. This leverages the warehouse's MPP engine for transformation rather than an external ETL server.
```
ETL: Sources -> Extract -> Staging Area -> Transform -> Load -> Warehouse

ELT: Sources -> Extract -> Load -> Warehouse -> Transform
```
Tools in this space include Apache Airflow (orchestration), dbt (SQL transformations), Fivetran (managed extraction), and Spark (large-scale processing).
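The three stages can be sketched end to end in a few lines. This is a toy pass, not a production pipeline; the source rows and field names are invented, and sqlite3 stands in for the warehouse:

```python
import sqlite3

# Extract: pretend these rows came from an operational database or API.
source_rows = [
    {"sku": "W-1", "name": " widget x ", "sold_at": "2025-07-01", "price": "9.99"},
    {"sku": "W-1", "name": " widget x ", "sold_at": "2025-07-01", "price": "9.99"},  # duplicate
    {"sku": "G-2", "name": "Gadget Y",   "sold_at": "2025-07-02", "price": "19.50"},
]

# Transform: deduplicate, standardize formats, and cast types.
seen, clean = set(), []
for r in source_rows:
    key = (r["sku"], r["sold_at"])
    if key in seen:
        continue  # drop exact duplicates from the extract
    seen.add(key)
    clean.append((r["sku"], r["name"].strip().title(), r["sold_at"], float(r["price"])))

# Load: bulk-insert the conformed rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sku TEXT, product_name TEXT, sold_at TEXT, price REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", clean)
print(conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # 2
```

In an ELT variant, the raw `source_rows` would be loaded first and the deduplication and casting would run as SQL inside the warehouse.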
Columnar Storage#
Traditional row-oriented databases store all columns of a row together on disk. Columnar databases store all values of a single column together.
Why columnar wins for analytics:
- Compression — Column values share the same data type and often repeat. Run-length encoding, dictionary encoding, and delta encoding achieve 5–10x compression.
- I/O reduction — A query selecting 3 columns out of 100 reads only those 3 columns from disk, not the other 97.
- Vectorized execution — Modern CPUs process arrays of same-type values much faster than heterogeneous row tuples.
Columnar formats include Apache Parquet (files), Apache ORC (Hive), and the internal formats of Redshift, BigQuery, and Snowflake.
MPP Engines#
Massively Parallel Processing (MPP) engines distribute query execution across many nodes. Each node processes a slice of the data, and results are merged.
Amazon Redshift#
- Columnar MPP engine with a PostgreSQL-compatible SQL dialect.
- Data is distributed across nodes by distribution style (KEY, ALL, or EVEN).
- Sort keys determine on-disk ordering for range scans.
- Spectrum extends queries to S3 data without loading it into the cluster.
Google BigQuery#
- Serverless — no cluster management. On-demand pricing bills per terabyte of data scanned by each query.
- Separates storage (Colossus) from compute (Dremel/Borg).
- Automatic optimization: partitioning, clustering, and caching.
- Slot-based execution model scales transparently.
Snowflake#
- Separates storage, compute (virtual warehouses), and cloud services.
- Multiple virtual warehouses can query the same data independently — no contention.
- Automatic clustering, time travel (historical queries), and zero-copy cloning.
- Multi-cloud: runs on AWS, GCP, and Azure.
| Engine | Model | Scaling | Best For |
|---|---|---|---|
| Redshift | Provisioned cluster | Manual node add/remove | Predictable, steady workloads |
| BigQuery | Serverless | Automatic | Bursty, ad-hoc analytics |
| Snowflake | Separated compute | Independent warehouse scaling | Mixed workloads, multi-team |
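The scatter-gather pattern all three engines share can be sketched in plain Python: each "node" computes a partial aggregate over its slice of the data, then a coordinator merges the partials. Node count and distribution are invented for the sketch:

```python
from collections import Counter

rows = [("EU", 10), ("US", 20), ("EU", 5), ("US", 15), ("APAC", 7), ("EU", 3)]

# Distribute rows round-robin across 3 hypothetical nodes (EVEN-style distribution).
nodes = [rows[i::3] for i in range(3)]

def node_partial_sum(slice_):
    # Each node aggregates only its local slice; in a real engine this runs in parallel.
    partial = Counter()
    for region, revenue in slice_:
        partial[region] += revenue
    return partial

partials = [node_partial_sum(s) for s in nodes]

# Coordinator merges the per-node partial results into the final answer.
final = Counter()
for p in partials:
    final.update(p)
print(dict(final))  # {'EU': 18, 'US': 35, 'APAC': 7}
```

SUM, COUNT, MIN, and MAX all decompose this way; that decomposability is what lets the merge step stay cheap no matter how many nodes scanned data.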
Slowly Changing Dimensions (SCD)#
Dimension attributes change over time. A customer moves to a new city; a product gets reclassified. SCD strategies determine how to handle these changes.
Type 1 — Overwrite: Replace the old value with the new one. No history preserved. Simple but loses context for historical analysis.
Type 2 — Add new row: Insert a new row with the updated attributes, a new surrogate key, and effective date columns (valid_from, valid_to). The fact table references the surrogate key that was active at transaction time, preserving historical accuracy.
| product_key | product_name | category | valid_from | valid_to |
|---|---|---|---|---|
| 1001 | Widget X | Electronics | 2024-01-01 | 2025-06-30 |
| 1002 | Widget X | IoT Devices | 2025-07-01 | 9999-12-31 |
Type 3 — Add new column: Store both old and new values in separate columns (current_category, previous_category). Limited to tracking one change.
Type 2 is the most common approach in production warehouses because it preserves full history without complicating the fact table.
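A Type 2 change boils down to closing the current row and inserting a new one. The sketch below applies the category change from the example table; the helper function and its signature are invented for illustration:

```python
from datetime import date, timedelta

# Sentinel "open-ended" date used for the currently active row.
OPEN_END = date(9999, 12, 31)

dim_product = [
    {"product_key": 1001, "product_name": "Widget X", "category": "Electronics",
     "valid_from": date(2024, 1, 1), "valid_to": OPEN_END},
]

def apply_scd2(dim, product_name, new_category, change_date, next_key):
    for row in dim:
        # Find the currently active row for this product.
        if row["product_name"] == product_name and row["valid_to"] == OPEN_END:
            if row["category"] == new_category:
                return  # no change, nothing to do
            row["valid_to"] = change_date - timedelta(days=1)  # close the old version
            dim.append({"product_key": next_key, "product_name": product_name,
                        "category": new_category, "valid_from": change_date,
                        "valid_to": OPEN_END})
            return

apply_scd2(dim_product, "Widget X", "IoT Devices", date(2025, 7, 1), next_key=1002)
for row in dim_product:
    print(row["product_key"], row["category"], row["valid_to"])
# 1001 Electronics 2025-06-30
# 1002 IoT Devices 9999-12-31
```

Facts loaded before July 2025 keep referencing key 1001, so historical reports still see "Electronics"; new facts pick up key 1002.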
Key Takeaways#
- Data warehouses are optimized for analytical (OLAP) workloads — large scans, aggregations, and joins over historical data.
- Star schema with denormalized dimensions is the standard pattern. Snowflake schema adds normalization at the cost of query complexity.
- Fact tables store events with measures; dimension tables store descriptive context.
- ETL pipelines extract from sources, transform to the warehouse schema, and load. Modern ELT pushes transformation into the warehouse engine.
- Columnar storage delivers compression and I/O reduction that make analytical queries fast.
- MPP engines (Redshift, BigQuery, Snowflake) distribute processing across nodes. Choose based on your scaling model and workload pattern.
- Slowly changing dimensions (especially Type 2) preserve historical accuracy for time-variant analysis.