Data Warehouse Architecture: Star Schema, ETL & Modern MPP Engines
A data warehouse is a system optimized for analytical queries over large volumes of historical data. Unlike operational databases that serve transactional workloads, a data warehouse is designed to answer questions like "What were total sales by region last quarter?" in seconds across billions of rows.
OLTP vs OLAP#
Understanding the distinction between these two workload types is the starting point for warehouse design.
OLTP (Online Transaction Processing) systems handle high volumes of short, atomic transactions — inserts, updates, deletes. They are optimized for row-level access, normalized schemas, and low-latency writes. Examples: PostgreSQL, MySQL, DynamoDB.
OLAP (Online Analytical Processing) systems handle complex queries that scan large volumes of data, aggregate across dimensions, and join multiple tables. They are optimized for read-heavy, column-oriented access patterns. Examples: Redshift, BigQuery, Snowflake.
| Property | OLTP | OLAP |
|---|---|---|
| Query pattern | Point lookups, small updates | Full scans, aggregations |
| Schema | Normalized (3NF) | Denormalized (star/snowflake) |
| Optimization | Write throughput, latency | Read throughput, compression |
| Data volume | Current state | Historical accumulation |
Fact and Dimension Tables#
Data warehouses organize data into fact tables and dimension tables.
Fact tables store measurable events — sales transactions, page views, shipments. Each row represents one event and contains:
- Foreign keys to dimension tables
- Numeric measures (revenue, quantity, duration)
Dimension tables store descriptive context — who, what, where, when. They contain:
- A surrogate key (integer primary key)
- Descriptive attributes (customer name, product category, store city)
fact_sales
sale_id | date_key | product_key | store_key | quantity | revenue
dim_product
product_key | product_name | category | brand | unit_price
dim_store
store_key | store_name | city | state | region
dim_date
date_key | full_date | day_of_week | month | quarter | year
Fact tables are typically narrow (few columns, many rows) and dimension tables are wide (many columns, fewer rows).
Star Schema vs Snowflake Schema#
Star schema places the fact table at the center with dimension tables radiating outward. Dimensions are fully denormalized — all attributes live in a single table per dimension.
dim_date
|
dim_store--fact_sales--dim_product
|
dim_customer
Snowflake schema normalizes dimensions into sub-dimensions. A dim_product table might reference a separate dim_category table, which references a dim_department table.
Trade-offs:
| Aspect | Star | Snowflake |
|---|---|---|
| Query simplicity | Fewer joins | More joins |
| Storage | More redundancy | Less redundancy |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Maintenance | Easier | Harder to manage hierarchies |
In practice, star schema is preferred for most data warehouses. The storage cost of denormalization is negligible compared to the query performance and simplicity gains.
ETL Pipelines#
ETL (Extract, Transform, Load) is the process of moving data from source systems into the warehouse.
Extract — Pull data from operational databases, APIs, log files, and third-party services. Use change data capture (CDC) for incremental extraction rather than full dumps.
Transform — Clean, deduplicate, conform, and reshape data. Apply business rules, standardize formats, and compute derived fields. This is where you map source schemas to the warehouse star schema.
Load — Write transformed data into the warehouse. Use bulk loading for efficiency. Append new facts; update dimensions using slowly changing dimension (SCD) strategies.
Modern warehouses increasingly use ELT — load raw data first, then transform inside the warehouse using SQL. This leverages the warehouse's MPP engine for transformation rather than an external ETL server.
Sources -> Extract -> Staging Area -> Transform -> Load -> Warehouse
|
Sources -> Extract -> Load -> Warehouse -> Transform (ELT)
Tools in this space include Apache Airflow (orchestration), dbt (SQL transformations), Fivetran (managed extraction), and Spark (large-scale processing).
Columnar Storage#
Traditional row-oriented databases store all columns of a row together on disk. Columnar databases store all values of a single column together.
Why columnar wins for analytics:
- Compression — Column values share the same data type and often repeat. Run-length encoding, dictionary encoding, and delta encoding achieve 5–10x compression.
- I/O reduction — A query selecting 3 columns out of 100 reads only those 3 columns from disk, not the other 97.
- Vectorized execution — Modern CPUs process arrays of same-type values much faster than heterogeneous row tuples.
Columnar formats include Apache Parquet (files), Apache ORC (Hive), and the internal formats of Redshift, BigQuery, and Snowflake.
MPP Engines#
Massively Parallel Processing (MPP) engines distribute query execution across many nodes. Each node processes a slice of the data, and results are merged.
Amazon Redshift#
- Columnar, MPP, based on PostgreSQL dialect.
- Data distributed across nodes using distribution keys (KEY, ALL, EVEN).
- Sort keys determine on-disk ordering for range scans.
- Spectrum extends queries to S3 data without loading it into the cluster.
Google BigQuery#
- Serverless — no cluster management. Pay per query scanned.
- Separates storage (Colossus) from compute (Dremel/Borg).
- Automatic optimization: partitioning, clustering, and caching.
- Slot-based execution model scales transparently.
Snowflake#
- Separates storage, compute (virtual warehouses), and cloud services.
- Multiple virtual warehouses can query the same data independently — no contention.
- Automatic clustering, time travel (historical queries), and zero-copy cloning.
- Multi-cloud: runs on AWS, GCP, and Azure.
| Engine | Model | Scaling | Best For |
|---|---|---|---|
| Redshift | Provisioned cluster | Manual node add/remove | Predictable, steady workloads |
| BigQuery | Serverless | Automatic | Bursty, ad-hoc analytics |
| Snowflake | Separated compute | Independent warehouse scaling | Mixed workloads, multi-team |
Slowly Changing Dimensions (SCD)#
Dimension attributes change over time. A customer moves to a new city; a product gets reclassified. SCD strategies determine how to handle these changes.
Type 1 — Overwrite: Replace the old value with the new one. No history preserved. Simple but loses context for historical analysis.
Type 2 — Add new row: Insert a new row with the updated attributes, a new surrogate key, and effective date columns (valid_from, valid_to). The fact table references the surrogate key that was active at transaction time, preserving historical accuracy.
product_key | product_name | category | valid_from | valid_to
1001 | Widget X | Electronics | 2024-01-01 | 2025-06-30
1002 | Widget X | IoT Devices | 2025-07-01 | 9999-12-31
Type 3 — Add new column: Store both old and new values in separate columns (current_category, previous_category). Limited to tracking one change.
Type 2 is the most common approach in production warehouses because it preserves full history without complicating the fact table.
Key Takeaways#
- Data warehouses are optimized for analytical (OLAP) workloads — large scans, aggregations, and joins over historical data.
- Star schema with denormalized dimensions is the standard pattern. Snowflake schema adds normalization at the cost of query complexity.
- Fact tables store events with measures; dimension tables store descriptive context.
- ETL pipelines extract from sources, transform to the warehouse schema, and load. Modern ELT pushes transformation into the warehouse engine.
- Columnar storage delivers compression and I/O reduction that make analytical queries fast.
- MPP engines (Redshift, BigQuery, Snowflake) distribute processing across nodes. Choose based on your scaling model and workload pattern.
- Slowly changing dimensions (especially Type 2) preserve historical accuracy for time-variant analysis.
Build and explore system design concepts hands-on at codelit.io.
259 articles on system design at codelit.io/blog.
Try it on Codelit
Chaos Mode
Simulate node failures and watch cascading impact across your architecture
Related articles
Try these templates
Build this architecture
Generate an interactive Data Warehouse Architecture in seconds.
Try it in Codelit →
Comments