Data Warehouse Architecture: Star Schema, ETL & Modern MPP Engines
A data warehouse is a system optimized for analytical queries over large volumes of historical data. Unlike operational databases that serve transactional workloads, a data warehouse is designed to answer questions like "What were total sales by region last quarter?" in seconds across billions of rows.
OLTP vs OLAP#
Understanding the distinction between these two workload types is the starting point for warehouse design.
OLTP (Online Transaction Processing) systems handle high volumes of short, atomic transactions — inserts, updates, deletes. They are optimized for row-level access, normalized schemas, and low-latency writes. Examples: PostgreSQL, MySQL, DynamoDB.
OLAP (Online Analytical Processing) systems handle complex queries that scan large volumes of data, aggregate across dimensions, and join multiple tables. They are optimized for read-heavy, column-oriented access patterns. Examples: Redshift, BigQuery, Snowflake.
| Property | OLTP | OLAP |
|---|---|---|
| Query pattern | Point lookups, small updates | Full scans, aggregations |
| Schema | Normalized (3NF) | Denormalized (star/snowflake) |
| Optimization | Write throughput, latency | Read throughput, compression |
| Data volume | Current state | Historical accumulation |
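The contrast in the table can be made concrete with two queries against the same data. The sketch below uses Python's built-in sqlite3 with an invented `orders` table: the first query is an OLTP-style point lookup by primary key; the second is an OLAP-style scan that aggregates across a dimension.

```python
import sqlite3

# Hypothetical orders table; names and rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "EU", 80.0), (3, "US", 200.0), (4, "US", 50.0)],
)

# OLTP-style access: a point lookup touching a single row by primary key.
row = conn.execute("SELECT amount FROM orders WHERE order_id = 3").fetchone()
print(row[0])  # 200.0

# OLAP-style access: a full scan that aggregates across a dimension.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
):
    print(region, total)
```

A real warehouse runs the second shape of query over billions of rows; the point is the access pattern, not the engine.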
Fact and Dimension Tables#
Data warehouses organize data into fact tables and dimension tables.
Fact tables store measurable events — sales transactions, page views, shipments. Each row represents one event and contains:
- Foreign keys to dimension tables
- Numeric measures (revenue, quantity, duration)
Dimension tables store descriptive context — who, what, where, when. They contain:
- A surrogate key (integer primary key)
- Descriptive attributes (customer name, product category, store city)
```
fact_sales
  sale_id | date_key | product_key | store_key | quantity | revenue

dim_product
  product_key | product_name | category | brand | unit_price

dim_store
  store_key | store_name | city | state | region

dim_date
  date_key | full_date | day_of_week | month | quarter | year
```
Fact tables are typically narrow (few columns, many rows) and dimension tables are wide (many columns, fewer rows).
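A typical analytical query joins the fact table to its dimensions, filters on dimension attributes, and aggregates the measures. Here is a minimal sketch using sqlite3, with a cut-down version of the schema above and made-up rows:

```python
import sqlite3

# Minimal star schema matching the sketch above; data is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, quarter TEXT);
CREATE TABLE fact_sales (date_key INTEGER, store_key INTEGER, quantity INTEGER, revenue REAL);
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?, ?)",
                 [(1, "Downtown", "West"), (2, "Uptown", "East")])
conn.executemany("INSERT INTO dim_date VALUES (?, ?)", [(20250101, "2025-Q1")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(20250101, 1, 3, 30.0), (20250101, 2, 5, 50.0), (20250101, 1, 2, 20.0)])

# "Total sales by region last quarter": join the fact to its dimensions,
# filter on a dimension attribute, aggregate the measure.
query = """
SELECT s.region, SUM(f.revenue) AS total_revenue
FROM fact_sales f
JOIN dim_store s ON f.store_key = s.store_key
JOIN dim_date  d ON f.date_key  = d.date_key
WHERE d.quarter = '2025-Q1'
GROUP BY s.region
ORDER BY s.region
"""
for region, total in conn.execute(query):
    print(region, total)  # East 50.0, then West 50.0
```

Every join in this pattern goes from the fact table directly to one dimension, which is exactly the shape the star schema is designed for.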
Star Schema vs Snowflake Schema#
Star schema places the fact table at the center with dimension tables radiating outward. Dimensions are fully denormalized — all attributes live in a single table per dimension.
```
              dim_date
                 |
dim_store -- fact_sales -- dim_product
                 |
           dim_customer
```
Snowflake schema normalizes dimensions into sub-dimensions. A dim_product table might reference a separate dim_category table, which references a dim_department table.
Trade-offs:
| Aspect | Star | Snowflake |
|---|---|---|
| Query simplicity | Fewer joins | More joins |
| Storage | More redundancy | Less redundancy |
| Query performance | Faster (fewer joins) | Slower (more joins) |
| Maintenance | Easier | Harder to manage hierarchies |
In practice, star schema is preferred for most data warehouses. The storage cost of denormalization is negligible compared to the query performance and simplicity gains.
ETL Pipelines#
ETL (Extract, Transform, Load) is the process of moving data from source systems into the warehouse.
Extract — Pull data from operational databases, APIs, log files, and third-party services. Use change data capture (CDC) for incremental extraction rather than full dumps.
Transform — Clean, deduplicate, conform, and reshape data. Apply business rules, standardize formats, and compute derived fields. This is where you map source schemas to the warehouse star schema.
Load — Write transformed data into the warehouse. Use bulk loading for efficiency. Append new facts; update dimensions using slowly changing dimension (SCD) strategies.
Modern warehouses increasingly use ELT — load raw data first, then transform inside the warehouse using SQL. This leverages the warehouse's MPP engine for transformation rather than an external ETL server.
```
ETL: Sources -> Extract -> Staging Area -> Transform -> Load -> Warehouse

ELT: Sources -> Extract -> Load -> Warehouse -> Transform
```
Tools in this space include Apache Airflow (orchestration), dbt (SQL transformations), Fivetran (managed extraction), and Spark (large-scale processing).
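The three stages can be sketched end to end in a few lines. This is a toy pass, not a production pipeline; the source rows and field names are invented, and sqlite3 stands in for the warehouse:

```python
import sqlite3

# Extract: pretend these rows came from an operational database or API.
source_rows = [
    {"sku": "W-1", "name": " widget x ", "sold_at": "2025-07-01", "price": "9.99"},
    {"sku": "W-1", "name": " widget x ", "sold_at": "2025-07-01", "price": "9.99"},  # duplicate
    {"sku": "G-2", "name": "Gadget Y",   "sold_at": "2025-07-02", "price": "19.50"},
]

# Transform: deduplicate, standardize formats, and cast types.
seen, clean = set(), []
for r in source_rows:
    key = (r["sku"], r["sold_at"])
    if key in seen:
        continue  # drop exact duplicates from the extract
    seen.add(key)
    clean.append((r["sku"], r["name"].strip().title(), r["sold_at"], float(r["price"])))

# Load: bulk-insert the conformed rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sku TEXT, product_name TEXT, sold_at TEXT, price REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", clean)
print(conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0])  # 2
```

In an ELT variant, the raw `source_rows` would be loaded first and the deduplication and casting would run as SQL inside the warehouse.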
Columnar Storage#
Traditional row-oriented databases store all columns of a row together on disk. Columnar databases store all values of a single column together.
Why columnar wins for analytics:
- Compression — Column values share the same data type and often repeat. Run-length encoding, dictionary encoding, and delta encoding achieve 5–10x compression.
- I/O reduction — A query selecting 3 columns out of 100 reads only those 3 columns from disk, not the other 97.
- Vectorized execution — Modern CPUs process arrays of same-type values much faster than heterogeneous row tuples.
Columnar formats include Apache Parquet (files), Apache ORC (Hive), and the internal formats of Redshift, BigQuery, and Snowflake.
MPP Engines#
Massively Parallel Processing (MPP) engines distribute query execution across many nodes. Each node processes a slice of the data, and results are merged.
Amazon Redshift#
- Columnar MPP engine with a PostgreSQL-compatible SQL dialect.
- Data is distributed across nodes by distribution style (KEY, ALL, or EVEN).
- Sort keys determine on-disk ordering for range scans.
- Spectrum extends queries to S3 data without loading it into the cluster.
Google BigQuery#
- Serverless — no cluster management. On-demand pricing bills per terabyte of data scanned by each query.
- Separates storage (Colossus) from compute (Dremel/Borg).
- Automatic optimization: partitioning, clustering, and caching.
- Slot-based execution model scales transparently.
Snowflake#
- Separates storage, compute (virtual warehouses), and cloud services.
- Multiple virtual warehouses can query the same data independently — no contention.
- Automatic clustering, time travel (historical queries), and zero-copy cloning.
- Multi-cloud: runs on AWS, GCP, and Azure.
| Engine | Model | Scaling | Best For |
|---|---|---|---|
| Redshift | Provisioned cluster | Manual node add/remove | Predictable, steady workloads |
| BigQuery | Serverless | Automatic | Bursty, ad-hoc analytics |
| Snowflake | Separated compute | Independent warehouse scaling | Mixed workloads, multi-team |
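The scatter-gather pattern all three engines share can be sketched in plain Python: each "node" computes a partial aggregate over its slice of the data, then a coordinator merges the partials. Node count and distribution are invented for the sketch:

```python
from collections import Counter

rows = [("EU", 10), ("US", 20), ("EU", 5), ("US", 15), ("APAC", 7), ("EU", 3)]

# Distribute rows round-robin across 3 hypothetical nodes (EVEN-style distribution).
nodes = [rows[i::3] for i in range(3)]

def node_partial_sum(slice_):
    # Each node aggregates only its local slice; in a real engine this runs in parallel.
    partial = Counter()
    for region, revenue in slice_:
        partial[region] += revenue
    return partial

partials = [node_partial_sum(s) for s in nodes]

# Coordinator merges the per-node partial results into the final answer.
final = Counter()
for p in partials:
    final.update(p)
print(dict(final))  # {'EU': 18, 'US': 35, 'APAC': 7}
```

SUM, COUNT, MIN, and MAX all decompose this way; that decomposability is what lets the merge step stay cheap no matter how many nodes scanned data.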
Slowly Changing Dimensions (SCD)#
Dimension attributes change over time. A customer moves to a new city; a product gets reclassified. SCD strategies determine how to handle these changes.
Type 1 — Overwrite: Replace the old value with the new one. No history preserved. Simple but loses context for historical analysis.
Type 2 — Add new row: Insert a new row with the updated attributes, a new surrogate key, and effective date columns (valid_from, valid_to). The fact table references the surrogate key that was active at transaction time, preserving historical accuracy.
| product_key | product_name | category | valid_from | valid_to |
|---|---|---|---|---|
| 1001 | Widget X | Electronics | 2024-01-01 | 2025-06-30 |
| 1002 | Widget X | IoT Devices | 2025-07-01 | 9999-12-31 |
Type 3 — Add new column: Store both old and new values in separate columns (current_category, previous_category). Limited to tracking one change.
Type 2 is the most common approach in production warehouses because it preserves full history without complicating the fact table.
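A Type 2 change boils down to closing the current row and inserting a new one. The sketch below applies the category change from the example table; the helper function and its signature are invented for illustration:

```python
from datetime import date, timedelta

# Sentinel "open-ended" date used for the currently active row.
OPEN_END = date(9999, 12, 31)

dim_product = [
    {"product_key": 1001, "product_name": "Widget X", "category": "Electronics",
     "valid_from": date(2024, 1, 1), "valid_to": OPEN_END},
]

def apply_scd2(dim, product_name, new_category, change_date, next_key):
    for row in dim:
        # Find the currently active row for this product.
        if row["product_name"] == product_name and row["valid_to"] == OPEN_END:
            if row["category"] == new_category:
                return  # no change, nothing to do
            row["valid_to"] = change_date - timedelta(days=1)  # close the old version
            dim.append({"product_key": next_key, "product_name": product_name,
                        "category": new_category, "valid_from": change_date,
                        "valid_to": OPEN_END})
            return

apply_scd2(dim_product, "Widget X", "IoT Devices", date(2025, 7, 1), next_key=1002)
for row in dim_product:
    print(row["product_key"], row["category"], row["valid_to"])
# 1001 Electronics 2025-06-30
# 1002 IoT Devices 9999-12-31
```

Facts loaded before July 2025 keep referencing key 1001, so historical reports still see "Electronics"; new facts pick up key 1002.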
Key Takeaways#
- Data warehouses are optimized for analytical (OLAP) workloads — large scans, aggregations, and joins over historical data.
- Star schema with denormalized dimensions is the standard pattern. Snowflake schema adds normalization at the cost of query complexity.
- Fact tables store events with measures; dimension tables store descriptive context.
- ETL pipelines extract from sources, transform to the warehouse schema, and load. Modern ELT pushes transformation into the warehouse engine.
- Columnar storage delivers compression and I/O reduction that make analytical queries fast.
- MPP engines (Redshift, BigQuery, Snowflake) distribute processing across nodes. Choose based on your scaling model and workload pattern.
- Slowly changing dimensions (especially Type 2) preserve historical accuracy for time-variant analysis.