Data Lake Architecture: From Raw Ingestion to Production-Ready Analytics
A data lake stores raw data at any scale in its native format. Unlike a data warehouse, it does not enforce schema at write time — you decide how to interpret data when you read it. This flexibility is powerful and dangerous in equal measure.
Data Lake vs Data Warehouse vs Lakehouse#
| | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Schema | On read | On write | On read + enforced |
| Data types | Structured, semi, unstructured | Structured only | All types |
| Storage cost | Low (object storage) | High (proprietary) | Low |
| Query performance | Variable | Optimized | Optimized |
| ACID transactions | No (without table format) | Yes | Yes |
| Governance | Manual | Built-in | Built-in |
| Examples | S3 + Athena | Snowflake, BigQuery | Databricks, Iceberg on S3 |
The lakehouse combines cheap lake storage with warehouse-grade reliability. It is where the industry is heading.
Zone Architecture#
A well-structured data lake uses zones to separate data by maturity:
Raw Zone (Landing / Bronze)#
- Exact copy of source data — no transformations
- Immutable: never modify raw data, only append
- Formats: JSON, CSV, XML, Avro — whatever the source produces
- Retention: keep forever (storage is cheap, re-ingestion is expensive)
Curated Zone (Cleaned / Silver)#
- Deduplicated, validated, typed, and conformed
- Schema enforced, nulls handled, data types standardized
- Joins across sources happen here
- Common format: Parquet or Delta Lake tables
Consumption Zone (Gold)#
- Business-ready aggregations, metrics, and feature tables
- Optimized for query performance (pre-joined, pre-aggregated)
- Feeds dashboards, ML models, and downstream APIs
- Access controlled per team or use case
Source → [Raw Zone] → [Curated Zone] → [Consumption Zone] → BI / ML / API
             Bronze        Silver            Gold
Schema-on-Read#
Traditional databases enforce schema at write time — if data does not match the schema, the write fails. Data lakes flip this: store first, interpret later.
Advantages:
- Ingest any data immediately without upfront modeling
- Multiple teams can apply different schemas to the same data
- Raw data is preserved for future use cases you have not imagined yet
Risks:
- "Data swamp" — without governance, nobody knows what data means
- Query failures at read time instead of write time
- Schema evolution is your responsibility, not the storage engine's
Mitigation: pair schema-on-read with a data catalog and table formats that track schema evolution.
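The write-now, interpret-later trade-off can be sketched in a few lines of plain Python. The records and field names below are hypothetical; the point is that type and presence errors surface at read time and must be routed somewhere deliberate (a quarantine location) rather than failing the original write.

```python
import json

# Hypothetical raw-zone records: stored as-is, nothing validated at write time.
raw_lines = [
    '{"order_id": "1001", "amount": "42.50", "region": "eu"}',
    '{"order_id": "1002", "amount": "nineteen", "region": "us"}',  # bad type
    '{"order_id": "1003", "region": "eu"}',                        # missing field
]

def read_with_schema(line):
    """Apply the schema at READ time; bad records surface here, not at ingestion."""
    record = json.loads(line)
    try:
        return {
            "order_id": int(record["order_id"]),
            "amount": float(record["amount"]),
            "region": str(record["region"]),
        }
    except (KeyError, ValueError):
        return None  # in practice: route to a quarantine / dead-letter location

parsed = [read_with_schema(line) for line in raw_lines]
valid = [r for r in parsed if r is not None]
print(f"{len(valid)} of {len(raw_lines)} records passed the read-time schema")
```

Note that all three writes "succeeded" — the two failures only become visible once a consumer imposes a schema, which is exactly why catalog documentation and quality gates matter.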
File Formats#
Choosing the right format has a massive impact on cost and performance.
Apache Parquet#
- Columnar storage — reads only the columns you query
- Excellent compression (Snappy, Zstd, Gzip)
- The default choice for analytical workloads
- Supported everywhere: Spark, Athena, BigQuery, Snowflake, DuckDB
Apache ORC#
- Columnar, optimized for Hive workloads
- Built-in lightweight indexes (min/max, bloom filters)
- Slightly better compression than Parquet in some benchmarks
- Strong in the Hadoop/Hive ecosystem, less universal than Parquet
Apache Avro#
- Row-based — better for write-heavy and streaming workloads
- Self-describing: schema embedded in the file
- Ideal for the raw zone and Kafka consumers
- Schema evolution is a first-class feature (add/remove fields safely)
Rule of thumb: Avro for ingestion (raw zone), Parquet for analytics (curated and consumption zones).
Partitioning Strategies#
Partitioning controls how data is physically organized on disk. Good partitioning means queries skip irrelevant files entirely (partition pruning).
Common strategies:
# Time-based (most common)
s3://lake/orders/year=2026/month=03/day=28/
# Category-based
s3://lake/events/region=eu/event_type=purchase/
# Composite
s3://lake/logs/year=2026/month=03/service=auth/
Guidelines:
- Partition by your most common filter column (usually date)
- Aim for 100 MB - 1 GB per partition for Parquet files
- Too many small partitions = "small file problem" (metadata overhead kills performance)
- Too few large partitions = full scans on every query
- Use hidden partitioning (Iceberg) to decouple partition layout from query syntax
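A minimal sketch of how Hive-style partition paths are built and why pruning works — the bucket and table names are illustrative, and real engines do the prefix filtering for you based on the `WHERE` clause:

```python
from datetime import date

def partition_path(bucket, table, day, extra=None):
    """Build a Hive-style partition prefix (layout assumed; adjust to taste)."""
    parts = [f"year={day.year:04d}", f"month={day.month:02d}", f"day={day.day:02d}"]
    if extra:
        parts += [f"{k}={v}" for k, v in extra.items()]
    return f"s3://{bucket}/{table}/" + "/".join(parts) + "/"

# Partition pruning in miniature: a date filter selects only matching prefixes,
# so files under the other partitions are never even listed, let alone read.
all_partitions = [partition_path("lake", "orders", date(2026, 3, d)) for d in (26, 27, 28)]
wanted = date(2026, 3, 28)
pruned = [p for p in all_partitions if f"day={wanted.day:02d}" in p]
print(pruned)
```

Hidden partitioning (Iceberg) removes even this coupling: queries filter on the raw timestamp column and the engine maps the predicate to partitions internally.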
Data Catalog#
A data catalog is the index that prevents your lake from becoming a swamp.
AWS Glue Data Catalog#
- Automatic schema crawling from S3
- Integrates with Athena, Redshift Spectrum, EMR
- Hive-compatible metastore
Databricks Unity Catalog#
- Unified governance across workspaces
- Fine-grained access control (row/column level)
- Data lineage tracking built in
- Works with Delta Lake natively
Apache Hive Metastore#
- Open-source, widely supported
- Foundation for Glue, Trino, Spark SQL
- Tracks table schemas, partitions, and locations
A catalog should answer: What data exists? Where is it? Who owns it? What does each field mean?
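Those four questions map directly onto the metadata a catalog entry holds. A toy in-memory version (table name, owner, and descriptions below are invented) makes the shape concrete — Glue and Unity Catalog store essentially these same facts, plus schemas and partitions:

```python
# Hypothetical catalog entry; real catalogs add schemas, partitions, lineage.
catalog = {
    "curated.orders": {
        "location": "s3://lake/curated/orders/",
        "owner": "payments-team",
        "columns": {
            "order_id": "unique order identifier",
            "amount": "order total in EUR",
        },
    }
}

def describe(table):
    """Answer: what exists, where is it, who owns it, what do fields mean?"""
    entry = catalog[table]
    return (f"{table} lives at {entry['location']}, "
            f"owned by {entry['owner']}, "
            f"columns: {', '.join(entry['columns'])}")

print(describe("curated.orders"))
```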
Governance#
Data governance is not optional at scale:
- Access control — who can read/write each zone and table
- Encryption — at rest (SSE-S3, SSE-KMS) and in transit (TLS)
- Audit logging — every read and write tracked (CloudTrail, Unity Catalog audit)
- Data lineage — trace any metric back to its raw source
- Quality checks — automated validation between zones (Great Expectations, dbt tests)
- Retention policies — PII expiration, GDPR right-to-deletion compliance
- Classification — tag columns as PII, financial, internal, public
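Classification tags become useful when they gate promotions between zones. A sketch, assuming a hypothetical column-to-tag map (in practice the tags live in the catalog and the check runs in the pipeline):

```python
# Hypothetical column classification; real tags would come from the catalog.
CLASSIFICATION = {
    "email": "pii",
    "card_number": "financial",
    "order_total": "internal",
    "country": "public",
}

def pii_columns(columns):
    """List columns tagged PII (unknown columns are treated as unclassified)."""
    return [c for c in columns if CLASSIFICATION.get(c) == "pii"]

def can_promote_to_gold(columns):
    """Block promotion to a broadly-readable zone while PII is still present."""
    return len(pii_columns(columns)) == 0

print(can_promote_to_gold(["email", "order_total"]))    # False: email is PII
print(can_promote_to_gold(["country", "order_total"]))  # True
```

The same pattern extends to GDPR deletion: a tag-driven scan finds every table carrying PII for a given subject, rather than relying on tribal knowledge.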
Open Table Formats#
Table formats add warehouse capabilities to lake storage:
Delta Lake#
- Created by Databricks, open-sourced
- ACID transactions on Parquet files
- Time travel (query data as of any past version)
- Schema enforcement and evolution
- OPTIMIZE and Z-ORDER for compaction and data skipping
- Deep Spark integration
Apache Iceberg#
- Created at Netflix, now an Apache project
- Hidden partitioning — partition evolution without rewriting data
- Snapshot isolation for concurrent reads and writes
- Engine-agnostic: works with Spark, Trino, Flink, Dremio, Snowflake
- Growing fastest in adoption (2025-2026)
Apache Hudi#
- Created at Uber for incremental data processing
- Excels at upserts and CDC (change data capture) workloads
- Record-level indexing for fast point lookups
- Near real-time ingestion pipelines
Iceberg is the safest bet for new projects due to broad engine support and active development. Delta Lake is the right choice if you are committed to the Databricks ecosystem.
Medallion Architecture (Bronze / Silver / Gold)#
The medallion architecture formalizes the zone pattern with clear transformation contracts:
Bronze (Raw)
- Source data as-is, append-only
- Metadata columns added: ingestion timestamp, source system, batch ID
- No business logic
Silver (Cleaned)
- Deduplication, null handling, type casting
- Conformed keys (e.g., all customer IDs normalized)
- Slowly changing dimensions applied
- Data quality checks gate promotion from bronze
Gold (Business)
- Star schema or wide denormalized tables
- Pre-computed aggregations and KPIs
- One gold table per business domain or use case
- SLA-backed freshness guarantees
Bronze ─── dbt/Spark jobs ──▶ Silver ─── dbt/Spark jobs ──▶ Gold
   │                             │                            │
Raw audit trail           Conformed entities           Dashboard-ready
Each layer has its own:
- Schema contracts (enforced by table format)
- Data quality tests (Great Expectations, dbt tests, Soda)
- Access policies (broader access at gold, restricted at bronze)
- Retention rules
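The bronze-to-silver contract can be illustrated without any engine at all. The records below are invented; the three operations — deduplication, key conformance, type casting — are the ones the silver layer promises, and in a real pipeline they run as dbt models or Spark jobs with quality checks gating the promotion:

```python
from datetime import datetime, timezone

# Hypothetical bronze records: append-only, duplicates and loose types allowed.
bronze = [
    {"customer_id": "C-001", "amount": "10.00", "_batch": 1},
    {"customer_id": "c-001", "amount": "10.00", "_batch": 1},  # dup, key case differs
    {"customer_id": "C-002", "amount": "7.50",  "_batch": 1},
]

def to_silver(records):
    """Dedupe, conform keys, cast types: the bronze → silver contract in miniature."""
    seen, silver = set(), []
    for r in records:
        key = r["customer_id"].upper()      # conform customer IDs
        if key in seen:
            continue                        # deduplicate on the conformed key
        seen.add(key)
        silver.append({
            "customer_id": key,
            "amount": float(r["amount"]),   # enforce types
            "processed_at": datetime.now(timezone.utc).isoformat(),
        })
    return silver

silver = to_silver(bronze)
print(len(silver))  # 2
```

Gold tables would then aggregate `silver` per business domain — at that point the data is clean enough that aggregation logic stays simple.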
Quick Reference#
| Decision | Recommendation |
|---|---|
| File format (raw) | Avro |
| File format (analytics) | Parquet |
| Table format | Apache Iceberg (or Delta Lake on Databricks) |
| Partition column | Date/time as primary partition |
| Partition size target | 100 MB - 1 GB |
| Catalog | Unity Catalog (Databricks) or Glue (AWS) |
| Quality framework | Great Expectations or dbt tests |
| Architecture pattern | Medallion (bronze/silver/gold) |
A data lake without architecture is a data swamp. Zones, catalogs, table formats, and governance transform cheap object storage into a reliable analytics platform.
Article #176 in the Codelit engineering series.