Data Lake Architecture: From Raw Ingestion to Production-Ready Analytics
A data lake stores raw data at any scale in its native format. Unlike a data warehouse, it does not enforce schema at write time — you decide how to interpret data when you read it. This flexibility is powerful and dangerous in equal measure.
Data Lake vs Data Warehouse vs Lakehouse#
| | Data Lake | Data Warehouse | Lakehouse |
|---|---|---|---|
| Schema | On read | On write | On read + enforced |
| Data types | Structured, semi, unstructured | Structured only | All types |
| Storage cost | Low (object storage) | High (proprietary) | Low |
| Query performance | Variable | Optimized | Optimized |
| ACID transactions | No (without table format) | Yes | Yes |
| Governance | Manual | Built-in | Built-in |
| Examples | S3 + Athena | Snowflake, BigQuery | Databricks, Iceberg on S3 |
The lakehouse combines cheap lake storage with warehouse-grade reliability. It is where the industry is heading.
Zone Architecture#
A well-structured data lake uses zones to separate data by maturity:
Raw Zone (Landing / Bronze)#
- Exact copy of source data — no transformations
- Immutable: never modify raw data, only append
- Formats: JSON, CSV, XML, Avro — whatever the source produces
- Retention: keep forever (storage is cheap, re-ingestion is expensive)
Curated Zone (Cleaned / Silver)#
- Deduplicated, validated, typed, and conformed
- Schema enforced, nulls handled, data types standardized
- Joins across sources happen here
- Common format: Parquet or Delta Lake tables
Consumption Zone (Gold)#
- Business-ready aggregations, metrics, and feature tables
- Optimized for query performance (pre-joined, pre-aggregated)
- Feeds dashboards, ML models, and downstream APIs
- Access controlled per team or use case
Source → [Raw Zone] → [Curated Zone] → [Consumption Zone] → BI / ML / API
             Bronze        Silver            Gold
Schema-on-Read#
Traditional databases enforce schema at write time — if data does not match the schema, the write fails. Data lakes flip this: store first, interpret later.
Advantages:
- Ingest any data immediately without upfront modeling
- Multiple teams can apply different schemas to the same data
- Raw data is preserved for future use cases you have not imagined yet
Risks:
- "Data swamp" — without governance, nobody knows what data means
- Query failures at read time instead of write time
- Schema evolution is your responsibility, not the storage engine's
Mitigation: pair schema-on-read with a data catalog and table formats that track schema evolution.
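The write-now, interpret-later trade-off can be sketched in a few lines of plain Python. The records and field names below are hypothetical; the point is that type and presence errors surface at read time and must be routed somewhere deliberate (a quarantine location) rather than failing the original write.

```python
import json

# Hypothetical raw-zone records: stored as-is, nothing validated at write time.
raw_lines = [
    '{"order_id": "1001", "amount": "42.50", "region": "eu"}',
    '{"order_id": "1002", "amount": "nineteen", "region": "us"}',  # bad type
    '{"order_id": "1003", "region": "eu"}',                        # missing field
]

def read_with_schema(line):
    """Apply the schema at READ time; bad records surface here, not at ingestion."""
    record = json.loads(line)
    try:
        return {
            "order_id": int(record["order_id"]),
            "amount": float(record["amount"]),
            "region": str(record["region"]),
        }
    except (KeyError, ValueError):
        return None  # in practice: route to a quarantine / dead-letter location

parsed = [read_with_schema(line) for line in raw_lines]
valid = [r for r in parsed if r is not None]
print(f"{len(valid)} of {len(raw_lines)} records passed the read-time schema")
```

Note that all three writes "succeeded" — the two failures only become visible once a consumer imposes a schema, which is exactly why catalog documentation and quality gates matter.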
File Formats#
Choosing the right format has a massive impact on cost and performance.
Apache Parquet#
- Columnar storage — reads only the columns you query
- Excellent compression (Snappy, Zstd, Gzip)
- The default choice for analytical workloads
- Supported everywhere: Spark, Athena, BigQuery, Snowflake, DuckDB
Apache ORC#
- Columnar, optimized for Hive workloads
- Built-in lightweight indexes (min/max, bloom filters)
- Slightly better compression than Parquet in some benchmarks
- Strong in the Hadoop/Hive ecosystem, less universal than Parquet
Apache Avro#
- Row-based — better for write-heavy and streaming workloads
- Self-describing: schema embedded in the file
- Ideal for the raw zone and Kafka consumers
- Schema evolution is a first-class feature (add/remove fields safely)
Rule of thumb: Avro for ingestion (raw zone), Parquet for analytics (curated and consumption zones).
Partitioning Strategies#
Partitioning controls how data is physically organized on disk. Good partitioning means queries skip irrelevant files entirely (partition pruning).
Common strategies:
# Time-based (most common)
s3://lake/orders/year=2026/month=03/day=28/
# Category-based
s3://lake/events/region=eu/event_type=purchase/
# Composite
s3://lake/logs/year=2026/month=03/service=auth/
Guidelines:
- Partition by your most common filter column (usually date)
- Aim for 100 MB - 1 GB per partition for Parquet files
- Too many small partitions = "small file problem" (metadata overhead kills performance)
- Too few large partitions = full scans on every query
- Use hidden partitioning (Iceberg) to decouple partition layout from query syntax
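A minimal sketch of how Hive-style partition paths are built and why pruning works — the bucket and table names are illustrative, and real engines do the prefix filtering for you based on the `WHERE` clause:

```python
from datetime import date

def partition_path(bucket, table, day, extra=None):
    """Build a Hive-style partition prefix (layout assumed; adjust to taste)."""
    parts = [f"year={day.year:04d}", f"month={day.month:02d}", f"day={day.day:02d}"]
    if extra:
        parts += [f"{k}={v}" for k, v in extra.items()]
    return f"s3://{bucket}/{table}/" + "/".join(parts) + "/"

# Partition pruning in miniature: a date filter selects only matching prefixes,
# so files under the other partitions are never even listed, let alone read.
all_partitions = [partition_path("lake", "orders", date(2026, 3, d)) for d in (26, 27, 28)]
wanted = date(2026, 3, 28)
pruned = [p for p in all_partitions if f"day={wanted.day:02d}" in p]
print(pruned)
```

Hidden partitioning (Iceberg) removes even this coupling: queries filter on the raw timestamp column and the engine maps the predicate to partitions internally.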
Data Catalog#
A data catalog is the index that prevents your lake from becoming a swamp.
AWS Glue Data Catalog#
- Automatic schema crawling from S3
- Integrates with Athena, Redshift Spectrum, EMR
- Hive-compatible metastore
Databricks Unity Catalog#
- Unified governance across workspaces
- Fine-grained access control (row/column level)
- Data lineage tracking built in
- Works with Delta Lake natively
Apache Hive Metastore#
- Open-source, widely supported
- Foundation for Glue, Trino, Spark SQL
- Tracks table schemas, partitions, and locations
A catalog should answer: What data exists? Where is it? Who owns it? What does each field mean?
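Those four questions map directly onto the metadata a catalog entry holds. A toy in-memory version (table name, owner, and descriptions below are invented) makes the shape concrete — Glue and Unity Catalog store essentially these same facts, plus schemas and partitions:

```python
# Hypothetical catalog entry; real catalogs add schemas, partitions, lineage.
catalog = {
    "curated.orders": {
        "location": "s3://lake/curated/orders/",
        "owner": "payments-team",
        "columns": {
            "order_id": "unique order identifier",
            "amount": "order total in EUR",
        },
    }
}

def describe(table):
    """Answer: what exists, where is it, who owns it, what do fields mean?"""
    entry = catalog[table]
    return (f"{table} lives at {entry['location']}, "
            f"owned by {entry['owner']}, "
            f"columns: {', '.join(entry['columns'])}")

print(describe("curated.orders"))
```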
Governance#
Data governance is not optional at scale:
- Access control — who can read/write each zone and table
- Encryption — at rest (SSE-S3, SSE-KMS) and in transit (TLS)
- Audit logging — every read and write tracked (CloudTrail, Unity Catalog audit)
- Data lineage — trace any metric back to its raw source
- Quality checks — automated validation between zones (Great Expectations, dbt tests)
- Retention policies — PII expiration, GDPR right-to-deletion compliance
- Classification — tag columns as PII, financial, internal, public
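Classification tags become useful when they gate promotions between zones. A sketch, assuming a hypothetical column-to-tag map (in practice the tags live in the catalog and the check runs in the pipeline):

```python
# Hypothetical column classification; real tags would come from the catalog.
CLASSIFICATION = {
    "email": "pii",
    "card_number": "financial",
    "order_total": "internal",
    "country": "public",
}

def pii_columns(columns):
    """List columns tagged PII (unknown columns are treated as unclassified)."""
    return [c for c in columns if CLASSIFICATION.get(c) == "pii"]

def can_promote_to_gold(columns):
    """Block promotion to a broadly-readable zone while PII is still present."""
    return len(pii_columns(columns)) == 0

print(can_promote_to_gold(["email", "order_total"]))    # False: email is PII
print(can_promote_to_gold(["country", "order_total"]))  # True
```

The same pattern extends to GDPR deletion: a tag-driven scan finds every table carrying PII for a given subject, rather than relying on tribal knowledge.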
Open Table Formats#
Table formats add warehouse capabilities to lake storage:
Delta Lake#
- Created by Databricks, open-sourced
- ACID transactions on Parquet files
- Time travel (query data as of any past version)
- Schema enforcement and evolution
- OPTIMIZE and Z-ORDER for compaction and data skipping
- Deep Spark integration
Apache Iceberg#
- Created at Netflix, now an Apache project
- Hidden partitioning — partition evolution without rewriting data
- Snapshot isolation for concurrent reads and writes
- Engine-agnostic: works with Spark, Trino, Flink, Dremio, Snowflake
- Growing fastest in adoption (2025-2026)
Apache Hudi#
- Created at Uber for incremental data processing
- Excels at upserts and CDC (change data capture) workloads
- Record-level indexing for fast point lookups
- Near real-time ingestion pipelines
Iceberg is the safest bet for new projects due to broad engine support and active development. Delta Lake is the right choice if you are committed to the Databricks ecosystem.
Medallion Architecture (Bronze / Silver / Gold)#
The medallion architecture formalizes the zone pattern with clear transformation contracts:
Bronze (Raw)
- Source data as-is, append-only
- Metadata columns added: ingestion timestamp, source system, batch ID
- No business logic
Silver (Cleaned)
- Deduplication, null handling, type casting
- Conformed keys (e.g., all customer IDs normalized)
- Slowly changing dimensions applied
- Data quality checks gate promotion from bronze
Gold (Business)
- Star schema or wide denormalized tables
- Pre-computed aggregations and KPIs
- One gold table per business domain or use case
- SLA-backed freshness guarantees
Bronze ─── dbt/Spark jobs ──▶ Silver ─── dbt/Spark jobs ──▶ Gold
   │                             │                            │
Raw audit trail           Conformed entities           Dashboard-ready
Each layer has its own:
- Schema contracts (enforced by table format)
- Data quality tests (Great Expectations, dbt tests, Soda)
- Access policies (broader access at gold, restricted at bronze)
- Retention rules
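The bronze-to-silver contract can be illustrated without any engine at all. The records below are invented; the three operations — deduplication, key conformance, type casting — are the ones the silver layer promises, and in a real pipeline they run as dbt models or Spark jobs with quality checks gating the promotion:

```python
from datetime import datetime, timezone

# Hypothetical bronze records: append-only, duplicates and loose types allowed.
bronze = [
    {"customer_id": "C-001", "amount": "10.00", "_batch": 1},
    {"customer_id": "c-001", "amount": "10.00", "_batch": 1},  # dup, key case differs
    {"customer_id": "C-002", "amount": "7.50",  "_batch": 1},
]

def to_silver(records):
    """Dedupe, conform keys, cast types: the bronze → silver contract in miniature."""
    seen, silver = set(), []
    for r in records:
        key = r["customer_id"].upper()      # conform customer IDs
        if key in seen:
            continue                        # deduplicate on the conformed key
        seen.add(key)
        silver.append({
            "customer_id": key,
            "amount": float(r["amount"]),   # enforce types
            "processed_at": datetime.now(timezone.utc).isoformat(),
        })
    return silver

silver = to_silver(bronze)
print(len(silver))  # 2
```

Gold tables would then aggregate `silver` per business domain — at that point the data is clean enough that aggregation logic stays simple.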
Quick Reference#
| Decision | Recommendation |
|---|---|
| File format (raw) | Avro |
| File format (analytics) | Parquet |
| Table format | Apache Iceberg (or Delta Lake on Databricks) |
| Partition column | Date/time as primary partition |
| Partition size target | 100 MB - 1 GB |
| Catalog | Unity Catalog (Databricks) or Glue (AWS) |
| Quality framework | Great Expectations or dbt tests |
| Architecture pattern | Medallion (bronze/silver/gold) |
A data lake without architecture is a data swamp. Zones, catalogs, table formats, and governance transform cheap object storage into a reliable analytics platform.
Article #176 in the Codelit engineering series.