Data Governance Architecture: Catalogs, Lineage, Quality & Compliance
Every organization generates more data than it can manually track. Without governance, teams lose trust in their data, compliance violations go unnoticed, and duplicated pipelines waste engineering time. Data governance is the set of policies, processes, and architecture that ensures data is discoverable, accurate, secure, and compliant throughout its lifecycle.
Core Pillars of Data Governance
┌────────────────────────────────────────────────────────┐
│ Data Governance │
├──────────┬──────────┬───────────┬──────────┬──────────┤
│ Discovery│ Quality │ Lineage │ Access │Compliance│
│ (Catalog)│ (Checks) │ (Tracking)│ (Control)│ (Privacy)│
└──────────┴──────────┴───────────┴──────────┴──────────┘
Each pillar reinforces the others. A catalog without lineage cannot answer "where did this number come from?" Lineage without quality checks cannot tell you if the number is correct. Quality without access control cannot prevent unauthorized changes.
Data Catalog
A data catalog is the searchable inventory of every dataset, table, column, dashboard, and pipeline in the organization.
What a Catalog Contains
- Technical metadata — Schema definitions, column types, partitioning, storage location, freshness timestamps.
- Business metadata — Human-readable descriptions, domain ownership, business glossary terms, classification tags.
- Operational metadata — Pipeline run history, query frequency, top consumers, cost attribution.
- Social metadata — User-generated annotations, questions, tribal knowledge captured in comments.
Catalog Architecture
Most modern catalogs follow a pull-based ingestion model:
- Connectors crawl source systems (warehouses, lakes, BI tools, orchestrators) on a schedule or via change events.
- An ingestion service normalizes metadata into a unified graph model.
- A search index (typically Elasticsearch or OpenSearch) powers full-text and faceted discovery.
- A graph store (Neo4j, Neptune, or an in-house property graph) powers lineage traversal and impact analysis.
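The ingestion flow above can be sketched in a few lines. This is a minimal illustration with hypothetical class and field names, not the data model of any specific catalog; real platforms use far richer metadata schemas and a dedicated search index and graph store rather than in-memory dicts.

```python
from dataclasses import dataclass, field

# Hypothetical normalized asset record produced by a connector.
@dataclass
class Asset:
    urn: str                   # unique ID, e.g. "warehouse.analytics.daily_revenue"
    asset_type: str            # "table", "dashboard", "pipeline", ...
    description: str = ""
    tags: list = field(default_factory=list)

class CatalogIngestion:
    """Normalizes connector output into a search index and a lineage graph."""

    def __init__(self):
        self.search_index = {}  # urn -> document for full-text search
        self.graph = {}         # upstream urn -> set of downstream urns

    def ingest(self, asset: Asset, upstream_urns=()):
        # Feed the search index with everything a user might query for.
        self.search_index[asset.urn] = {
            "urn": asset.urn,
            "type": asset.asset_type,
            "text": f"{asset.urn} {asset.description} {' '.join(asset.tags)}",
        }
        # Record lineage edges for traversal and impact analysis.
        for up in upstream_urns:
            self.graph.setdefault(up, set()).add(asset.urn)

    def search(self, term: str):
        term = term.lower()
        return [d["urn"] for d in self.search_index.values() if term in d["text"].lower()]
```

The key design point is that one normalized record feeds two stores: a text index for discovery and a graph for lineage traversal.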
Data Lineage
Lineage tracks how data flows from source to destination — across ingestion, transformation, aggregation, and presentation layers.
Lineage Granularity
| Level | Tracks | Example |
|---|---|---|
| Table-level | Which tables feed which tables | raw.orders feeds analytics.daily_revenue |
| Column-level | Which columns map to which columns | raw.orders.amount maps to analytics.daily_revenue.total_amount |
| Row-level | Which specific records were affected | Record ID 42 was included in the March 15 snapshot |
Column-level lineage is the sweet spot for most organizations. It answers "if I change this column, what dashboards break?" without the storage overhead of row-level tracking.
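Answering "what breaks downstream?" is a graph traversal over column-level lineage edges. A minimal sketch, assuming a toy adjacency map of hypothetical "table.column" identifiers:

```python
from collections import deque

# Toy column-level lineage graph; edges point downstream.
LINEAGE = {
    "raw.orders.amount": ["analytics.daily_revenue.total_amount"],
    "analytics.daily_revenue.total_amount": ["dashboards.revenue_kpi.total"],
}

def impact_analysis(column: str) -> set:
    """Return every downstream column affected by a change to `column` (BFS)."""
    affected, queue = set(), deque([column])
    while queue:
        for downstream in LINEAGE.get(queue.popleft(), []):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected
```

Running `impact_analysis("raw.orders.amount")` walks transitively to the dashboard column, which is exactly the question a catalog's impact-analysis view answers.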
Lineage Collection Methods
- SQL parsing — Static analysis of SQL queries to extract source and target relationships. Tools like sqllineage or sqlglot parse transformation logic without executing it.
- Runtime instrumentation — Spark, Airflow, and dbt emit lineage events during execution. OpenLineage provides a standard specification for these events.
- API-based extraction — BI tools (Tableau, Looker, Power BI) expose lineage through their metadata APIs.
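To make the SQL-parsing approach concrete, here is a deliberately crude sketch that pulls table-level lineage out of an `INSERT ... SELECT` statement. Production tools use a real SQL parser (sqlglot, sqllineage); a regex only illustrates the idea and misses CTEs, subqueries, and quoted identifiers.

```python
import re

def extract_table_lineage(sql: str):
    """Rough table-level lineage from an INSERT ... SELECT statement.
    Returns (target_table, sorted list of source tables)."""
    target = re.search(r"insert\s+into\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return (target.group(1) if target else None, sorted(set(sources)))
```

Even this toy extractor recovers the table-level edge `raw.orders feeds analytics.daily_revenue` from the earlier example; a parser-based tool additionally resolves aliases and column mappings.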
Data Quality
Governance without quality is governance on paper. Data quality frameworks continuously validate that data meets defined expectations.
Quality Dimensions
- Completeness — Are required fields populated? What percentage of rows have null values?
- Accuracy — Do values match the real-world entity they represent?
- Consistency — Do related datasets agree? Does the order count in the warehouse match the source system?
- Timeliness — Is the data fresh enough for its intended use?
- Uniqueness — Are there duplicate records that inflate counts?
- Validity — Do values conform to expected formats, ranges, and referential constraints?
Implementing Quality Checks
A quality framework typically operates in three layers:
- Schema validation — Enforced at ingestion time. Rejects records that violate type or nullability constraints.
- Statistical profiling — Runs after ingestion. Detects anomalies in distributions, cardinality, and freshness.
- Business rules — Domain-specific assertions (e.g., "revenue must be non-negative", "every order must reference a valid customer").
Tools like Great Expectations, dbt tests, Soda, and Monte Carlo automate these checks and surface failures in dashboards and alerts.
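A minimal sketch of these checks in plain Python, using invented sample rows. Frameworks like Great Expectations or dbt tests express the same assertions declaratively and handle scheduling, history, and alerting.

```python
rows = [
    {"order_id": 1, "customer_id": "c1", "revenue": 120.0},
    {"order_id": 2, "customer_id": "c2", "revenue": 80.0},
    {"order_id": 2, "customer_id": None, "revenue": -5.0},  # duplicate ID, null FK, bad value
]

def check_completeness(rows, column):
    """Completeness: fraction of rows with a non-null value."""
    populated = sum(1 for r in rows if r.get(column) is not None)
    return populated / len(rows)

def check_uniqueness(rows, column):
    """Uniqueness: no duplicate values in the column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_business_rule(rows):
    """Domain assertion: revenue must be non-negative."""
    return all(r["revenue"] >= 0 for r in rows)

results = {
    "customer_id completeness": check_completeness(rows, "customer_id"),
    "order_id unique": check_uniqueness(rows, "order_id"),
    "revenue non-negative": check_business_rule(rows),
}
```

In practice each failing check would emit an alert and annotate the affected asset in the catalog, so consumers see quality status next to the data they discover.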
Access Control
Data governance must enforce who can see, modify, and share data.
Access Control Models
- Role-Based Access Control (RBAC) — Permissions are assigned to roles, and users are assigned to roles. Simple but coarse-grained.
- Attribute-Based Access Control (ABAC) — Policies evaluate attributes of the user, the resource, and the environment. Fine-grained but complex to manage.
- Tag-Based Access Control — Columns or tables are tagged (e.g., PII, confidential), and policies reference tags. Scales well with catalogs that auto-classify data.
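Tag-based policies reduce to a set comparison: a user may read a column only if their role covers every tag on that column. A sketch with hypothetical tags and roles:

```python
# Hypothetical catalog classifications and role grants.
COLUMN_TAGS = {
    "users.email": {"PII"},
    "users.signup_date": set(),
    "orders.card_number": {"PII", "confidential"},
}
ROLE_ALLOWED_TAGS = {
    "analyst": set(),                          # no sensitive tags granted
    "privacy_officer": {"PII", "confidential"},
}

def can_read(role: str, column: str) -> bool:
    """Allow access only if the role covers all tags on the column."""
    required = COLUMN_TAGS.get(column, set())
    return required <= ROLE_ALLOWED_TAGS.get(role, set())
```

Because policies reference tags rather than individual columns, auto-classifying a new PII column in the catalog immediately puts it under the existing policy.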
Column-Level and Row-Level Security
Modern warehouses (Snowflake, BigQuery, Databricks) support:
- Column masking — Sensitive columns are dynamically masked or tokenized based on the querying user's role.
- Row-level security — Policies filter rows so users only see data they are authorized to access (e.g., a regional manager sees only their region).
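Both mechanisms can be sketched in plain Python. The row schema and region semantics here are invented for illustration; warehouses apply the equivalent logic declaratively, inside the query engine, so it cannot be bypassed.

```python
def mask_email(value: str) -> str:
    """Dynamic masking: keep the first character and the domain."""
    local, _, domain = value.partition("@")
    return local[0] + "***@" + domain

def apply_policies(rows, user):
    """Row-level security (region filter) plus column masking (email)."""
    visible = [r for r in rows if user["region"] == "*" or r["region"] == user["region"]]
    if not user.get("can_see_pii"):
        visible = [{**r, "email": mask_email(r["email"])} for r in visible]
    return visible
```

A regional manager without PII access sees only their region's rows, with emails masked; a privacy officer with `can_see_pii` set would see the raw values.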
PII Detection and Classification
Personally Identifiable Information must be identified before it can be governed.
Detection Approaches
- Pattern matching — Regex for emails, phone numbers, SSNs, credit card numbers. Fast but limited to known formats.
- Named Entity Recognition (NER) — ML models detect names, addresses, and other entities in free-text fields.
- Metadata heuristics — Column names like email, ssn, and phone are strong signals.
- Sampling and profiling — Statistical analysis of column values to detect high-cardinality string fields that may contain PII.
Best practice: combine all four approaches and surface results in the data catalog with confidence scores. Human stewards review and confirm classifications.
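A sketch of combining two of these signals into a confidence score. The weights and name hints are arbitrary choices for illustration; a real classifier would add NER and statistical profiling, and route mid-confidence results to human stewards.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
NAME_HINTS = {"email", "ssn", "phone", "address"}

def classify_column(name: str, sample_values) -> float:
    """Combine metadata heuristics and pattern matching into a rough
    PII confidence score in [0, 1]."""
    score = 0.0
    # Metadata heuristic: suspicious column name.
    if any(hint in name.lower() for hint in NAME_HINTS):
        score += 0.5
    # Pattern matching: fraction of sampled values that look like emails.
    matches = sum(1 for v in sample_values if EMAIL_RE.search(str(v)))
    score += 0.5 * (matches / len(sample_values)) if sample_values else 0.0
    return min(score, 1.0)
```

A column named `contact_email` whose samples all match the email pattern scores 1.0; an integer column like `order_count` scores 0.0 and needs no review.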
GDPR and CCPA Compliance
Privacy regulations impose specific obligations on how personal data is collected, stored, processed, and deleted.
Key Requirements
| Requirement | GDPR | CCPA |
|---|---|---|
| Right to access | Data subjects can request a copy of their data | Consumers can request disclosure of collected data |
| Right to deletion | "Right to be forgotten" — erase data upon request | Consumers can request deletion of personal information |
| Data portability | Provide data in a machine-readable format | Not explicitly required |
| Consent management | Explicit opt-in consent required | Opt-out model (consumers can opt out of sale) |
| Breach notification | 72-hour notification to authorities | "Reasonable" timeframe |
Architectural Patterns for Compliance
- Data inventory — Maintain a catalog of all personal data, its purpose, legal basis, and retention period.
- Consent ledger — An immutable log of consent grants and revocations, linked to data processing activities.
- Automated deletion pipelines — When a deletion request arrives, propagate it through every system that holds the subject's data. Lineage tracking identifies all downstream copies.
- Pseudonymization — Replace direct identifiers with tokens. Store the mapping in a separate, tightly controlled vault.
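The deletion-pipeline pattern is where lineage pays off directly: the lineage graph tells you every system that must process the request. A sketch, assuming a hypothetical map of downstream copies:

```python
from collections import deque

# Hypothetical lineage: each dataset -> datasets holding derived copies.
DOWNSTREAM = {
    "app_db.users": ["warehouse.dim_users", "crm.contacts"],
    "warehouse.dim_users": ["warehouse.marketing_segments"],
}

def deletion_plan(root: str):
    """Walk lineage (BFS) to list every dataset that must process
    a deletion request, starting from the system of record."""
    plan, queue, seen = [], deque([root]), {root}
    while queue:
        node = queue.popleft()
        plan.append(node)
        for ds in DOWNSTREAM.get(node, []):
            if ds not in seen:
                seen.add(ds)
                queue.append(ds)
    return plan
```

In a real pipeline each entry in the plan becomes a tracked deletion task, with completion evidence retained for auditors.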
Data Mesh Governance
In a data mesh, domain teams own their data products. Governance must be federated — each domain enforces local policies — but interoperable — a global standard ensures discoverability and compatibility.
Federated Governance Model
┌─────────────────────────────────────────────┐
│ Global Governance Council │
│ (Standards, Glossary, Compliance Policies) │
├─────────────┬─────────────┬─────────────────┤
│ Domain A │ Domain B │ Domain C │
│ Data Owner │ Data Owner │ Data Owner │
│ Local │ Local │ Local │
│ Stewards │ Stewards │ Stewards │
└─────────────┴─────────────┴─────────────────┘
- The global council defines interoperability standards: naming conventions, SLA definitions, classification taxonomies, and compliance baselines.
- Domain teams implement those standards within their data products and are accountable for quality, documentation, and access policies.
- A self-serve data platform provides shared tooling — catalog, lineage, quality checks, access control — so domains do not reinvent the wheel.
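Global standards become enforceable when the self-serve platform validates every data product against them at registration time. A sketch with invented standards (naming pattern, required fields, classification taxonomy):

```python
import re

# Hypothetical standards a global governance council might publish.
NAMING_PATTERN = re.compile(r"^[a-z]+\.[a-z_]+\.[a-z_]+$")  # domain.layer.name
REQUIRED_FIELDS = {"owner", "sla_hours", "classification"}
VALID_CLASSIFICATIONS = {"public", "internal", "confidential", "pii"}

def validate_data_product(descriptor: dict):
    """Check a domain's data product descriptor against global standards;
    return a list of violations (empty means compliant)."""
    errors = []
    if not NAMING_PATTERN.match(descriptor.get("name", "")):
        errors.append("name violates naming convention")
    missing = REQUIRED_FIELDS - descriptor.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if descriptor.get("classification") not in VALID_CLASSIFICATIONS:
        errors.append("unknown classification")
    return errors
```

Domains stay autonomous in what they build, while the platform guarantees every product is discoverable, owned, and classified the same way.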
Tools Landscape
Collibra
Enterprise-grade governance platform. Strengths: business glossary, policy management, workflow automation, regulatory compliance modules. Best for large organizations with dedicated data governance teams.
Atlan
Modern, developer-friendly catalog. Strengths: active metadata, embedded collaboration (Slack-like threads on assets), integrations with dbt, Airflow, Snowflake, and Looker. Best for data teams that want governance without heavy process overhead.
DataHub (open-sourced by LinkedIn)
Open-source metadata platform. Strengths: extensible metadata model, real-time ingestion via Kafka, GraphQL API, column-level lineage. Best for engineering-driven organizations that want full control and extensibility.
Comparison
| Capability | Collibra | Atlan | DataHub |
|---|---|---|---|
| Data catalog | Yes | Yes | Yes |
| Column-level lineage | Yes | Yes | Yes |
| Business glossary | Strong | Good | Basic |
| Data quality integration | Native | Via partners | Via plugins |
| Access control policies | Native | Tag-based | Plugin-based |
| Deployment | SaaS | SaaS | Self-hosted / Acryl Cloud |
| Pricing | Enterprise | Mid-market | Free (OSS) |
Building a Governance Program
- Start with discovery — Deploy a catalog and ingest metadata from your top 5 data sources.
- Classify sensitive data — Run PII detection and tag columns in the catalog.
- Define ownership — Assign a data owner and steward to every critical dataset.
- Automate quality — Add quality checks to your transformation pipelines (dbt tests, Great Expectations).
- Enable lineage — Instrument your orchestrator and warehouse to emit lineage events.
- Enforce access — Implement tag-based access control tied to catalog classifications.
- Monitor and iterate — Track governance adoption metrics: catalog coverage, quality score trends, access review completion rates.
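The adoption metrics in the last step are simple ratios over catalog assets. A sketch, assuming each asset is a dict of its (possibly missing) governance fields:

```python
def coverage_metrics(assets):
    """Governance adoption metrics: share of catalog assets with an owner,
    a description, and a classification tag."""
    n = len(assets)

    def share(key):
        return round(sum(1 for a in assets if a.get(key)) / n, 2)

    return {
        "ownership_coverage": share("owner"),
        "documentation_coverage": share("description"),
        "classification_coverage": share("classification"),
    }
```

Tracking these numbers per domain over time shows whether governance is actually being adopted or only mandated.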
Conclusion
Data governance is not a one-time project. It is an ongoing practice that grows with your data estate. Start with a catalog, add lineage and quality, enforce access and compliance, and let the architecture evolve as your organization's data maturity increases. The tools exist — the challenge is building the culture and processes to use them consistently.
Article #317 of the Codelit system design series. Explore all articles at codelit.io.