Data Governance Architecture: Catalogs, Lineage, Quality & Compliance
Every organization generates more data than it can manually track. Without governance, teams lose trust in their data, compliance violations go unnoticed, and duplicated pipelines waste engineering time. Data governance is the set of policies, processes, and architecture that ensures data is discoverable, accurate, secure, and compliant throughout its lifecycle.
Core Pillars of Data Governance
┌────────────────────────────────────────────────────────┐
│ Data Governance │
├──────────┬──────────┬───────────┬──────────┬──────────┤
│ Discovery│ Quality │ Lineage │ Access │Compliance│
│ (Catalog)│ (Checks) │ (Tracking)│ (Control)│ (Privacy)│
└──────────┴──────────┴───────────┴──────────┴──────────┘
Each pillar reinforces the others. A catalog without lineage cannot answer "where did this number come from?" Lineage without quality checks cannot tell you if the number is correct. Quality without access control cannot prevent unauthorized changes.
Data Catalog
A data catalog is the searchable inventory of every dataset, table, column, dashboard, and pipeline in the organization.
What a Catalog Contains
- Technical metadata — Schema definitions, column types, partitioning, storage location, freshness timestamps.
- Business metadata — Human-readable descriptions, domain ownership, business glossary terms, classification tags.
- Operational metadata — Pipeline run history, query frequency, top consumers, cost attribution.
- Social metadata — User-generated annotations, questions, tribal knowledge captured in comments.
Catalog Architecture
Most modern catalogs follow a pull-based ingestion model:
- Connectors crawl source systems (warehouses, lakes, BI tools, orchestrators) on a schedule or via change events.
- An ingestion service normalizes metadata into a unified graph model.
- A search index (typically Elasticsearch or OpenSearch) powers full-text and faceted discovery.
- A graph store (Neo4j, Neptune, or an in-house property graph) powers lineage traversal and impact analysis.
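The ingestion flow above can be sketched in a few lines. This is a minimal illustration with hypothetical class and field names, not the data model of any specific catalog; real platforms use far richer metadata schemas and a dedicated search index and graph store rather than in-memory dicts.

```python
from dataclasses import dataclass, field

# Hypothetical normalized asset record produced by a connector.
@dataclass
class Asset:
    urn: str                   # unique ID, e.g. "warehouse.analytics.daily_revenue"
    asset_type: str            # "table", "dashboard", "pipeline", ...
    description: str = ""
    tags: list = field(default_factory=list)

class CatalogIngestion:
    """Normalizes connector output into a search index and a lineage graph."""

    def __init__(self):
        self.search_index = {}  # urn -> document for full-text search
        self.graph = {}         # upstream urn -> set of downstream urns

    def ingest(self, asset: Asset, upstream_urns=()):
        # Feed the search index with everything a user might query for.
        self.search_index[asset.urn] = {
            "urn": asset.urn,
            "type": asset.asset_type,
            "text": f"{asset.urn} {asset.description} {' '.join(asset.tags)}",
        }
        # Record lineage edges for traversal and impact analysis.
        for up in upstream_urns:
            self.graph.setdefault(up, set()).add(asset.urn)

    def search(self, term: str):
        term = term.lower()
        return [d["urn"] for d in self.search_index.values() if term in d["text"].lower()]
```

The key design point is that one normalized record feeds two stores: a text index for discovery and a graph for lineage traversal.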
Data Lineage
Lineage tracks how data flows from source to destination — across ingestion, transformation, aggregation, and presentation layers.
Lineage Granularity
| Level | Tracks | Example |
|---|---|---|
| Table-level | Which tables feed which tables | raw.orders feeds analytics.daily_revenue |
| Column-level | Which columns map to which columns | raw.orders.amount maps to analytics.daily_revenue.total_amount |
| Row-level | Which specific records were affected | Record ID 42 was included in the March 15 snapshot |
Column-level lineage is the sweet spot for most organizations. It answers "if I change this column, what dashboards break?" without the storage overhead of row-level tracking.
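Answering "what breaks downstream?" is a graph traversal over column-level lineage edges. A minimal sketch, assuming a toy adjacency map of hypothetical "table.column" identifiers:

```python
from collections import deque

# Toy column-level lineage graph; edges point downstream.
LINEAGE = {
    "raw.orders.amount": ["analytics.daily_revenue.total_amount"],
    "analytics.daily_revenue.total_amount": ["dashboards.revenue_kpi.total"],
}

def impact_analysis(column: str) -> set:
    """Return every downstream column affected by a change to `column` (BFS)."""
    affected, queue = set(), deque([column])
    while queue:
        for downstream in LINEAGE.get(queue.popleft(), []):
            if downstream not in affected:
                affected.add(downstream)
                queue.append(downstream)
    return affected
```

Running `impact_analysis("raw.orders.amount")` walks transitively to the dashboard column, which is exactly the question a catalog's impact-analysis view answers.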
Lineage Collection Methods
- SQL parsing — Static analysis of SQL queries to extract source and target relationships. Tools like sqllineage or sqlglot parse transformation logic without executing it.
- Runtime instrumentation — Spark, Airflow, and dbt emit lineage events during execution. OpenLineage provides a standard specification for these events.
- API-based extraction — BI tools (Tableau, Looker, Power BI) expose lineage through their metadata APIs.
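To make the SQL-parsing approach concrete, here is a deliberately crude sketch that pulls table-level lineage out of an `INSERT ... SELECT` statement. Production tools use a real SQL parser (sqlglot, sqllineage); a regex only illustrates the idea and misses CTEs, subqueries, and quoted identifiers.

```python
import re

def extract_table_lineage(sql: str):
    """Rough table-level lineage from an INSERT ... SELECT statement.
    Returns (target_table, sorted list of source tables)."""
    target = re.search(r"insert\s+into\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return (target.group(1) if target else None, sorted(set(sources)))
```

Even this toy extractor recovers the table-level edge `raw.orders feeds analytics.daily_revenue` from the earlier example; a parser-based tool additionally resolves aliases and column mappings.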
Data Quality
Governance without quality is governance on paper. Data quality frameworks continuously validate that data meets defined expectations.
Quality Dimensions
- Completeness — Are required fields populated? What percentage of rows have null values?
- Accuracy — Do values match the real-world entity they represent?
- Consistency — Do related datasets agree? Does the order count in the warehouse match the source system?
- Timeliness — Is the data fresh enough for its intended use?
- Uniqueness — Are there duplicate records that inflate counts?
- Validity — Do values conform to expected formats, ranges, and referential constraints?
Implementing Quality Checks
A quality framework typically operates in three layers:
- Schema validation — Enforced at ingestion time. Rejects records that violate type or nullability constraints.
- Statistical profiling — Runs after ingestion. Detects anomalies in distributions, cardinality, and freshness.
- Business rules — Domain-specific assertions (e.g., "revenue must be non-negative", "every order must reference a valid customer").
Tools like Great Expectations, dbt tests, Soda, and Monte Carlo automate these checks and surface failures in dashboards and alerts.
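A minimal sketch of these checks in plain Python, using invented sample rows. Frameworks like Great Expectations or dbt tests express the same assertions declaratively and handle scheduling, history, and alerting.

```python
rows = [
    {"order_id": 1, "customer_id": "c1", "revenue": 120.0},
    {"order_id": 2, "customer_id": "c2", "revenue": 80.0},
    {"order_id": 2, "customer_id": None, "revenue": -5.0},  # duplicate ID, null FK, bad value
]

def check_completeness(rows, column):
    """Completeness: fraction of rows with a non-null value."""
    populated = sum(1 for r in rows if r.get(column) is not None)
    return populated / len(rows)

def check_uniqueness(rows, column):
    """Uniqueness: no duplicate values in the column."""
    values = [r[column] for r in rows]
    return len(values) == len(set(values))

def check_business_rule(rows):
    """Domain assertion: revenue must be non-negative."""
    return all(r["revenue"] >= 0 for r in rows)

results = {
    "customer_id completeness": check_completeness(rows, "customer_id"),
    "order_id unique": check_uniqueness(rows, "order_id"),
    "revenue non-negative": check_business_rule(rows),
}
```

In practice each failing check would emit an alert and annotate the affected asset in the catalog, so consumers see quality status next to the data they discover.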
Access Control
Data governance must enforce who can see, modify, and share data.
Access Control Models
- Role-Based Access Control (RBAC) — Permissions are assigned to roles, and users are assigned to roles. Simple but coarse-grained.
- Attribute-Based Access Control (ABAC) — Policies evaluate attributes of the user, the resource, and the environment. Fine-grained but complex to manage.
- Tag-Based Access Control — Columns or tables are tagged (e.g., PII, confidential), and policies reference tags. Scales well with catalogs that auto-classify data.
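Tag-based policies reduce to a set comparison: a user may read a column only if their role covers every tag on that column. A sketch with hypothetical tags and roles:

```python
# Hypothetical catalog classifications and role grants.
COLUMN_TAGS = {
    "users.email": {"PII"},
    "users.signup_date": set(),
    "orders.card_number": {"PII", "confidential"},
}
ROLE_ALLOWED_TAGS = {
    "analyst": set(),                          # no sensitive tags granted
    "privacy_officer": {"PII", "confidential"},
}

def can_read(role: str, column: str) -> bool:
    """Allow access only if the role covers all tags on the column."""
    required = COLUMN_TAGS.get(column, set())
    return required <= ROLE_ALLOWED_TAGS.get(role, set())
```

Because policies reference tags rather than individual columns, auto-classifying a new PII column in the catalog immediately puts it under the existing policy.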
Column-Level and Row-Level Security
Modern warehouses (Snowflake, BigQuery, Databricks) support:
- Column masking — Sensitive columns are dynamically masked or tokenized based on the querying user's role.
- Row-level security — Policies filter rows so users only see data they are authorized to access (e.g., a regional manager sees only their region).
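Both mechanisms can be sketched in plain Python. The row schema and region semantics here are invented for illustration; warehouses apply the equivalent logic declaratively, inside the query engine, so it cannot be bypassed.

```python
def mask_email(value: str) -> str:
    """Dynamic masking: keep the first character and the domain."""
    local, _, domain = value.partition("@")
    return local[0] + "***@" + domain

def apply_policies(rows, user):
    """Row-level security (region filter) plus column masking (email)."""
    visible = [r for r in rows if user["region"] == "*" or r["region"] == user["region"]]
    if not user.get("can_see_pii"):
        visible = [{**r, "email": mask_email(r["email"])} for r in visible]
    return visible
```

A regional manager without PII access sees only their region's rows, with emails masked; a privacy officer with `can_see_pii` set would see the raw values.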
PII Detection and Classification
Personally Identifiable Information must be identified before it can be governed.
Detection Approaches
- Pattern matching — Regex for emails, phone numbers, SSNs, credit card numbers. Fast but limited to known formats.
- Named Entity Recognition (NER) — ML models detect names, addresses, and other entities in free-text fields.
- Metadata heuristics — Column names like email, ssn, and phone are strong signals.
- Sampling and profiling — Statistical analysis of column values to detect high-cardinality string fields that may contain PII.
Best practice: combine all four approaches and surface results in the data catalog with confidence scores. Human stewards review and confirm classifications.
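A sketch of combining two of these signals into a confidence score. The weights and name hints are arbitrary choices for illustration; a real classifier would add NER and statistical profiling, and route mid-confidence results to human stewards.

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
NAME_HINTS = {"email", "ssn", "phone", "address"}

def classify_column(name: str, sample_values) -> float:
    """Combine metadata heuristics and pattern matching into a rough
    PII confidence score in [0, 1]."""
    score = 0.0
    # Metadata heuristic: suspicious column name.
    if any(hint in name.lower() for hint in NAME_HINTS):
        score += 0.5
    # Pattern matching: fraction of sampled values that look like emails.
    matches = sum(1 for v in sample_values if EMAIL_RE.search(str(v)))
    score += 0.5 * (matches / len(sample_values)) if sample_values else 0.0
    return min(score, 1.0)
```

A column named `contact_email` whose samples all match the email pattern scores 1.0; an integer column like `order_count` scores 0.0 and needs no review.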
GDPR and CCPA Compliance
Privacy regulations impose specific obligations on how personal data is collected, stored, processed, and deleted.
Key Requirements
| Requirement | GDPR | CCPA |
|---|---|---|
| Right to access | Data subjects can request a copy of their data | Consumers can request disclosure of collected data |
| Right to deletion | "Right to be forgotten" — erase data upon request | Consumers can request deletion of personal information |
| Data portability | Provide data in a machine-readable format | Not explicitly required |
| Consent management | Explicit opt-in consent required | Opt-out model (consumers can opt out of sale) |
| Breach notification | 72-hour notification to authorities | "Reasonable" timeframe |
Architectural Patterns for Compliance
- Data inventory — Maintain a catalog of all personal data, its purpose, legal basis, and retention period.
- Consent ledger — An immutable log of consent grants and revocations, linked to data processing activities.
- Automated deletion pipelines — When a deletion request arrives, propagate it through every system that holds the subject's data. Lineage tracking identifies all downstream copies.
- Pseudonymization — Replace direct identifiers with tokens. Store the mapping in a separate, tightly controlled vault.
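The deletion-pipeline pattern is where lineage pays off directly: the lineage graph tells you every system that must process the request. A sketch, assuming a hypothetical map of downstream copies:

```python
from collections import deque

# Hypothetical lineage: each dataset -> datasets holding derived copies.
DOWNSTREAM = {
    "app_db.users": ["warehouse.dim_users", "crm.contacts"],
    "warehouse.dim_users": ["warehouse.marketing_segments"],
}

def deletion_plan(root: str):
    """Walk lineage (BFS) to list every dataset that must process
    a deletion request, starting from the system of record."""
    plan, queue, seen = [], deque([root]), {root}
    while queue:
        node = queue.popleft()
        plan.append(node)
        for ds in DOWNSTREAM.get(node, []):
            if ds not in seen:
                seen.add(ds)
                queue.append(ds)
    return plan
```

In a real pipeline each entry in the plan becomes a tracked deletion task, with completion evidence retained for auditors.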
Data Mesh Governance
In a data mesh, domain teams own their data products. Governance must be federated — each domain enforces local policies — but interoperable — a global standard ensures discoverability and compatibility.
Federated Governance Model
┌─────────────────────────────────────────────┐
│ Global Governance Council │
│ (Standards, Glossary, Compliance Policies) │
├─────────────┬─────────────┬─────────────────┤
│ Domain A │ Domain B │ Domain C │
│ Data Owner │ Data Owner │ Data Owner │
│ Local │ Local │ Local │
│ Stewards │ Stewards │ Stewards │
└─────────────┴─────────────┴─────────────────┘
- The global council defines interoperability standards: naming conventions, SLA definitions, classification taxonomies, and compliance baselines.
- Domain teams implement those standards within their data products and are accountable for quality, documentation, and access policies.
- A self-serve data platform provides shared tooling — catalog, lineage, quality checks, access control — so domains do not reinvent the wheel.
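Global standards become enforceable when the self-serve platform validates every data product against them at registration time. A sketch with invented standards (naming pattern, required fields, classification taxonomy):

```python
import re

# Hypothetical standards a global governance council might publish.
NAMING_PATTERN = re.compile(r"^[a-z]+\.[a-z_]+\.[a-z_]+$")  # domain.layer.name
REQUIRED_FIELDS = {"owner", "sla_hours", "classification"}
VALID_CLASSIFICATIONS = {"public", "internal", "confidential", "pii"}

def validate_data_product(descriptor: dict):
    """Check a domain's data product descriptor against global standards;
    return a list of violations (empty means compliant)."""
    errors = []
    if not NAMING_PATTERN.match(descriptor.get("name", "")):
        errors.append("name violates naming convention")
    missing = REQUIRED_FIELDS - descriptor.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if descriptor.get("classification") not in VALID_CLASSIFICATIONS:
        errors.append("unknown classification")
    return errors
```

Domains stay autonomous in what they build, while the platform guarantees every product is discoverable, owned, and classified the same way.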
Tools Landscape
Collibra
Enterprise-grade governance platform. Strengths: business glossary, policy management, workflow automation, regulatory compliance modules. Best for large organizations with dedicated data governance teams.
Atlan
Modern, developer-friendly catalog. Strengths: active metadata, embedded collaboration (Slack-like threads on assets), integrations with dbt, Airflow, Snowflake, and Looker. Best for data teams that want governance without heavy process overhead.
DataHub (open-sourced by LinkedIn)
Open-source metadata platform. Strengths: extensible metadata model, real-time ingestion via Kafka, GraphQL API, column-level lineage. Best for engineering-driven organizations that want full control and extensibility.
Comparison
| Capability | Collibra | Atlan | DataHub |
|---|---|---|---|
| Data catalog | Yes | Yes | Yes |
| Column-level lineage | Yes | Yes | Yes |
| Business glossary | Strong | Good | Basic |
| Data quality integration | Native | Via partners | Via plugins |
| Access control policies | Native | Tag-based | Plugin-based |
| Deployment | SaaS | SaaS | Self-hosted / Acryl Cloud |
| Pricing | Enterprise | Mid-market | Free (OSS) |
Building a Governance Program
- Start with discovery — Deploy a catalog and ingest metadata from your top 5 data sources.
- Classify sensitive data — Run PII detection and tag columns in the catalog.
- Define ownership — Assign a data owner and steward to every critical dataset.
- Automate quality — Add quality checks to your transformation pipelines (dbt tests, Great Expectations).
- Enable lineage — Instrument your orchestrator and warehouse to emit lineage events.
- Enforce access — Implement tag-based access control tied to catalog classifications.
- Monitor and iterate — Track governance adoption metrics: catalog coverage, quality score trends, access review completion rates.
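The adoption metrics in the last step are simple ratios over catalog assets. A sketch, assuming each asset is a dict of its (possibly missing) governance fields:

```python
def coverage_metrics(assets):
    """Governance adoption metrics: share of catalog assets with an owner,
    a description, and a classification tag."""
    n = len(assets)

    def share(key):
        return round(sum(1 for a in assets if a.get(key)) / n, 2)

    return {
        "ownership_coverage": share("owner"),
        "documentation_coverage": share("description"),
        "classification_coverage": share("classification"),
    }
```

Tracking these numbers per domain over time shows whether governance is actually being adopted or only mandated.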
Conclusion
Data governance is not a one-time project. It is an ongoing practice that grows with your data estate. Start with a catalog, add lineage and quality, enforce access and compliance, and let the architecture evolve as your organization's data maturity increases. The tools exist — the challenge is building the culture and processes to use them consistently.
Article #317 of the Codelit system design series. Explore all articles at codelit.io.