Data Partitioning & Sharding: A Practical Guide to Splitting Your Database
When a single database can no longer handle your data volume or throughput, you split it. Data partitioning divides a dataset into smaller, more manageable pieces. Sharding is horizontal partitioning across multiple database instances. Done right, it unlocks near-unlimited scale. Done wrong, it creates a distributed nightmare.
Horizontal vs Vertical Partitioning#
Vertical Partitioning#
Split a table by columns — move groups of columns to separate tables or databases.
users table (before):
| id | name | email | bio | avatar_url | preferences_json |
users_core:
| id | name | email |
users_profile:
| id | bio | avatar_url | preferences_json |
When to use:
- Some columns are accessed far more than others
- Large columns (blobs, JSON) slow down frequent queries
- Different access patterns for different column groups
Horizontal Partitioning (Sharding)#
Split a table by rows — distribute rows across multiple database instances.
Shard 1: users where id 1-1,000,000
Shard 2: users where id 1,000,001-2,000,000
Shard 3: users where id 2,000,001-3,000,000
When to use:
- Single database cannot handle write throughput
- Dataset exceeds single-machine storage
- Geographic distribution of data is needed
Sharding Strategies#
Hash-Based Sharding#
Apply a hash function to the shard key and mod by shard count.
shard = hash(user_id) % num_shards
hash(1001) % 4 = 2 -> Shard 2
hash(1002) % 4 = 0 -> Shard 0
hash(1003) % 4 = 3 -> Shard 3
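A minimal routing function for this scheme might look like the sketch below. It uses `zlib.crc32` because Python's built-in `hash()` is randomized per process for strings; the function name and shard count are illustrative:

```python
import zlib

def route_to_shard(key: str, num_shards: int = 4) -> int:
    # crc32 is deterministic across processes and distributes
    # typical keys evenly across the available shards.
    return zlib.crc32(key.encode()) % num_shards
```

The same key always routes to the same shard, which is exactly why changing `num_shards` remaps almost every key (the rehashing problem noted below).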
Advantages:
- Even data distribution (with a good hash function)
- Simple to implement
Disadvantages:
- Adding or removing shards requires rehashing (mitigated by consistent hashing)
- Range queries across shards are expensive
- No data locality
Range-Based Sharding#
Assign contiguous ranges of the shard key to each shard.
Shard 1: created_date 2024-01 to 2024-06
Shard 2: created_date 2024-07 to 2024-12
Shard 3: created_date 2025-01 to 2025-06
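Routing for the date ranges above can be sketched with a sorted list of range starts and a binary search; the shard names and boundaries mirror the example and are otherwise illustrative:

```python
import bisect

# Lower bound (inclusive) of each shard's range, kept sorted.
# ISO-formatted months compare correctly as plain strings.
SHARD_STARTS = ["2024-01", "2024-07", "2025-01"]
SHARDS = ["shard-1", "shard-2", "shard-3"]

def route_by_date(created_month: str) -> str:
    # Find the last range whose start <= created_month; months past
    # the final boundary fall into the newest shard.
    i = bisect.bisect_right(SHARD_STARTS, created_month) - 1
    return SHARDS[max(i, 0)]
```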
Advantages:
- Range queries are efficient (hit one or few shards)
- Easy to understand and implement
- New data goes to the latest shard
Disadvantages:
- Uneven distribution (recent shard gets all writes)
- Hot spots on the active range
- Requires periodic rebalancing
Directory-Based Sharding#
A lookup table maps each key (or key range) to a shard.
Lookup table:
| tenant_id | shard |
|-----------|----------|
| acme | shard-1 |
| globex | shard-2 |
| initech | shard-1 |
| umbrella | shard-3 |
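In code, the directory is just a key-to-shard map plus an update path. The sketch below keeps it in memory for illustration; a real deployment would back it with a replicated metadata store and cache it in the application tier:

```python
# Hypothetical in-memory directory mirroring the lookup table above.
DIRECTORY = {
    "acme": "shard-1",
    "globex": "shard-2",
    "initech": "shard-1",
    "umbrella": "shard-3",
}

def shard_for(tenant_id: str) -> str:
    return DIRECTORY[tenant_id]

def move_tenant(tenant_id: str, new_shard: str) -> None:
    # After the tenant's rows are copied to the new shard,
    # flipping this entry reroutes all subsequent traffic.
    DIRECTORY[tenant_id] = new_shard
```

The flexibility advantage is visible here: moving a tenant is a single directory update, independent of any hash function or range boundary.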
Advantages:
- Maximum flexibility — move tenants between shards freely
- No constraints on sharding logic
- Supports custom placement (compliance, performance tiers)
Disadvantages:
- Lookup table is a single point of failure
- Extra hop for every query
- Lookup table must be cached and kept consistent
Geographic Sharding#
Route data based on geographic location.
EU users -> eu-west-1 database
US users -> us-east-1 database
APAC users -> ap-southeast-1 database
Advantages:
- Low latency for users (data is nearby)
- Data residency compliance (GDPR, data sovereignty)
- Regional failure isolation
Disadvantages:
- Cross-region queries are slow and expensive
- Users who travel or relocate create edge cases
- Uneven shard sizes by region
Shard Key Selection#
The shard key determines everything. A bad shard key cannot be fixed without re-sharding (which is painful).
Criteria for a Good Shard Key#
| Criterion | Why |
|---|---|
| High cardinality | Enough distinct values to distribute evenly |
| Even distribution | No single value dominates traffic |
| Query alignment | Most queries include the shard key |
| Immutability | Key does not change (avoids cross-shard moves) |
Examples#
| Use Case | Good Shard Key | Bad Shard Key |
|---|---|---|
| Multi-tenant SaaS | tenant_id | created_date (all tenants on one shard) |
| Social media | user_id | country (US shard overloaded) |
| E-commerce orders | order_id or customer_id | status (most orders are "completed") |
| IoT telemetry | device_id | timestamp (latest shard is hot) |
Compound Shard Keys#
Combine fields for better distribution:
shard_key = hash(tenant_id + user_id)
This prevents a single large tenant from overloading one shard.
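A sketch of this compound key, again using `zlib.crc32` for deterministic hashing (the function name and shard count are illustrative):

```python
import zlib

def compound_shard(tenant_id: str, user_id: int, num_shards: int = 16) -> int:
    # Hashing tenant + user spreads one tenant's users across shards.
    # The tradeoff: per-tenant queries now fan out to many shards.
    key = f"{tenant_id}:{user_id}"
    return zlib.crc32(key.encode()) % num_shards
```

Note the tradeoff encoded in the comment: you gain distribution but give up the query-alignment property of sharding on `tenant_id` alone.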
Cross-Shard Queries#
The biggest pain point. Queries that span shards are slow and complex.
-- This is fast (single shard):
SELECT * FROM orders WHERE customer_id = 42;
-- This is slow (scatter-gather across all shards):
SELECT COUNT(*) FROM orders WHERE status = 'pending';
Strategies#
- Denormalize — duplicate data so queries stay within one shard
- Scatter-gather — query all shards in parallel, merge results (acceptable for analytics)
- Materialized views — pre-aggregate cross-shard data into a read-optimized store
- CQRS — separate read and write models; reads from denormalized projections
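The scatter-gather pattern can be sketched as a parallel fan-out and merge. The per-shard function below stands in for a real database driver call; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def count_pending(shard_rows: list[dict]) -> int:
    # Stand-in for running the COUNT query on one shard.
    return sum(1 for row in shard_rows if row["status"] == "pending")

def scatter_gather_count(shards: list[list[dict]]) -> int:
    # Fan out to every shard in parallel, then merge partial counts.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        return sum(pool.map(count_pending, shards))
```

Latency is bounded by the slowest shard rather than the sum, but every shard still does work, which is why this stays in the analytics bucket.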
Rebalancing#
As data grows, shards become uneven. Rebalancing moves data between shards.
Approaches#
Manual rebalancing: An operator decides when and how to move data. Maximum control, maximum effort.
Automatic rebalancing: The system detects imbalance and migrates data. Used by MongoDB, CockroachDB, Vitess.
Before: Shard A (80 GB) | Shard B (20 GB)
Detect: Imbalance ratio > 2:1
Migrate: Move 30 GB of ranges from A to B
After: Shard A (50 GB) | Shard B (50 GB)
Key concern: Rebalancing must happen without downtime. Techniques include:
- Background data copying with change-data-capture
- Double-write during migration
- Atomic cutover once sync is confirmed
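The detection step from the example above reduces to a simple ratio check; the threshold and function name are illustrative:

```python
def needs_rebalance(shard_sizes_gb: dict[str, float], max_ratio: float = 2.0) -> bool:
    # Flag imbalance when the largest shard exceeds
    # max_ratio times the smallest one.
    sizes = shard_sizes_gb.values()
    return max(sizes) > max_ratio * min(sizes)
```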
Hot Spots#
A hot spot occurs when one shard receives disproportionate traffic.
Common causes:
- Celebrity user with millions of followers
- Flash sale on one product
- Time-based sharding where the current shard absorbs all writes
Mitigations:
- Add a random suffix to hot keys to spread load: user_123_a, user_123_b
- Use consistent hashing with virtual nodes for finer distribution
- Move hot tenants to dedicated shards (directory-based)
- Cache aggressively in front of the hot shard
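The suffix technique can be sketched as follows: writes pick a random suffix so the hot key's traffic spreads over several physical keys, while reads fan out to every suffixed copy and merge. The fan-out factor and names are illustrative:

```python
import random

SUFFIXES = "abcd"  # illustrative fan-out factor of 4

def write_key(user_id: str) -> str:
    # Each write lands on one of several suffixed copies,
    # so no single physical key absorbs all the traffic.
    return f"{user_id}_{random.choice(SUFFIXES)}"

def read_keys(user_id: str) -> list[str]:
    # Reads must query every suffixed copy and merge the results.
    return [f"{user_id}_{s}" for s in SUFFIXES]
```

The cost is symmetric: write load drops by the fan-out factor, read fan-out grows by the same factor, so this fits write-heavy hot keys best.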
Tools#
Vitess#
Originally built at YouTube to scale MySQL. Now a CNCF project.
- Transparent sharding layer over MySQL
- Handles shard routing, connection pooling, query rewriting
- Supports resharding with zero downtime
- Used by Slack, GitHub, HubSpot
Citus (PostgreSQL)#
Distributed PostgreSQL — sharding as an extension.
-- Distribute a table by tenant_id
SELECT create_distributed_table('orders', 'tenant_id');
-- Queries automatically route to the right shard
SELECT * FROM orders WHERE tenant_id = 'acme';
- Columnar storage for analytics
- Reference tables replicated to all nodes
- Transparent to application code
Other Tools#
| Tool | Database | Approach |
|---|---|---|
| ProxySQL | MySQL | Query routing and splitting |
| ShardingSphere | Any JDBC | Application-layer sharding |
| CockroachDB | Native | Auto-sharding, auto-rebalancing |
| YugabyteDB | Native | Distributed SQL with auto-sharding |
Key Takeaways#
- Vertical partitioning splits columns; horizontal partitioning (sharding) splits rows
- Choose a shard key with high cardinality, even distribution, and query alignment
- Hash sharding gives even distribution; range sharding enables efficient range queries
- Cross-shard queries are expensive — design your schema to minimize them
- Rebalancing is inevitable — pick tools that handle it without downtime
- Start with read replicas and caching before reaching for sharding
Article #249 in the System Design series. Keep building: codelit.io/blog.