Data Partitioning & Sharding: A Practical Guide to Splitting Your Database
When a single database can no longer handle your data volume or throughput, you split it. Data partitioning divides a dataset into smaller, more manageable pieces. Sharding is horizontal partitioning across multiple database instances. Done right, it unlocks near-unlimited scale. Done wrong, it creates a distributed nightmare.
Horizontal vs Vertical Partitioning#
Vertical Partitioning#
Split a table by columns — move groups of columns to separate tables or databases.
users table (before):
| id | name | email | bio | avatar_url | preferences_json |
users_core:
| id | name | email |
users_profile:
| id | bio | avatar_url | preferences_json |
When to use:
- Some columns are accessed far more than others
- Large columns (blobs, JSON) slow down frequent queries
- Different access patterns for different column groups
Horizontal Partitioning (Sharding)#
Split a table by rows — distribute rows across multiple database instances.
Shard 1: users where id 1-1,000,000
Shard 2: users where id 1,000,001-2,000,000
Shard 3: users where id 2,000,001-3,000,000
When to use:
- Single database cannot handle write throughput
- Dataset exceeds single-machine storage
- Geographic distribution of data is needed
Sharding Strategies#
Hash-Based Sharding#
Apply a hash function to the shard key and mod by shard count.
shard = hash(user_id) % num_shards
hash(1001) % 4 = 2 -> Shard 2
hash(1002) % 4 = 0 -> Shard 0
hash(1003) % 4 = 3 -> Shard 3
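A minimal routing function for this scheme might look like the sketch below. It uses `zlib.crc32` because Python's built-in `hash()` is randomized per process for strings; the function name and shard count are illustrative:

```python
import zlib

def route_to_shard(key: str, num_shards: int = 4) -> int:
    # crc32 is deterministic across processes and distributes
    # typical keys evenly across the available shards.
    return zlib.crc32(key.encode()) % num_shards
```

The same key always routes to the same shard, which is exactly why changing `num_shards` remaps almost every key (the rehashing problem noted below).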
Advantages:
- Even data distribution (with a good hash function)
- Simple to implement
Disadvantages:
- Adding or removing shards requires rehashing (mitigated by consistent hashing)
- Range queries across shards are expensive
- No data locality
Range-Based Sharding#
Assign contiguous ranges of the shard key to each shard.
Shard 1: created_date 2024-01 to 2024-06
Shard 2: created_date 2024-07 to 2024-12
Shard 3: created_date 2025-01 to 2025-06
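Routing for the date ranges above can be sketched with a sorted list of range starts and a binary search; the shard names and boundaries mirror the example and are otherwise illustrative:

```python
import bisect

# Lower bound (inclusive) of each shard's range, kept sorted.
# ISO-formatted months compare correctly as plain strings.
SHARD_STARTS = ["2024-01", "2024-07", "2025-01"]
SHARDS = ["shard-1", "shard-2", "shard-3"]

def route_by_date(created_month: str) -> str:
    # Find the last range whose start <= created_month; months past
    # the final boundary fall into the newest shard.
    i = bisect.bisect_right(SHARD_STARTS, created_month) - 1
    return SHARDS[max(i, 0)]
```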
Advantages:
- Range queries are efficient (hit one or few shards)
- Easy to understand and implement
- New data goes to the latest shard
Disadvantages:
- Uneven distribution (recent shard gets all writes)
- Hot spots on the active range
- Requires periodic rebalancing
Directory-Based Sharding#
A lookup table maps each key (or key range) to a shard.
Lookup table:
| tenant_id | shard |
|-----------|----------|
| acme | shard-1 |
| globex | shard-2 |
| initech | shard-1 |
| umbrella | shard-3 |
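In code, the directory is just a key-to-shard map plus an update path. The sketch below keeps it in memory for illustration; a real deployment would back it with a replicated metadata store and cache it in the application tier:

```python
# Hypothetical in-memory directory mirroring the lookup table above.
DIRECTORY = {
    "acme": "shard-1",
    "globex": "shard-2",
    "initech": "shard-1",
    "umbrella": "shard-3",
}

def shard_for(tenant_id: str) -> str:
    return DIRECTORY[tenant_id]

def move_tenant(tenant_id: str, new_shard: str) -> None:
    # After the tenant's rows are copied to the new shard,
    # flipping this entry reroutes all subsequent traffic.
    DIRECTORY[tenant_id] = new_shard
```

The flexibility advantage is visible here: moving a tenant is a single directory update, independent of any hash function or range boundary.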
Advantages:
- Maximum flexibility — move tenants between shards freely
- No constraints on sharding logic
- Supports custom placement (compliance, performance tiers)
Disadvantages:
- Lookup table is a single point of failure
- Extra hop for every query
- Lookup table must be cached and kept consistent
Geographic Sharding#
Route data based on geographic location.
EU users -> eu-west-1 database
US users -> us-east-1 database
APAC users -> ap-southeast-1 database
Advantages:
- Low latency for users (data is nearby)
- Data residency compliance (GDPR, data sovereignty)
- Regional failure isolation
Disadvantages:
- Cross-region queries are slow and expensive
- Users who travel or relocate create edge cases
- Uneven shard sizes by region
Shard Key Selection#
The shard key determines everything. A bad shard key cannot be fixed without re-sharding (which is painful).
Criteria for a Good Shard Key#
| Criterion | Why |
|---|---|
| High cardinality | Enough distinct values to distribute evenly |
| Even distribution | No single value dominates traffic |
| Query alignment | Most queries include the shard key |
| Immutability | Key does not change (avoids cross-shard moves) |
Examples#
| Use Case | Good Shard Key | Bad Shard Key |
|---|---|---|
| Multi-tenant SaaS | tenant_id | created_date (all tenants on one shard) |
| Social media | user_id | country (US shard overloaded) |
| E-commerce orders | order_id or customer_id | status (most orders are "completed") |
| IoT telemetry | device_id | timestamp (latest shard is hot) |
Compound Shard Keys#
Combine fields for better distribution:
shard_key = hash(tenant_id + user_id)
This prevents a single large tenant from overloading one shard.
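A sketch of this compound key, again using `zlib.crc32` for deterministic hashing (the function name and shard count are illustrative):

```python
import zlib

def compound_shard(tenant_id: str, user_id: int, num_shards: int = 16) -> int:
    # Hashing tenant + user spreads one tenant's users across shards.
    # The tradeoff: per-tenant queries now fan out to many shards.
    key = f"{tenant_id}:{user_id}"
    return zlib.crc32(key.encode()) % num_shards
```

Note the tradeoff encoded in the comment: you gain distribution but give up the query-alignment property of sharding on `tenant_id` alone.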
Cross-Shard Queries#
The biggest pain point. Queries that span shards are slow and complex.
-- This is fast (single shard):
SELECT * FROM orders WHERE customer_id = 42;
-- This is slow (scatter-gather across all shards):
SELECT COUNT(*) FROM orders WHERE status = 'pending';
Strategies#
- Denormalize — duplicate data so queries stay within one shard
- Scatter-gather — query all shards in parallel, merge results (acceptable for analytics)
- Materialized views — pre-aggregate cross-shard data into a read-optimized store
- CQRS — separate read and write models; reads from denormalized projections
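The scatter-gather pattern can be sketched as a parallel fan-out and merge. The per-shard function below stands in for a real database driver call; all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def count_pending(shard_rows: list[dict]) -> int:
    # Stand-in for running the COUNT query on one shard.
    return sum(1 for row in shard_rows if row["status"] == "pending")

def scatter_gather_count(shards: list[list[dict]]) -> int:
    # Fan out to every shard in parallel, then merge partial counts.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        return sum(pool.map(count_pending, shards))
```

Latency is bounded by the slowest shard rather than the sum, but every shard still does work, which is why this stays in the analytics bucket.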
Rebalancing#
As data grows, shards become uneven. Rebalancing moves data between shards.
Approaches#
Manual rebalancing: An operator decides when and how to move data. Maximum control, maximum effort.
Automatic rebalancing: The system detects imbalance and migrates data. Used by MongoDB, CockroachDB, Vitess.
Before: Shard A (80 GB) | Shard B (20 GB)
Detect: Imbalance ratio > 2:1
Migrate: Move 30 GB of ranges from A to B
After: Shard A (50 GB) | Shard B (50 GB)
Key concern: Rebalancing must happen without downtime. Techniques include:
- Background data copying with change-data-capture
- Double-write during migration
- Atomic cutover once sync is confirmed
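The detection step from the example above reduces to a simple ratio check; the threshold and function name are illustrative:

```python
def needs_rebalance(shard_sizes_gb: dict[str, float], max_ratio: float = 2.0) -> bool:
    # Flag imbalance when the largest shard exceeds
    # max_ratio times the smallest one.
    sizes = shard_sizes_gb.values()
    return max(sizes) > max_ratio * min(sizes)
```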
Hot Spots#
A hot spot occurs when one shard receives disproportionate traffic.
Common causes:
- Celebrity user with millions of followers
- Flash sale on one product
- Time-based sharding where the current shard absorbs all writes
Mitigations:
- Add a random suffix to hot keys to spread load: user_123_a, user_123_b
- Use consistent hashing with virtual nodes for finer distribution
- Move hot tenants to dedicated shards (directory-based)
- Cache aggressively in front of the hot shard
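The suffix technique can be sketched as follows: writes pick a random suffix so the hot key's traffic spreads over several physical keys, while reads fan out to every suffixed copy and merge. The fan-out factor and names are illustrative:

```python
import random

SUFFIXES = "abcd"  # illustrative fan-out factor of 4

def write_key(user_id: str) -> str:
    # Each write lands on one of several suffixed copies,
    # so no single physical key absorbs all the traffic.
    return f"{user_id}_{random.choice(SUFFIXES)}"

def read_keys(user_id: str) -> list[str]:
    # Reads must query every suffixed copy and merge the results.
    return [f"{user_id}_{s}" for s in SUFFIXES]
```

The cost is symmetric: write load drops by the fan-out factor, read fan-out grows by the same factor, so this fits write-heavy hot keys best.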
Tools#
Vitess#
Originally built at YouTube to scale MySQL. Now a CNCF project.
- Transparent sharding layer over MySQL
- Handles shard routing, connection pooling, query rewriting
- Supports resharding with zero downtime
- Used by Slack, GitHub, HubSpot
Citus (PostgreSQL)#
Distributed PostgreSQL — sharding as an extension.
-- Distribute a table by tenant_id
SELECT create_distributed_table('orders', 'tenant_id');
-- Queries automatically route to the right shard
SELECT * FROM orders WHERE tenant_id = 'acme';
- Columnar storage for analytics
- Reference tables replicated to all nodes
- Transparent to application code
Other Tools#
| Tool | Database | Approach |
|---|---|---|
| ProxySQL | MySQL | Query routing and splitting |
| ShardingSphere | Any JDBC | Application-layer sharding |
| CockroachDB | Native | Auto-sharding, auto-rebalancing |
| YugabyteDB | Native | Distributed SQL with auto-sharding |
Key Takeaways#
- Vertical partitioning splits columns; horizontal partitioning (sharding) splits rows
- Choose a shard key with high cardinality, even distribution, and query alignment
- Hash sharding gives even distribution; range sharding enables efficient range queries
- Cross-shard queries are expensive — design your schema to minimize them
- Rebalancing is inevitable — pick tools that handle it without downtime
- Start with read replicas and caching before reaching for sharding
Article #249 in the System Design series. Keep building: codelit.io/blog.