# Top 30 System Design Best Practices: The Definitive Reference
System design interviews and production architectures share the same foundation: a set of principles that separate fragile systems from resilient ones. This article distills 30 best practices across seven categories — scalability, reliability, observability, security, data management, API design, and deployment — into a single reference you can revisit before any design session.
## Scalability
### 1. Design for Horizontal Scaling
Scale out, not up. Stateless services behind a load balancer can add capacity linearly. Move session state to an external store (Redis, DynamoDB) so any instance can serve any request.
### 2. Partition Data Early
Choose a partition key that distributes load evenly and aligns with your access patterns. Re-sharding a live system is one of the hardest operational tasks in distributed systems — get the key right from the start.
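As a concrete illustration of a stable partition key, here is a minimal Python sketch of hash partitioning (the shard count and key format are illustrative assumptions, not from the article):

```python
import hashlib

# Hash partitioning sketch: a stable hash of the partition key maps each
# record to one of N shards. The hash must be consistent across processes
# and restarts, so we use SHA-1 rather than Python's randomized hash().

NUM_SHARDS = 8

def shard_for(partition_key):
    digest = hashlib.sha1(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Note that changing `NUM_SHARDS` remaps almost every key, which is exactly why re-sharding a live system is so painful; consistent hashing reduces the fraction of keys that move.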
### 3. Use Caching at Every Layer
Cache DNS responses, CDN edge content, application query results, and database query plans. Each layer reduces latency and offloads the layer below it. Invalidate explicitly when data changes rather than relying solely on TTLs.
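The cache-aside pattern with explicit invalidation can be sketched in a few lines of Python (the in-process dicts stand in for a real store such as Redis backed by a database; the key names are illustrative):

```python
# Cache-aside with explicit invalidation.
db = {"user:1": {"name": "Ada"}}    # stand-in for the database
cache = {}                          # stand-in for Redis/memcached

def get_user(key):
    if key in cache:                # cache hit: skip the database
        return cache[key]
    value = db.get(key)             # cache miss: read through
    if value is not None:
        cache[key] = value
    return value

def update_user(key, value):
    db[key] = value                 # write to the source of truth first
    cache.pop(key, None)            # then invalidate explicitly,
                                    # rather than waiting for a TTL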
### 4. Apply Back-Pressure
When a service is overwhelmed, it should reject or slow down incoming requests rather than accepting more work than it can process. Back-pressure prevents cascading failures and preserves quality of service for accepted requests.
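One simple way to apply back-pressure is a bounded buffer that rejects work immediately when full, as in this sketch (the queue size and return values are illustrative):

```python
import queue

# Back-pressure via a bounded queue: when the buffer is full, reject
# immediately instead of accepting work the service cannot finish.
work = queue.Queue(maxsize=2)

def submit(job):
    try:
        work.put_nowait(job)
        return "accepted"
    except queue.Full:
        return "rejected"   # caller should back off or shed load
```

In an HTTP service, the "rejected" branch would typically return 503 so the load balancer and clients can route or retry elsewhere.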
### 5. Prefer Asynchronous Communication
Decouple producers from consumers with message queues or event streams. Async communication absorbs traffic spikes, enables independent scaling, and improves fault isolation.
## Reliability
### 6. Embrace Redundancy
No single point of failure should exist in the critical path. Run at least two instances of every service, replicate databases across availability zones, and use multi-region DNS failover for global services.
### 7. Implement Circuit Breakers
When a downstream dependency fails, a circuit breaker stops sending requests after a threshold of failures. This prevents thread pool exhaustion and gives the failing service time to recover.
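A minimal circuit breaker can be sketched as follows (thresholds and the half-open behavior are one common variant, not the only one):

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive failures;
    after `reset_after` seconds, allow one trial call (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success closes the circuit
        return result
```

Failing fast while open is the key property: callers get an immediate error instead of tying up threads waiting on a dead dependency.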
### 8. Set Timeouts and Retries with Backoff
Every network call needs a timeout. Retries should use exponential backoff with jitter to avoid thundering herds. Cap the total retry duration to prevent requests from hanging indefinitely.
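Exponential backoff with full jitter can be sketched like this (the attempt count, base delay, and cap are illustrative defaults; `sleep` is injectable so the logic is testable):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `fn`, retrying on exception with full-jitter exponential
    backoff. The delay cap bounds total wait per attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                     # retry budget exhausted: give up
            # full jitter: uniform in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)                  # spreads retries, avoiding thundering herds
```

Full jitter (random delay up to the exponential ceiling) spreads retries from many clients across time instead of synchronizing them into waves.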
### 9. Use Bulkheads to Isolate Failures
Partition resources — thread pools, connection pools, rate limits — by tenant or by function. A misbehaving tenant or a slow endpoint should not consume resources needed by the rest of the system.
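Per-tenant bulkheads can be approximated with bounded semaphores, as in this sketch (tenant names and pool sizes are illustrative):

```python
import threading

# Bulkhead sketch: one bounded semaphore per tenant, so a noisy tenant
# can exhaust only its own slots, never the shared pool.
bulkheads = {
    "tenant_a": threading.BoundedSemaphore(2),
    "tenant_b": threading.BoundedSemaphore(2),
}

def try_handle(tenant, handler):
    sem = bulkheads[tenant]
    if not sem.acquire(blocking=False):
        return "rejected"            # this tenant's slots are full
    try:
        return handler()
    finally:
        sem.release()
```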
### 10. Practice Chaos Engineering
Regularly inject failures in production (or staging) to verify that your redundancy, circuit breakers, and failover mechanisms actually work. Tools like Chaos Monkey, Litmus, and Gremlin automate this.
## Observability
### 11. Instrument the Three Pillars
Collect metrics (counters, gauges, histograms), logs (structured, with correlation IDs), and traces (distributed, end-to-end). Each pillar answers a different question: metrics tell you something is wrong, logs tell you what happened, traces tell you where it happened.
### 12. Define SLIs, SLOs, and Error Budgets
A Service Level Indicator (SLI) is a measurable metric (e.g., p99 latency). A Service Level Objective (SLO) is the target for that metric (e.g., p99 latency below 200 ms). The error budget is the allowed deviation. When the budget is exhausted, prioritize reliability over features.
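The arithmetic behind an error budget is straightforward; for an availability SLO, a 30-day window works out as follows (the window length is an illustrative choice):

```python
# Error-budget arithmetic: a 99.9% availability SLO over a 30-day
# window leaves 0.1% of the window's minutes as the budget.

def error_budget_minutes(slo, window_days=30):
    total_minutes = window_days * 24 * 60      # 43,200 minutes in 30 days
    return (1 - slo) * total_minutes

# 99.9% over 30 days allows roughly 43.2 minutes of unavailability
```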
### 13. Alert on Symptoms, Not Causes
Alert when the user experience degrades (high error rate, elevated latency), not when an internal metric crosses a threshold (CPU at 80%). Symptom-based alerting reduces noise and surfaces issues that actually matter.
### 14. Use Structured Logging
Emit logs as JSON with consistent fields: timestamp, level, service, trace_id, message. Structured logs are searchable, filterable, and machine-parseable — unstructured strings are not.
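A minimal structured-logging helper might look like this (the field set mirrors the list above; extra fields such as `trace_id` pass through as keyword arguments):

```python
import json
import time

def log(level, service, message, **fields):
    """Emit one JSON log line with consistent top-level fields."""
    record = {
        "timestamp": time.time(),
        "level": level,
        "service": service,
        "message": message,
        **fields,                      # e.g. trace_id, user_id
    }
    return json.dumps(record)
```

Because every line is valid JSON with stable field names, a log pipeline can filter on `service` or join on `trace_id` without regex guesswork.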
### 15. Build Dashboards for Each Service
Every service should have a dashboard showing the RED metrics (Rate, Errors, Duration) and the USE metrics (Utilization, Saturation, Errors) for its dependencies. Dashboards should answer "is this service healthy?" in under 10 seconds.
## Security
### 16. Apply the Principle of Least Privilege
Every service, user, and process should have the minimum permissions required. Use short-lived credentials, scoped IAM roles, and just-in-time access grants. Broad permissions are the root cause of most privilege escalation incidents.
### 17. Encrypt Data in Transit and at Rest
Use TLS 1.3 for all service-to-service communication. Encrypt data at rest using platform-managed keys (KMS) with automatic rotation. Never store secrets in code or environment variables — use a secrets manager.
### 18. Validate All Input
Every input from an external source is untrusted. Validate type, length, format, and range on the server side regardless of client-side checks. Use parameterized queries to prevent SQL injection and sanitize output to prevent XSS.
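Parameterized queries are the standard defense against SQL injection; this sketch uses Python's built-in `sqlite3` driver (the table and data are illustrative):

```python
import sqlite3

# Parameterized query sketch: the driver binds the value separately from
# the SQL text, so attacker-controlled input cannot alter query structure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

def find_user(name):
    # `?` placeholder: never build SQL with string concatenation
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

A classic injection payload like `alice' OR '1'='1` is simply treated as a literal (and non-matching) name, because it never reaches the SQL parser as code.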
### 19. Implement Rate Limiting and Throttling
Protect APIs from abuse with rate limits per client, per endpoint, and globally. Use token bucket or sliding window algorithms. Return 429 Too Many Requests with a Retry-After header so well-behaved clients can back off.
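The token bucket algorithm mentioned above can be sketched as follows (the clock is passed in explicitly, an illustrative choice that keeps the logic deterministic and testable):

```python
class TokenBucket:
    """Token-bucket rate limiter: refills `rate` tokens per second up to
    `capacity`; each request consumes one token."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now               # timestamp of the last refill

    def allow(self, now):
        elapsed = now - self.last
        self.last = now
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # caller returns 429 with Retry-After
```

The `capacity` parameter is what allows short bursts above the steady rate, which is why token bucket is usually preferred over a fixed window for API limits.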
### 20. Adopt Zero Trust Networking
Do not rely on network perimeter security. Authenticate and authorize every request, even between internal services. Use mutual TLS (mTLS), service mesh identity, or JWT-based service-to-service authentication.
## Data Management
### 21. Choose the Right Database for the Access Pattern
Relational databases excel at complex queries and transactions. Document stores handle flexible schemas. Key-value stores deliver sub-millisecond reads. Time-series databases optimize for append-heavy, time-ordered data. Match the storage engine to the workload.
### 22. Separate Reads from Writes (CQRS)
When read and write patterns diverge significantly, split them. A write-optimized store handles commands while read-optimized projections (materialized views, search indices) serve queries. Event sourcing pairs naturally with CQRS.
### 23. Design for Eventual Consistency
Strong consistency across distributed systems is expensive and slow. Accept eventual consistency where the business allows it, and use techniques like read-your-writes, causal consistency, or conflict-free replicated data types (CRDTs) where stronger guarantees are needed.
### 24. Implement Idempotent Operations
Network retries and message redelivery mean operations will be executed more than once. Design writes to be idempotent using idempotency keys, conditional writes, or deduplication at the consumer.
### 25. Plan for Data Migration from Day One
Schemas evolve. Use additive-only schema changes (add columns, never rename or remove in the same release). Version your data formats. Maintain backward compatibility for at least one release cycle.
## API Design
### 26. Version Your APIs
Use URL path versioning (/v1/users) or header-based versioning. Never break existing clients with a deploy. Deprecate old versions with a published timeline and migration guide.
### 27. Use Pagination for List Endpoints
Never return unbounded result sets. Use cursor-based pagination (opaque token pointing to the next page) rather than offset-based pagination (which degrades with large offsets). Include next_cursor and has_more in the response.
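Cursor-based pagination with the `next_cursor` and `has_more` fields described above can be sketched like this (the dataset, page size, and cursor encoding are illustrative; a production cursor would also be signed or encrypted):

```python
import base64
import json

ITEMS = [{"id": i} for i in range(1, 8)]   # stand-in table, sorted by id

def encode_cursor(last_id):
    # opaque token: clients must not parse or construct it themselves
    return base64.urlsafe_b64encode(
        json.dumps({"after": last_id}).encode()
    ).decode()

def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor))["after"]

def list_items(limit=3, cursor=None):
    after = decode_cursor(cursor) if cursor else 0
    page = [item for item in ITEMS if item["id"] > after][:limit]
    has_more = bool(page) and page[-1]["id"] < ITEMS[-1]["id"]
    return {
        "items": page,
        "next_cursor": encode_cursor(page[-1]["id"]) if has_more else None,
        "has_more": has_more,
    }
```

Because each page resumes from the last-seen sort key rather than an offset, query cost stays constant no matter how deep the client pages.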
### 28. Design Idempotent APIs
POST requests that create resources should accept an Idempotency-Key header. If the server receives a duplicate key, it returns the original response without creating a duplicate. This makes retries safe for clients.
## Deployment
### 29. Use Progressive Rollouts
Deploy changes to a small percentage of traffic first (canary), monitor error rates and latency, then gradually increase. Feature flags decouple deployment from release, allowing you to ship code without exposing it to users until it is validated.
### 30. Automate Everything in the Deploy Pipeline
From linting to testing to building to deploying — every step should be automated and reproducible. Infrastructure as code (Terraform, Pulumi) ensures environments are consistent. GitOps (ArgoCD, Flux) ensures the deployed state matches the declared state in version control.
## Quick Reference Table
| # | Practice | Category |
|---|---|---|
| 1 | Horizontal scaling | Scalability |
| 2 | Data partitioning | Scalability |
| 3 | Multi-layer caching | Scalability |
| 4 | Back-pressure | Scalability |
| 5 | Async communication | Scalability |
| 6 | Redundancy | Reliability |
| 7 | Circuit breakers | Reliability |
| 8 | Timeouts and retries | Reliability |
| 9 | Bulkheads | Reliability |
| 10 | Chaos engineering | Reliability |
| 11 | Three pillars of observability | Observability |
| 12 | SLIs, SLOs, error budgets | Observability |
| 13 | Symptom-based alerting | Observability |
| 14 | Structured logging | Observability |
| 15 | Service dashboards | Observability |
| 16 | Least privilege | Security |
| 17 | Encryption everywhere | Security |
| 18 | Input validation | Security |
| 19 | Rate limiting | Security |
| 20 | Zero trust networking | Security |
| 21 | Right database for the job | Data Management |
| 22 | CQRS | Data Management |
| 23 | Eventual consistency | Data Management |
| 24 | Idempotent operations | Data Management |
| 25 | Schema migration planning | Data Management |
| 26 | API versioning | API Design |
| 27 | Cursor-based pagination | API Design |
| 28 | Idempotent APIs | API Design |
| 29 | Progressive rollouts | Deployment |
| 30 | Automated deploy pipelines | Deployment |
## Conclusion
These 30 practices are not theoretical — they are battle-tested patterns used by engineering teams at every scale. No system needs all 30 on day one. Start with the basics (timeouts, retries, structured logging, least privilege), then layer in more sophisticated patterns (CQRS, chaos engineering, zero trust) as the system grows. Revisit this list before every design review to make sure you have not overlooked a critical dimension.
Article #318 of the Codelit system design series. Explore all articles at codelit.io.