# Top 30 System Design Best Practices: The Definitive Reference
System design interviews and production architectures share the same foundation: a set of principles that separate fragile systems from resilient ones. This article distills 30 best practices across seven categories — scalability, reliability, observability, security, data management, API design, and deployment — into a single reference you can revisit before any design session.
## Scalability
### 1. Design for Horizontal Scaling
Scale out, not up. Stateless services behind a load balancer can add capacity linearly. Move session state to an external store (Redis, DynamoDB) so any instance can serve any request.
### 2. Partition Data Early
Choose a partition key that distributes load evenly and aligns with your access patterns. Re-sharding a live system is one of the hardest operational tasks in distributed systems — get the key right from the start.
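As a concrete illustration of a stable partition key, here is a minimal Python sketch of hash partitioning (the shard count and key format are illustrative assumptions, not from the article):

```python
import hashlib

# Hash partitioning sketch: a stable hash of the partition key maps each
# record to one of N shards. The hash must be consistent across processes
# and restarts, so we use SHA-1 rather than Python's randomized hash().

NUM_SHARDS = 8

def shard_for(partition_key):
    digest = hashlib.sha1(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Note that changing `NUM_SHARDS` remaps almost every key, which is exactly why re-sharding a live system is so painful; consistent hashing reduces the fraction of keys that move.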
### 3. Use Caching at Every Layer
Cache DNS responses, CDN edge content, application query results, and database query plans. Each layer reduces latency and offloads the layer below it. Invalidate explicitly when data changes rather than relying solely on TTLs.
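The cache-aside pattern with explicit invalidation can be sketched in a few lines of Python (the in-process dicts stand in for a real store such as Redis backed by a database; the key names are illustrative):

```python
# Cache-aside with explicit invalidation.
db = {"user:1": {"name": "Ada"}}    # stand-in for the database
cache = {}                          # stand-in for Redis/memcached

def get_user(key):
    if key in cache:                # cache hit: skip the database
        return cache[key]
    value = db.get(key)             # cache miss: read through
    if value is not None:
        cache[key] = value
    return value

def update_user(key, value):
    db[key] = value                 # write to the source of truth first
    cache.pop(key, None)            # then invalidate explicitly,
                                    # rather than waiting for a TTL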
### 4. Apply Back-Pressure
When a service is overwhelmed, it should reject or slow down incoming requests rather than accepting more work than it can process. Back-pressure prevents cascading failures and preserves quality of service for accepted requests.
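One simple way to apply back-pressure is a bounded buffer that rejects work immediately when full, as in this sketch (the queue size and return values are illustrative):

```python
import queue

# Back-pressure via a bounded queue: when the buffer is full, reject
# immediately instead of accepting work the service cannot finish.
work = queue.Queue(maxsize=2)

def submit(job):
    try:
        work.put_nowait(job)
        return "accepted"
    except queue.Full:
        return "rejected"   # caller should back off or shed load
```

In an HTTP service, the "rejected" branch would typically return 503 so the load balancer and clients can route or retry elsewhere.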
### 5. Prefer Asynchronous Communication
Decouple producers from consumers with message queues or event streams. Async communication absorbs traffic spikes, enables independent scaling, and improves fault isolation.
## Reliability
### 6. Embrace Redundancy
No single point of failure should exist in the critical path. Run at least two instances of every service, replicate databases across availability zones, and use multi-region DNS failover for global services.
### 7. Implement Circuit Breakers
When a downstream dependency fails, a circuit breaker stops sending requests after a threshold of failures. This prevents thread pool exhaustion and gives the failing service time to recover.
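A minimal circuit breaker can be sketched as follows (thresholds and the half-open behavior are one common variant, not the only one):

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive failures;
    after `reset_after` seconds, allow one trial call (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None          # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # success closes the circuit
        return result
```

Failing fast while open is the key property: callers get an immediate error instead of tying up threads waiting on a dead dependency.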
### 8. Set Timeouts and Retries with Backoff
Every network call needs a timeout. Retries should use exponential backoff with jitter to avoid thundering herds. Cap the total retry duration to prevent requests from hanging indefinitely.
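Exponential backoff with full jitter can be sketched like this (the attempt count, base delay, and cap are illustrative defaults; `sleep` is injectable so the logic is testable):

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call `fn`, retrying on exception with full-jitter exponential
    backoff. The delay cap bounds total wait per attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise                     # retry budget exhausted: give up
            # full jitter: uniform in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            sleep(delay)                  # spreads retries, avoiding thundering herds
```

Full jitter (random delay up to the exponential ceiling) spreads retries from many clients across time instead of synchronizing them into waves.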
### 9. Use Bulkheads to Isolate Failures
Partition resources — thread pools, connection pools, rate limits — by tenant or by function. A misbehaving tenant or a slow endpoint should not consume resources needed by the rest of the system.
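Per-tenant bulkheads can be approximated with bounded semaphores, as in this sketch (tenant names and pool sizes are illustrative):

```python
import threading

# Bulkhead sketch: one bounded semaphore per tenant, so a noisy tenant
# can exhaust only its own slots, never the shared pool.
bulkheads = {
    "tenant_a": threading.BoundedSemaphore(2),
    "tenant_b": threading.BoundedSemaphore(2),
}

def try_handle(tenant, handler):
    sem = bulkheads[tenant]
    if not sem.acquire(blocking=False):
        return "rejected"            # this tenant's slots are full
    try:
        return handler()
    finally:
        sem.release()
```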
### 10. Practice Chaos Engineering
Regularly inject failures in production (or staging) to verify that your redundancy, circuit breakers, and failover mechanisms actually work. Tools like Chaos Monkey, Litmus, and Gremlin automate this.
## Observability
### 11. Instrument the Three Pillars
Collect metrics (counters, gauges, histograms), logs (structured, with correlation IDs), and traces (distributed, end-to-end). Each pillar answers a different question: metrics tell you something is wrong, logs tell you what happened, traces tell you where it happened.
### 12. Define SLIs, SLOs, and Error Budgets
A Service Level Indicator (SLI) is a measurable metric (e.g., p99 latency). A Service Level Objective (SLO) is the target for that metric (e.g., p99 latency below 200 ms). The error budget is the allowed deviation. When the budget is exhausted, prioritize reliability over features.
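The arithmetic behind an error budget is straightforward; for an availability SLO, a 30-day window works out as follows (the window length is an illustrative choice):

```python
# Error-budget arithmetic: a 99.9% availability SLO over a 30-day
# window leaves 0.1% of the window's minutes as the budget.

def error_budget_minutes(slo, window_days=30):
    total_minutes = window_days * 24 * 60      # 43,200 minutes in 30 days
    return (1 - slo) * total_minutes

# 99.9% over 30 days allows roughly 43.2 minutes of unavailability
```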
### 13. Alert on Symptoms, Not Causes
Alert when the user experience degrades (high error rate, elevated latency), not when an internal metric crosses a threshold (CPU at 80%). Symptom-based alerting reduces noise and surfaces issues that actually matter.
### 14. Use Structured Logging
Emit logs as JSON with consistent fields: timestamp, level, service, trace_id, message. Structured logs are searchable, filterable, and machine-parseable — unstructured strings are not.
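A minimal structured-logging helper might look like this (the field set mirrors the list above; extra fields such as `trace_id` pass through as keyword arguments):

```python
import json
import time

def log(level, service, message, **fields):
    """Emit one JSON log line with consistent top-level fields."""
    record = {
        "timestamp": time.time(),
        "level": level,
        "service": service,
        "message": message,
        **fields,                      # e.g. trace_id, user_id
    }
    return json.dumps(record)
```

Because every line is valid JSON with stable field names, a log pipeline can filter on `service` or join on `trace_id` without regex guesswork.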
### 15. Build Dashboards for Each Service
Every service should have a dashboard showing the RED metrics (Rate, Errors, Duration) and the USE metrics (Utilization, Saturation, Errors) for its dependencies. Dashboards should answer "is this service healthy?" in under 10 seconds.
## Security
### 16. Apply the Principle of Least Privilege
Every service, user, and process should have the minimum permissions required. Use short-lived credentials, scoped IAM roles, and just-in-time access grants. Broad permissions are the root cause of most privilege escalation incidents.
### 17. Encrypt Data in Transit and at Rest
Use TLS 1.3 for all service-to-service communication. Encrypt data at rest using platform-managed keys (KMS) with automatic rotation. Never store secrets in code or environment variables — use a secrets manager.
### 18. Validate All Input
Every input from an external source is untrusted. Validate type, length, format, and range on the server side regardless of client-side checks. Use parameterized queries to prevent SQL injection and sanitize output to prevent XSS.
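Parameterized queries are the standard defense against SQL injection; this sketch uses Python's built-in `sqlite3` driver (the table and data are illustrative):

```python
import sqlite3

# Parameterized query sketch: the driver binds the value separately from
# the SQL text, so attacker-controlled input cannot alter query structure.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

def find_user(name):
    # `?` placeholder: never build SQL with string concatenation
    return conn.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

A classic injection payload like `alice' OR '1'='1` is simply treated as a literal (and non-matching) name, because it never reaches the SQL parser as code.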
### 19. Implement Rate Limiting and Throttling
Protect APIs from abuse with rate limits per client, per endpoint, and globally. Use token bucket or sliding window algorithms. Return 429 Too Many Requests with a Retry-After header so well-behaved clients can back off.
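The token bucket algorithm mentioned above can be sketched as follows (the clock is passed in explicitly, an illustrative choice that keeps the logic deterministic and testable):

```python
class TokenBucket:
    """Token-bucket rate limiter: refills `rate` tokens per second up to
    `capacity`; each request consumes one token."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now               # timestamp of the last refill

    def allow(self, now):
        elapsed = now - self.last
        self.last = now
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                  # caller returns 429 with Retry-After
```

The `capacity` parameter is what allows short bursts above the steady rate, which is why token bucket is usually preferred over a fixed window for API limits.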
### 20. Adopt Zero Trust Networking
Do not rely on network perimeter security. Authenticate and authorize every request, even between internal services. Use mutual TLS (mTLS), service mesh identity, or JWT-based service-to-service authentication.
## Data Management
### 21. Choose the Right Database for the Access Pattern
Relational databases excel at complex queries and transactions. Document stores handle flexible schemas. Key-value stores deliver sub-millisecond reads. Time-series databases optimize for append-heavy, time-ordered data. Match the storage engine to the workload.
### 22. Separate Reads from Writes (CQRS)
When read and write patterns diverge significantly, split them. A write-optimized store handles commands while read-optimized projections (materialized views, search indices) serve queries. Event sourcing pairs naturally with CQRS.
### 23. Design for Eventual Consistency
Strong consistency across distributed systems is expensive and slow. Accept eventual consistency where the business allows it, and use techniques like read-your-writes, causal consistency, or conflict-free replicated data types (CRDTs) where stronger guarantees are needed.
### 24. Implement Idempotent Operations
Network retries and message redelivery mean operations will be executed more than once. Design writes to be idempotent using idempotency keys, conditional writes, or deduplication at the consumer.
### 25. Plan for Data Migration from Day One
Schemas evolve. Use additive-only schema changes (add columns, never rename or remove in the same release). Version your data formats. Maintain backward compatibility for at least one release cycle.
## API Design
### 26. Version Your APIs
Use URL path versioning (/v1/users) or header-based versioning. Never break existing clients with a deploy. Deprecate old versions with a published timeline and migration guide.
### 27. Use Pagination for List Endpoints
Never return unbounded result sets. Use cursor-based pagination (opaque token pointing to the next page) rather than offset-based pagination (which degrades with large offsets). Include next_cursor and has_more in the response.
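Cursor-based pagination with the `next_cursor` and `has_more` fields described above can be sketched like this (the dataset, page size, and cursor encoding are illustrative; a production cursor would also be signed or encrypted):

```python
import base64
import json

ITEMS = [{"id": i} for i in range(1, 8)]   # stand-in table, sorted by id

def encode_cursor(last_id):
    # opaque token: clients must not parse or construct it themselves
    return base64.urlsafe_b64encode(
        json.dumps({"after": last_id}).encode()
    ).decode()

def decode_cursor(cursor):
    return json.loads(base64.urlsafe_b64decode(cursor))["after"]

def list_items(limit=3, cursor=None):
    after = decode_cursor(cursor) if cursor else 0
    page = [item for item in ITEMS if item["id"] > after][:limit]
    has_more = bool(page) and page[-1]["id"] < ITEMS[-1]["id"]
    return {
        "items": page,
        "next_cursor": encode_cursor(page[-1]["id"]) if has_more else None,
        "has_more": has_more,
    }
```

Because each page resumes from the last-seen sort key rather than an offset, query cost stays constant no matter how deep the client pages.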
### 28. Design Idempotent APIs
POST requests that create resources should accept an Idempotency-Key header. If the server receives a duplicate key, it returns the original response without creating a duplicate. This makes retries safe for clients.
## Deployment
### 29. Use Progressive Rollouts
Deploy changes to a small percentage of traffic first (canary), monitor error rates and latency, then gradually increase. Feature flags decouple deployment from release, allowing you to ship code without exposing it to users until it is validated.
### 30. Automate Everything in the Deploy Pipeline
From linting to testing to building to deploying — every step should be automated and reproducible. Infrastructure as code (Terraform, Pulumi) ensures environments are consistent. GitOps (ArgoCD, Flux) ensures the deployed state matches the declared state in version control.
## Quick Reference Table
| # | Practice | Category |
|---|---|---|
| 1 | Horizontal scaling | Scalability |
| 2 | Data partitioning | Scalability |
| 3 | Multi-layer caching | Scalability |
| 4 | Back-pressure | Scalability |
| 5 | Async communication | Scalability |
| 6 | Redundancy | Reliability |
| 7 | Circuit breakers | Reliability |
| 8 | Timeouts and retries | Reliability |
| 9 | Bulkheads | Reliability |
| 10 | Chaos engineering | Reliability |
| 11 | Three pillars of observability | Observability |
| 12 | SLIs, SLOs, error budgets | Observability |
| 13 | Symptom-based alerting | Observability |
| 14 | Structured logging | Observability |
| 15 | Service dashboards | Observability |
| 16 | Least privilege | Security |
| 17 | Encryption everywhere | Security |
| 18 | Input validation | Security |
| 19 | Rate limiting | Security |
| 20 | Zero trust networking | Security |
| 21 | Right database for the job | Data Management |
| 22 | CQRS | Data Management |
| 23 | Eventual consistency | Data Management |
| 24 | Idempotent operations | Data Management |
| 25 | Schema migration planning | Data Management |
| 26 | API versioning | API Design |
| 27 | Cursor-based pagination | API Design |
| 28 | Idempotent APIs | API Design |
| 29 | Progressive rollouts | Deployment |
| 30 | Automated deploy pipelines | Deployment |
## Conclusion
These 30 practices are not theoretical — they are battle-tested patterns used by engineering teams at every scale. No system needs all 30 on day one. Start with the basics (timeouts, retries, structured logging, least privilege), then layer in more sophisticated patterns (CQRS, chaos engineering, zero trust) as the system grows. Revisit this list before every design review to make sure you have not overlooked a critical dimension.
Article #318 of the Codelit system design series. Explore all articles at codelit.io.