400 Articles of System Design — The Definitive Library
Four hundred articles#
What started as a handful of notes on distributed systems has grown into a library of 400 articles covering every major area of system design. This capstone post organizes the entire collection into a structured learning resource — whether you are preparing for interviews, designing production systems, or deepening your engineering fundamentals.
The 12 categories#
Every article in the library falls into one of twelve categories. Here is the map.
1. Distributed systems fundamentals#
The bedrock. These articles cover the theoretical foundations that underpin everything else.
- CAP theorem and its practical implications
- Consensus protocols (Paxos, Raft, Zab)
- Distributed clocks (Lamport, vector, hybrid logical)
- Consistency models (linearizability, causal, eventual)
- Failure detection and membership protocols
- Split-brain prevention and fencing
Start here if: you want to understand why distributed systems behave the way they do.
2. Databases and storage#
From B-trees to LSM-trees, single-node to globally distributed.
- Relational databases (PostgreSQL internals, MySQL InnoDB)
- NoSQL paradigms (document, key-value, wide-column, graph)
- Write-ahead logs, MVCC, and transaction isolation
- Replication (synchronous, asynchronous, semi-synchronous)
- Sharding strategies and consistent hashing
- Time-series databases and columnar storage
- Database connection failover and high availability
Start here if: you work with data — which means everyone.
3. Caching#
The art of keeping hot data close.
- Cache invalidation and write strategies (TTL, write-through, write-behind)
- Redis architecture and clustering
- CDN caching and edge computing
- Multi-tier caching (browser, CDN, application, database)
- Cache stampede prevention
- Distributed cache consistency
Start here if: your system needs to be fast.
4. Networking and load balancing#
Moving bytes from A to B, reliably and quickly.
- TCP/IP deep dives, TLS handshakes, HTTP/2 and HTTP/3
- DNS-based load balancing and GeoDNS
- Layer 4 vs layer 7 load balancing
- Service mesh architecture (Envoy, Istio, Linkerd)
- gRPC, WebSockets, and Server-Sent Events
- Network partitions and their consequences
Start here if: you are debugging latency or designing API gateways.
5. Message queues and streaming#
Decoupling producers from consumers.
- Kafka architecture and exactly-once semantics
- RabbitMQ, SQS, and NATS comparison
- Event sourcing and CQRS
- Micro-batching vs true streaming
- Backpressure and flow control
- Dead letter queues and poison pill handling
Start here if: your system is event-driven or processes data pipelines.
6. API design#
The contract between systems.
- REST, GraphQL, and gRPC tradeoffs
- API versioning strategies
- Rate limiting and throttling
- Idempotency patterns
- Pagination (cursor-based, offset-based, keyset)
- API gateway patterns and BFF (Backend for Frontend)
Start here if: you are designing public or internal APIs.
7. Security and privacy#
Protecting data and systems.
- Zero-trust architecture
- Authentication (OAuth 2.0, OIDC, JWT, SAML)
- Authorization (RBAC, ABAC, ReBAC)
- Data anonymization techniques
- Encryption at rest and in transit
- Secrets management and key rotation
- Supply chain security
Start here if: you handle sensitive data or face compliance requirements.
8. Scalability patterns#
Growing from one server to thousands.
- Horizontal vs vertical scaling
- Database sharding and partitioning
- Read replicas and write scaling
- CQRS and event sourcing for scale
- Cell-based architecture
- Multi-tenancy patterns
Start here if: your system is outgrowing its current architecture.
9. Reliability and resilience#
Staying up when things go wrong.
- Circuit breakers and bulkheads
- Retry strategies with exponential backoff
- Chaos engineering principles
- Graceful degradation
- Feature flags and progressive rollouts
- Disaster recovery and RTO/RPO planning
- Database failover and replica promotion
Start here if: you carry a pager.
10. Observability#
Understanding what your system is doing.
- Distributed tracing (OpenTelemetry, Jaeger)
- Metrics design (RED, USE, golden signals)
- Structured logging at scale
- Alerting strategies that reduce noise
- SLOs, SLIs, and error budgets
- Profiling and flame graphs
Start here if: you are tired of guessing why things broke.
11. Infrastructure and deployment#
From code to production.
- Container orchestration (Kubernetes deep dives)
- CI/CD pipeline design
- Blue-green and canary deployments
- Infrastructure as code (Terraform, Pulumi)
- GitOps workflows
- Multi-region deployment patterns
- Cost optimization strategies
Start here if: you are building or maintaining the platform your services run on.
12. Data engineering#
Moving, transforming, and analysing data at scale.
- ETL vs ELT pipelines
- Data lake and lakehouse architecture
- Batch processing (Spark, MapReduce)
- Stream processing and micro-batching
- Data quality and observability
- Data governance and cataloguing
- Feature stores for ML
Start here if: you build the pipelines that feed analytics and machine learning.
Key insights from 400 articles#
After 400 articles, a few themes keep recurring. Here are the most important:
1. There are no silver bullets#
Every technology choice is a tradeoff. Synchronous replication keeps replicas consistent but adds write latency. Caching improves speed but introduces staleness. Microservices enable team autonomy but add operational complexity.
The best engineers do not chase the latest tool — they understand the tradeoffs and choose deliberately.
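The caching tradeoff is easy to see in a few lines. The following is a minimal sketch, not a production cache: `TTLCache`, the `database` dict, and the 30-second TTL are hypothetical names and values chosen for illustration, and the clock is injected so the example runs without waiting.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the demo can fake time
        self._store = {}            # key -> (value, expires_at)

    def get(self, key, loader):
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]         # hit: fast, but possibly stale
        value = loader(key)         # miss or expired: go to the source
        self._store[key] = (value, now + self.ttl)
        return value


# Hypothetical source of truth and a fake clock for the demonstration.
database = {"price": 100}
fake_now = [0.0]
cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])

first = cache.get("price", database.get)   # miss: loads from the database
database["price"] = 120                    # the source of truth changes
stale = cache.get("price", database.get)   # hit: still the old value
fake_now[0] = 31.0                         # the TTL has now elapsed
fresh = cache.get("price", database.get)   # expired: reloads the new value
```

A shorter TTL narrows the staleness window at the cost of more loads from the source, which is the tradeoff in miniature.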
2. Simplicity compounds#
The systems that survive longest are the ones that are easiest to understand. A monolith you can reason about beats a microservice sprawl nobody can debug.
Add complexity only when the problem demands it, not when the conference talk recommends it.
3. Failure is the default state#
In a distributed system, something is always failing. Design for failure from day one — retries, circuit breakers, graceful degradation, failover. The question is never if something will fail, but when and how gracefully.
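As one illustration of designing for failure, here is a minimal sketch of a retry loop with exponential backoff and full jitter. The function name `retry_with_backoff`, its default parameters, and the `flaky` operation are assumptions made for this example, not an API from any particular library.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retryable=(ConnectionError, TimeoutError),
                       sleep=time.sleep, rng=random.random):
    """Call operation() until it succeeds or max_attempts is exhausted."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retryable:
            if attempt == max_attempts - 1:
                raise                       # out of attempts: surface the error
            # The delay cap doubles on every attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the cap so that many
            # clients retrying at once do not synchronize.
            sleep(rng() * cap)


# Hypothetical operation that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


# sleep and rng are replaced here so the demo runs instantly and deterministically.
delays = []
result = retry_with_backoff(flaky, sleep=delays.append, rng=lambda: 1.0)
```

Only errors listed in `retryable` are retried; anything else propagates immediately, which avoids retrying failures that will never succeed.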
4. Observability is not optional#
You cannot fix what you cannot see. Invest in logging, metrics, and tracing before you need them. The worst time to build observability is during an outage.
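A low-cost first step is emitting logs as structured records rather than free text. The sketch below uses only Python's standard `logging` and `json` modules; the `JsonFormatter` class, the `fields` key passed through `extra`, and the `checkout` logger name are conventions assumed for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Structured context arrives via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Each call writes one JSON line that a log pipeline can index by field.
logger.info("order placed", extra={"fields": {"request_id": "r-1", "latency_ms": 42}})
```

Because every line is machine-parseable, questions such as "show all records for request r-1" become a field filter instead of a text search.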
5. Data outlives code#
Your application will be rewritten. Your data will not. Design your data model carefully, choose your database thoughtfully, and protect your data above all else.
Learning paths#
Path 1: Interview preparation (4 weeks)#
Week 1 — Fundamentals: distributed systems basics, CAP theorem, consistency models, database internals
Week 2 — Building blocks: caching, load balancing, message queues, API design
Week 3 — Patterns: sharding, replication, CQRS, event sourcing, rate limiting
Week 4 — Practice: design a URL shortener, chat system, news feed, search engine using the building blocks from weeks 1-3
Path 2: Production engineering (6 weeks)#
Week 1-2 — Reliability: failover, circuit breakers, chaos engineering, disaster recovery
Week 3-4 — Observability: distributed tracing, metrics design, alerting, SLOs
Week 5-6 — Infrastructure: Kubernetes, CI/CD, multi-region deployment, cost optimization
Path 3: Data engineering (6 weeks)#
Week 1-2 — Storage: database internals, replication, sharding, time-series databases
Week 3-4 — Processing: batch processing, stream processing, micro-batching, exactly-once semantics
Week 5-6 — Architecture: data lakes, data governance, feature stores, data quality
Interview preparation strategy#
The framework#
Most system design interviews follow a similar structure. Use this framework:
- Clarify requirements (2-3 minutes): functional requirements, non-functional requirements (latency, throughput, availability), scale (users, data volume, read/write ratio)
- High-level design (5-7 minutes): draw the major components (clients, load balancer, application servers, database, cache, message queue) and explain the data flow
- Deep dive (15-20 minutes): pick 2-3 components and design them in detail. This is where knowledge of the 400 articles pays off.
- Address bottlenecks (5 minutes): identify single points of failure, discuss scaling strategies, propose monitoring and alerting
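The scale questions in the first step usually come down to quick arithmetic. The following is a rough sketch of a back-of-envelope estimate; `estimate_load` and every input value (10,000 daily users, 20 requests each, a 9:1 read/write ratio, 1,000 bytes per write, a 2x peak factor) are assumptions picked for illustration, not figures from any real system.

```python
SECONDS_PER_DAY = 86_400


def estimate_load(daily_active_users, requests_per_user_per_day,
                  read_write_ratio, bytes_per_write, peak_factor=2.0):
    """Back-of-envelope traffic and storage numbers from stated requirements."""
    total_requests = daily_active_users * requests_per_user_per_day
    avg_qps = total_requests / SECONDS_PER_DAY
    # With R reads for every write, writes are 1 / (R + 1) of all requests.
    writes_per_day = total_requests / (read_write_ratio + 1)
    storage_per_year = writes_per_day * bytes_per_write * 365
    return {
        "avg_qps": avg_qps,
        "peak_qps": avg_qps * peak_factor,   # assumed peak-to-average multiplier
        "writes_per_day": writes_per_day,
        "storage_bytes_per_year": storage_per_year,
    }


estimate = estimate_load(
    daily_active_users=10_000,
    requests_per_user_per_day=20,
    read_write_ratio=9,
    bytes_per_write=1_000,
)
for name, value in estimate.items():
    print(f"{name}: {value:,.1f}")
```

With these inputs the average load is only a few requests per second and yearly storage is a few gigabytes, so a single database node is a defensible starting point for the design.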
What interviewers actually evaluate#
- Communication — can you explain your thinking clearly?
- Tradeoff analysis — can you articulate why you chose A over B?
- Breadth — do you know the building blocks?
- Depth — can you go deep on at least one area?
- Pragmatism — do you design for the stated requirements, or over-engineer?
The most common mistakes#
- Jumping into the solution without clarifying requirements
- Designing for Google scale when the problem says "10,000 users"
- Mentioning technologies without explaining why they are the right choice
- Ignoring failure modes and edge cases
- Not discussing monitoring and observability
What comes next#
Four hundred articles is a milestone, not a finish line. System design is a living discipline — new tools, new patterns, and new challenges emerge constantly. The library will continue to grow.
If you have read even a fraction of these articles, you have a foundation that will serve you for years. The principles do not change even as the tools do.
Build simple systems. Design for failure. Observe everything. And never stop learning.
400 articles on system design at codelit.io/blog.