system-design, milestone, architecture, interviews, learning

400 Articles of System Design — The Definitive Library

March 29, 2026 · 8 min read · By Codelit Team

Four hundred articles#

What started as a handful of notes on distributed systems has grown into a library of 400 articles covering every major area of system design. This capstone post organizes the entire collection into a structured learning resource — whether you are preparing for interviews, designing production systems, or deepening your engineering fundamentals.

The 12 categories#

Every article in the library falls into one of twelve categories. Here is the map.

1. Distributed systems fundamentals#

The bedrock. These articles cover the theoretical foundations that underpin everything else.

CAP theorem and its practical implications
Consensus protocols (Paxos, Raft, Zab)
Distributed clocks (Lamport, vector, hybrid logical)
Consistency models (linearizability, causal, eventual)
Failure detection and membership protocols
Split-brain prevention and fencing

Start here if: you want to understand why distributed systems behave the way they do.
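
Of the topics listed above, the Lamport clock is small enough to sketch in full. The following is a minimal illustration of the standard rule (increment on local events and sends, take the maximum plus one on receipt); the class and method names are hypothetical and not taken from any article in the library.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock: a counter that only ever moves forward."""
    time: int = 0

    def tick(self) -> int:
        # Local event or message send: advance the counter by one.
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        # Message receipt: move past both our own time and the sender's.
        self.time = max(self.time, sender_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
sent_at = a.tick()                 # node A stamps an outgoing message
b.tick()                           # node B handles an unrelated local event
received_at = b.receive(sent_at)   # node B receives A's message
print(sent_at, received_at)        # prints "1 2"
```

The receive timestamp is always greater than the send timestamp, which is the "happened-before" guarantee the clock exists to provide. It does not tell you whether two unrelated events were concurrent; that is what vector clocks add.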

2. Databases and storage#

From B-trees to LSM-trees, single-node to globally distributed.

Relational databases (PostgreSQL internals, MySQL InnoDB)
NoSQL paradigms (document, key-value, wide-column, graph)
Write-ahead logs, MVCC, and transaction isolation
Replication (synchronous, asynchronous, semi-synchronous)
Sharding strategies and consistent hashing
Time-series databases and columnar storage
Database connection failover and high availability

Start here if: you work with data — which means everyone.
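
Consistent hashing, mentioned in the list above, is easier to see in code than in prose. This is a rough sketch under stated assumptions: the `HashRing` class, the 100 virtual nodes per server, and the use of SHA-256 are illustrative choices, not a description of any particular database.

```python
import bisect
import hashlib


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node is placed on the ring at `vnodes` positions.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of SHA-256 as an integer position on the ring.
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._hashes, self._hash(key)) % len(self._hashes)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b", "c", "d"])
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node d")
```

The property worth noticing is that every key that moves lands on the new node; keys never shuffle between the existing nodes. With plain `hash(key) % n` sharding, most keys would move whenever `n` changes.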

3. Caching#

The art of keeping hot data close.

Cache invalidation and write strategies (TTL expiry, write-through, write-behind)
Redis architecture and clustering
CDN caching and edge computing
Multi-tier caching (browser, CDN, application, database)
Cache stampede prevention
Distributed cache consistency

Start here if: your system needs to be fast.

4. Networking and load balancing#

Moving bytes from A to B, reliably and quickly.

TCP/IP deep dives, TLS handshakes, HTTP/2 and HTTP/3
DNS-based load balancing and GeoDNS
Layer 4 vs layer 7 load balancing
Service mesh architecture (Envoy, Istio, Linkerd)
gRPC, WebSockets, and Server-Sent Events
Network partitions and their consequences

Start here if: you are debugging latency or designing API gateways.

5. Message queues and streaming#

Decoupling producers from consumers.

Kafka architecture and exactly-once semantics
RabbitMQ, SQS, and NATS comparison
Event sourcing and CQRS
Micro-batching vs true streaming
Backpressure and flow control
Dead letter queues and poison pill handling

Start here if: your system is event-driven or processes data pipelines.

6. API design#

The contract between systems.

REST, GraphQL, and gRPC tradeoffs
API versioning strategies
Rate limiting and throttling
Idempotency patterns
Pagination (cursor-based, offset-based, keyset)
API gateway patterns and BFF (Backend for Frontend)

Start here if: you are designing public or internal APIs.
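
As one concrete instance of the rate limiting topic above, here is a sketch of a token bucket, one common algorithm among several (fixed window, sliding window, and leaky bucket are others). The `TokenBucket` name, its parameters, and the injectable `clock` argument are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Hypothetical single-process token bucket (not thread-safe)."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock         # injectable so tests can control time
        self._tokens = capacity     # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=5.0, capacity=10.0)
accepted = sum(bucket.allow() for _ in range(25))
print(f"accepted {accepted} of 25 back-to-back requests")
```

The capacity sets how large a burst is tolerated and the rate sets the sustained throughput, which is why the two are separate parameters. A shared limiter across many servers would need the counter in a common store, which this sketch does not attempt.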

7. Security and privacy#

Protecting data and systems.

Zero-trust architecture
Authentication and delegated authorization (OAuth 2.0, OIDC, JWT, SAML)
Authorization (RBAC, ABAC, ReBAC)
Data anonymization techniques
Encryption at rest and in transit
Secrets management and key rotation
Supply chain security

Start here if: you handle sensitive data or face compliance requirements.

8. Scalability patterns#

Growing from one server to thousands.

Horizontal vs vertical scaling
Database sharding and partitioning
Read replicas and write scaling
CQRS and event sourcing for scale
Cell-based architecture
Multi-tenancy patterns

Start here if: your system is outgrowing its current architecture.

9. Reliability and resilience#

Staying up when things go wrong.

Circuit breakers and bulkheads
Retry strategies with exponential backoff
Chaos engineering principles
Graceful degradation
Feature flags and progressive rollouts
Disaster recovery and RTO/RPO planning
Database failover and replica promotion

Start here if: you carry a pager.

10. Observability#

Understanding what your system is doing.

Distributed tracing (OpenTelemetry, Jaeger)
Metrics design (RED, USE, golden signals)
Structured logging at scale
Alerting strategies that reduce noise
SLOs, SLIs, and error budgets
Profiling and flame graphs

Start here if: you are tired of guessing why things broke.
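
The error budget idea in the list above comes down to simple arithmetic: an availability SLO implies a fixed amount of allowed downtime per window. The helper below is a hypothetical illustration; the 30-day window is an assumed default, and real SLOs are often measured in failed requests rather than minutes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.1f} min per 30 days")
```

Each extra nine cuts the budget by a factor of ten: 99.9% over 30 days allows about 43.2 minutes, while 99.99% allows about 4.3.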

11. Infrastructure and deployment#

From code to production.

Container orchestration (Kubernetes deep dives)
CI/CD pipeline design
Blue-green and canary deployments
Infrastructure as code (Terraform, Pulumi)
GitOps workflows
Multi-region deployment patterns
Cost optimization strategies

Start here if: you are building or maintaining the platform your services run on.

12. Data engineering#

Moving, transforming, and analyzing data at scale.

ETL vs ELT pipelines
Data lake and lakehouse architecture
Batch processing (Spark, MapReduce)
Stream processing and micro-batching
Data quality and observability
Data governance and cataloging
Feature stores for ML

Start here if: you build the pipelines that feed analytics and machine learning.

Key insights from 400 articles#

After writing 400 articles, patterns emerge. Here are the most important recurring themes:

1. There are no silver bullets#

Every technology choice is a tradeoff. Synchronous replication keeps replicas in step with the primary, but every commit waits for a replica to acknowledge it, which adds write latency. Caching speeds up reads but can serve stale data. Microservices enable team autonomy but add operational complexity.

The best engineers do not chase the latest tool — they understand the tradeoffs and choose deliberately.

2. Simplicity compounds#

The systems that survive longest are the ones that are easiest to understand. A monolith you can reason about beats a microservice sprawl nobody can debug.

Add complexity only when the problem demands it, not when the conference talk recommends it.

3. Failure is the default state#

In a distributed system, something is always failing. Design for failure from day one — retries, circuit breakers, graceful degradation, failover. The question is never if something will fail, but when and how gracefully.
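
Retries are the simplest of these techniques to show. The sketch below uses exponential backoff with full jitter, which is one reasonable policy rather than the only one; the `retry` function, its defaults, and the choice of which exceptions count as transient are all assumptions for illustration.

```python
import random
import time


def retry(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: the ceiling doubles each attempt, up to max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: pick a random delay below the ceiling so that many
            # clients failing at once do not all retry at the same instant.
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky():
        # Hypothetical operation that fails twice, then succeeds.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(retry(flaky), "after", attempts["count"], "attempts")
```

Retrying is only safe when the operation is idempotent, and unbounded retries can turn a partial outage into a full one, which is why the sketch caps both the attempt count and the delay.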

4. Observability is not optional#

You cannot fix what you cannot see. Invest in logging, metrics, and tracing before you need them. The worst time to build observability is during an outage.

5. Data outlives code#

Your application will be rewritten. Your data will not. Design your data model carefully, choose your database thoughtfully, and protect your data above all else.

Learning paths#

Path 1: Interview preparation (4 weeks)#

Week 1 — Fundamentals: distributed systems basics, CAP theorem, consistency models, database internals

Week 2 — Building blocks: caching, load balancing, message queues, API design

Week 3 — Patterns: sharding, replication, CQRS, event sourcing, rate limiting

Week 4 — Practice: design a URL shortener, chat system, news feed, search engine using the building blocks from weeks 1-3

Path 2: Production engineering (6 weeks)#

Week 1-2 — Reliability: failover, circuit breakers, chaos engineering, disaster recovery

Week 3-4 — Observability: distributed tracing, metrics design, alerting, SLOs

Week 5-6 — Infrastructure: Kubernetes, CI/CD, multi-region deployment, cost optimization

Path 3: Data engineering (6 weeks)#

Week 1-2 — Storage: database internals, replication, sharding, time-series databases

Week 3-4 — Processing: batch processing, stream processing, micro-batching, exactly-once semantics

Week 5-6 — Architecture: data lakes, data governance, feature stores, data quality

Interview preparation strategy#

The framework#

Most system design interviews follow a similar structure. Use this framework:

Clarify requirements (2-3 minutes) — functional requirements, non-functional requirements (latency, throughput, availability), scale (users, data volume, read/write ratio)
High-level design (5-7 minutes) — draw the major components (clients, load balancer, application servers, database, cache, message queue) and explain the data flow
Deep dive (15-20 minutes) — pick 2-3 components and design them in detail. This is where knowledge of the 400 articles pays off.
Address bottlenecks (5 minutes) — identify single points of failure, discuss scaling strategies, propose monitoring and alerting
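
Adding up the time ranges in the four steps above gives a sense of how much slack the framework leaves. The snippet is only a worked sum of the numbers already stated; the `PHASES` name is hypothetical.

```python
# (minimum, maximum) minutes for each phase, as given in the framework above.
PHASES = {
    "Clarify requirements": (2, 3),
    "High-level design": (5, 7),
    "Deep dive": (15, 20),
    "Address bottlenecks": (5, 5),
}

low = sum(minimum for minimum, _ in PHASES.values())
high = sum(maximum for _, maximum in PHASES.values())
print(f"Framework total: {low}-{high} minutes")  # prints "Framework total: 27-35 minutes"
```

The deep dive takes more than half of the total at either end of the range, so that is the phase to protect if the earlier steps start running long.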

What interviewers actually evaluate#

Communication — can you explain your thinking clearly?
Tradeoff analysis — can you articulate why you chose A over B?
Breadth — do you know the building blocks?
Depth — can you go deep on at least one area?
Pragmatism — do you design for the stated requirements, or over-engineer?

The most common mistakes#

Jumping into the solution without clarifying requirements
Designing for Google scale when the problem says "10,000 users"
Mentioning technologies without explaining why they are the right choice
Ignoring failure modes and edge cases
Not discussing monitoring and observability

What comes next#

Four hundred articles is a milestone, not a finish line. System design is a living discipline — new tools, new patterns, and new challenges emerge constantly. The library will continue to grow.

If you have read even a fraction of these articles, you have a foundation that will serve you for years. The principles do not change even as the tools do.

Build simple systems. Design for failure. Observe everything. And never stop learning.

400 articles on system design at codelit.io/blog.

