The System Design Encyclopedia: 250 Articles Covering Every Core Topic
250 articles. What started as a single post about load balancing has grown into a comprehensive system design library. This milestone article is your organized reference — every topic grouped by category so you can find exactly what you need.
By the Numbers
| Category | Articles | Coverage |
|---|---|---|
| Fundamentals | 18 | Core concepts every engineer needs |
| Distributed Systems | 20 | Consensus, messaging, failure handling |
| Architecture Patterns | 19 | Structural approaches for large systems |
| Interview Questions | 18 | Classic design problems with solutions |
| Infrastructure | 18 | Platform, deployment, and operations |
| Security | 12 | Auth, encryption, and threat mitigation |
| Data | 15 | Storage, processing, and pipelines |
| AI/ML Systems | 14 | Production ML and LLM infrastructure |
Every article includes practical examples, trade-off analysis, and production recommendations. No filler — each piece targets a specific concept you will encounter in real systems or interviews.
Fundamentals
The building blocks every engineer should know cold.
- Load Balancing — distribute traffic across servers using round-robin, least connections, consistent hashing, and L4/L7 strategies
- Caching — reduce latency and database load with Redis, Memcached, CDN caching, and cache invalidation patterns
- Horizontal vs Vertical Scaling — when to scale up a single machine versus scaling out to many
- CAP Theorem — the fundamental trade-off between consistency, availability, and partition tolerance
- Latency and Throughput — measuring, benchmarking, and optimizing system performance
- DNS and Networking — how requests travel from browser to server and back
- API Design — REST, GraphQL, gRPC, and WebSocket patterns for clean interfaces
- Rate Limiting — protecting systems from abuse with token bucket, sliding window, and distributed rate limiters
- Idempotency — designing operations that are safe to retry without side effects
- Pagination — cursor-based, offset, and keyset pagination for large result sets
- Back-of-the-Envelope Estimation — quick math to validate system design decisions
- Proxies and Reverse Proxies — forwarding requests, SSL termination, and traffic shaping
- Content Delivery Networks — edge caching, cache invalidation, and global content distribution
- Hashing Algorithms — MD5, SHA, and when to use cryptographic vs non-cryptographic hashes
- TCP vs UDP — reliable vs fast delivery and when each protocol matters
- HTTP/2 and HTTP/3 — multiplexing, server push, QUIC, and modern protocol improvements
- Serialization Formats — JSON, Protocol Buffers, Avro, MessagePack, and trade-offs
- Webhooks — push-based integrations, retry logic, and security considerations
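As an illustration of the token bucket approach listed under Rate Limiting, here is a minimal single-process sketch. The class name, parameters, and the choice to pass the clock in explicitly are assumptions made for this example, not code taken from the linked article.

```python
class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float, now: float = 0.0):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = now         # timestamp of the most recent refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing `now` in (for example from `time.monotonic()`) keeps the logic easy to test. A distributed limiter would need shared state, which this sketch does not attempt.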
Distributed Systems
The hard problems that emerge when you add a network between components.
- Consensus Algorithms — Raft, Paxos, and how distributed nodes agree on state
- Service Discovery — how microservices find each other with Consul, etcd, ZooKeeper, and Kubernetes DNS
- Distributed Transactions — two-phase commit, saga patterns, and eventual consistency
- Event-Driven Architecture — using events to decouple services with Kafka, RabbitMQ, and SNS/SQS
- Message Queues — reliable async communication between services
- Leader Election — choosing a coordinator in a distributed cluster
- Consistent Hashing — distributing data across nodes with minimal redistribution on changes
- Vector Clocks and CRDTs — tracking causality and resolving conflicts without coordination
- Gossip Protocols — how nodes share state in large decentralized clusters
- Circuit Breakers — preventing cascading failures when downstream services degrade
- Distributed Locking — coordinating exclusive access across multiple nodes with Redlock and ZooKeeper
- Write-Ahead Logs — durability and replication through append-only log structures
- Bulkhead Pattern — isolating failures to prevent system-wide outages
- Backpressure — handling overload by signaling producers to slow down
- Quorum Reads and Writes — tunable consistency with R + W > N guarantees
- Crashing vs Byzantine Failures — failure models and what your system should tolerate
- Cluster Membership — detecting joins, leaves, and failures in dynamic clusters
- Partitioned Logs — Kafka-style ordered, durable, partitioned event streams
- Anti-Entropy and Merkle Trees — detecting and repairing data inconsistencies between replicas
- Conflict Resolution — last-write-wins, merge functions, and application-level strategies
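To make the "minimal redistribution" claim under Consistent Hashing concrete, the following is a small sketch of a hash ring with virtual nodes. The `HashRing` name, the use of SHA-256 for ring positions, and the default of 64 virtual nodes are illustrative assumptions.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a ring position (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes: int = 64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Each node owns several points so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_point(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(p, n) for p, n in self._ring if n != node]

    def lookup(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring has no nodes")
        # First virtual node at or after the key's point, wrapping around.
        i = bisect.bisect_left(self._ring, (_point(key),))
        return self._ring[i % len(self._ring)][1]
```

When a node is removed, only the keys it owned change owner; when a node is added, keys either stay put or move to the new node.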
Architecture Patterns
Structural approaches for organizing large systems.
- Microservices vs Monolith — when to split and when to stay together
- CQRS — separating read and write models for performance and scalability
- Event Sourcing — storing state as a sequence of events instead of current snapshots
- Domain-Driven Design — bounded contexts, aggregates, and ubiquitous language
- Hexagonal Architecture — ports and adapters for testable, framework-independent code
- Strangler Fig Pattern — incrementally migrating from monolith to microservices
- Sidecar and Ambassador Patterns — extending service functionality without code changes
- API Gateway — centralized entry point for routing, auth, rate limiting, and transformation
- BFF (Backend for Frontend) — tailored APIs for different client types
- Saga Pattern — managing distributed transactions through orchestration or choreography
- Cell-Based Architecture — isolating blast radius with independent, self-contained cells
- Multi-Tenancy — sharing infrastructure between tenants with proper isolation
- Feature Flags — decoupling deployment from release with progressive rollouts
- Clean Architecture — dependency inversion and layered boundaries for maintainable code
- Modular Monolith — monolith structure with clear module boundaries as a stepping stone
- Outbox Pattern — reliable event publishing from transactional databases
- Throttling and Debouncing — controlling request frequency at the application layer
- Plugin Architecture — extensible systems with runtime-loadable modules
- Service Mesh — infrastructure-layer networking with Istio, Linkerd, and Consul Connect
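One common way to implement the progressive rollouts mentioned under Feature Flags is to hash each user into a stable bucket. The sketch below assumes a hypothetical `is_enabled` helper and a 0-100 percentage; real flag services typically add targeting rules and remote configuration on top.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the flag's rollout percentage."""
    # Hash flag and user together so each flag gets an independent bucketing.
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable value in 0..99
    return bucket < rollout_percent
```

Because the bucket is derived from the user ID rather than chosen at random per request, a user who sees the feature at 10% keeps seeing it as the rollout grows to 50% or 100%.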
Interview Questions
System design problems commonly asked in technical interviews.
- Design a URL Shortener — hashing, base62 encoding, read-heavy optimization
- Design a Chat System — WebSockets, message ordering, presence, and offline delivery
- Design a Rate Limiter — algorithms, distributed coordination, and edge cases
- Design a Notification System — multi-channel delivery, templating, preferences, and retries
- Design a News Feed — fan-out on write vs read, ranking, and caching strategies
- Design a Search Autocomplete — trie data structures, ranking, and latency optimization
- Design a File Storage System — chunking, deduplication, metadata, and CDN distribution
- Design a Metrics and Monitoring System — time-series storage, aggregation, and alerting
- Design a Payment System — idempotency, state machines, reconciliation, and PCI compliance
- Design a Video Streaming Platform — transcoding, adaptive bitrate, CDN, and DRM
- Design a Ride-Sharing Service — geospatial indexing, matching, pricing, and ETA
- Design a Distributed Cache — partitioning, eviction, replication, and consistency
- Design a Web Crawler — politeness, deduplication, frontier management, and distributed crawling
- Design a Ticket Booking System — seat locking, race conditions, overbooking prevention
- Design a Social Graph — friend-of-friend queries, graph storage, and privacy controls
- Design a Location-Based Service — geohashing, proximity search, and real-time tracking
- Design a Collaborative Editor — operational transforms, CRDTs, and real-time sync
- Design an Ad Serving System — auction mechanics, targeting, real-time bidding, and analytics
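The base62 encoding named in the URL shortener entry can be sketched in a few lines. The alphabet ordering (digits, then lowercase, then uppercase) and the idea of encoding a numeric database ID are assumptions for illustration; any fixed 62-character alphabet works the same way.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode(n: int) -> str:
    """Encode a non-negative integer ID as a short base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def decode(code: str) -> int:
    """Invert encode(); raises ValueError on characters outside the alphabet."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Seven base62 characters cover 62^7 (about 3.5 trillion) distinct IDs, which is the kind of figure a back-of-the-envelope estimate in an interview usually starts from.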
Infrastructure
The platform layer that keeps everything running.
- Kubernetes — container orchestration, pod networking, and autoscaling
- CI/CD Pipelines — automated build, test, and deploy workflows
- Infrastructure as Code — Terraform, Pulumi, and declarative infrastructure management
- Container Networking — overlay networks, service mesh, and network policies
- Observability — logs, metrics, and traces, the three pillars of understanding production
- Chaos Engineering — intentionally breaking things to build resilience
- Blue-Green and Canary Deployments — safe release strategies with instant rollback
- Database Migration Strategies — zero-downtime schema changes and data migrations
- Auto-Scaling — CPU, queue depth, and custom metric-based scaling policies
- Connection Pooling — PgBouncer, ProxySQL, and managing database connections at scale
- Edge Computing — moving compute closer to users for latency-sensitive workloads
- GitOps — using Git as the single source of truth for infrastructure state
- Service Level Objectives — defining SLIs, SLOs, and SLAs with error budgets
- Incident Management — on-call rotations, runbooks, postmortems, and blameless culture
- Load Testing — stress testing with k6, Locust, and Gatling to find breaking points
- DNS and Traffic Management — weighted routing, failover, and geo-based DNS strategies
- Serverless Architecture — Lambda, Cloud Functions, and event-driven compute without servers
- Multi-Region Deployment — active-active, active-passive, and data replication across regions
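The error budgets mentioned under Service Level Objectives come down to simple arithmetic. The sketch below assumes an availability-style SLO expressed as a fraction and a 30-day window; the function names are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once it is exhausted)."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)
```

For example, a 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime, so a 21.6-minute incident spends about half of the budget.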
Security
Protecting systems, data, and users.
- Zero Trust Architecture — never trust, always verify: identity-based security for every request
- OAuth 2.0 and OIDC — modern authentication and authorization flows
- API Security — protecting APIs with authentication, encryption, and input validation
- Secrets Management — Vault, AWS Secrets Manager, and rotating credentials safely
- DDoS Protection — rate limiting, WAF, and traffic scrubbing at scale
- mTLS — mutual TLS for service-to-service encryption and authentication
- RBAC and ABAC — role-based and attribute-based access control models
- Supply Chain Security — securing dependencies, container images, and build pipelines
- Data Encryption — at rest, in transit, and application-layer encryption patterns
- CORS and CSP — browser security headers and cross-origin resource policies
- Penetration Testing — methodologies, tools, and integrating security into CI/CD
- JWT Security — token signing, rotation, revocation, and common pitfalls
Data
Storage, processing, and movement of data at scale.
- Data Partitioning and Sharding — hash, range, directory, and geo sharding strategies
- Database Replication — leader-follower, multi-leader, and leaderless replication
- SQL vs NoSQL — choosing the right data model for your access patterns
- Time-Series Databases — storing and querying metrics, IoT, and financial data
- Data Lakes and Warehouses — centralized analytics storage with Snowflake, BigQuery, and Delta Lake
- Change Data Capture — streaming database changes with Debezium and Kafka Connect
- ETL and Data Pipelines — batch and streaming data transformation workflows
- Graph Databases — modeling relationships with Neo4j, Neptune, and Dgraph
- Bloom Filters and Probabilistic Data Structures — space-efficient membership testing
- LSM Trees and B-Trees — the storage engine foundations behind modern databases
- Data Governance — lineage, cataloging, quality, and compliance at scale
- Object Storage — S3, GCS, MinIO, and designing for unstructured data at petabyte scale
- Full-Text Search — Elasticsearch, OpenSearch, and inverted index architectures
- Data Versioning — tracking dataset changes for reproducibility and rollback
- Stream Processing — Flink, Spark Streaming, and real-time event transformation
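The space-efficient membership testing described under Bloom Filters can be sketched as a bit array plus a few hash functions. The sizes below and the trick of deriving several hashes by salting SHA-256 are assumptions chosen for readability rather than speed.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions by salting the item with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

The filter never returns a false negative, but it can return false positives, so it is typically placed in front of a more expensive exact lookup.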
AI/ML Systems
The infrastructure behind machine learning in production.
- ML System Design — training pipelines, feature stores, model serving, and monitoring
- RAG Architecture — retrieval-augmented generation for grounded LLM applications
- Vector Databases — storing and querying embeddings with Pinecone, Weaviate, and pgvector
- Feature Stores — centralized feature management for training and serving consistency
- Model Serving — real-time inference, batching, A/B testing, and canary rollouts
- LLM Infrastructure — hosting, fine-tuning, prompt management, and cost optimization
- AI Gateway Patterns — routing, caching, fallback, and rate limiting for AI APIs
- Embedding Pipelines — generating, storing, and indexing vector embeddings at scale
- ML Observability — monitoring model performance, drift detection, and retraining triggers
- GPU Infrastructure — scheduling, multi-tenancy, and cost optimization for training workloads
- Data Labeling Pipelines — human-in-the-loop, active learning, and quality assurance
- A/B Testing for ML — experiment design, statistical significance, and model comparison
- Prompt Engineering Patterns — chain-of-thought, few-shot, and structured output techniques
- AI Agent Architecture — tool use, planning loops, memory, and orchestration frameworks
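The retrieval step behind the Vector Databases and RAG Architecture entries is, at its core, a nearest-neighbor search over embeddings. The brute-force sketch below uses cosine similarity over a small in-memory dictionary; the function names and toy two-dimensional vectors are assumptions, and real vector databases use approximate indexes instead of scanning every vector.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index: dict, k: int = 2) -> list:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

In a RAG pipeline, the returned ids would be used to fetch the matching text chunks, which are then placed into the prompt.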
How to Use This Encyclopedia
If you are preparing for interviews: Start with Fundamentals, then work through the Interview Questions section. Use the Architecture Patterns and Distributed Systems categories to deepen your answers.
If you are building production systems: Jump to the specific topic you need. Each article includes practical code examples, trade-off analysis, and real-world recommendations.
If you are learning system design from scratch: Read Fundamentals front to back, then branch into whichever category interests you most.
Recommended Learning Paths
Path 1: Interview Prep (4-6 weeks)
- Fundamentals (weeks 1-2) — load balancing, caching, CAP theorem, API design
- Architecture Patterns (week 3) — microservices, CQRS, event sourcing
- Distributed Systems (week 4) — consensus, consistent hashing, circuit breakers
- Interview Questions (weeks 5-6) — practice end-to-end designs with trade-off discussions
Path 2: Production Engineering (ongoing)
- Infrastructure — Kubernetes, CI/CD, observability, auto-scaling
- Security — zero trust, mTLS, secrets management
- Data — partitioning, replication, CDC, stream processing
- Distributed Systems — deep dive into failure modes and recovery
Path 3: AI/ML Engineering
- Fundamentals — API design, caching, rate limiting
- Data — vector databases, search, stream processing
- AI/ML Systems — RAG, model serving, embedding pipelines, AI gateways
- Infrastructure — GPU scheduling, serverless, observability
What Comes Next
250 articles is a milestone, not a finish line. System design evolves as infrastructure evolves — new patterns emerge, old patterns get refined, and the community keeps pushing the boundaries of what distributed systems can do.
The next 250 will go deeper: more production war stories, more code-level implementations, more diagrams, and more coverage of the AI/ML infrastructure wave reshaping how we build systems.
250 articles on system design at codelit.io/blog.