# Distributed Systems Fundamentals Every Developer Should Know
The moment your system runs on more than one machine, you're building a distributed system. And distributed systems have rules that can't be ignored.
This guide covers the fundamentals that every engineer needs — from CAP theorem to consensus algorithms, explained with practical examples.
## The Eight Fallacies of Distributed Computing
In 1994, Peter Deutsch listed assumptions that developers mistakenly make about networks (James Gosling added the eighth in 1997). They're all wrong:
- The network is reliable — packets get dropped, connections timeout
- Latency is zero — cross-region calls add 50-200ms
- Bandwidth is infinite — large payloads cause congestion
- The network is secure — encrypt everything, trust nothing
- Topology doesn't change — servers scale up, down, and fail
- There is one administrator — teams manage different parts
- Transport cost is zero — data transfer has real cost (especially cloud egress)
- The network is homogeneous — different protocols, versions, and configurations
Implication: Design for failure from day one.
## CAP Theorem
A distributed system can guarantee at most two of the following three properties:
- Consistency — every read returns the most recent write
- Availability — every request gets a response (even if stale)
- Partition tolerance — system works despite network splits
Since network partitions are unavoidable, the real choice is:
| System | Choice | Example |
|---|---|---|
| CP | Consistency + Partition tolerance | ZooKeeper, etcd, Spanner |
| AP | Availability + Partition tolerance | Cassandra, DynamoDB, CouchDB |
CA (Consistency + Availability) only works on a single machine — not distributed.
### In Practice
Most systems aren't purely CP or AP. They're tunable:
- DynamoDB: configurable `ConsistentRead` per query
- Cassandra: tunable consistency level per query (ONE, QUORUM, ALL)
- PostgreSQL replicas: sync (CP) or async (AP) replication
## Consistency Models
From strongest to weakest:
### 1. Strong Consistency (Linearizability)
Every read returns the most recent write, as if there were only one copy of the data.
Cost: Higher latency (consensus required), lower throughput.
Used by: Google Spanner, CockroachDB, etcd
### 2. Sequential Consistency
All operations appear in some sequential order. Operations from each client appear in the order they were issued.
### 3. Causal Consistency
Operations that are causally related are seen in order. Concurrent operations may be seen in any order.
Used by: MongoDB (causal sessions)
### 4. Eventual Consistency
All replicas converge to the same state... eventually. Reads may return stale data.
Cost: Lowest latency, highest throughput.
Used by: DynamoDB (default), Cassandra, DNS, CDN caches
### Which to Choose?
| Use Case | Consistency Level |
|---|---|
| Bank account balance | Strong |
| User profile updates | Causal or Eventual |
| Social media feed | Eventual |
| Inventory count (e-commerce) | Strong (or compensate) |
| Analytics counters | Eventual |
| Chat message ordering | Causal |
## Consensus Algorithms
How do multiple nodes agree on a value?
### Raft (Most Popular)
- Leader election — one node is the leader, others are followers
- Log replication — leader receives writes, replicates to followers
- Commitment — once majority (quorum) acknowledges, write is committed
```
Client → Leader → Follower 1 (ACK)
                → Follower 2 (ACK)   ← majority reached, committed!
                → Follower 3 (ACK)
```
Used by: etcd (Kubernetes), CockroachDB, Consul, TiKV
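The majority-commit rule above can be sketched in a few lines of Python. This is a toy model; real Raft also handles terms, leader elections, and log matching, and all names here are invented:

```python
class Leader:
    """Toy model of Raft's majority-commit rule."""

    def __init__(self, followers):
        # each follower is a callable: entry -> bool (True means it ACKed)
        self.followers = followers
        self.log = []
        self.commit_index = -1

    def append(self, entry) -> bool:
        """Replicate one entry; commit once a strict majority has it."""
        self.log.append(entry)
        acks = 1 + sum(1 for f in self.followers if f(entry))  # leader counts itself
        if acks > (1 + len(self.followers)) // 2:              # strict majority
            self.commit_index = len(self.log) - 1
            return True
        return False

up = lambda entry: True     # healthy follower
down = lambda entry: False  # crashed follower

# 5-node cluster with 2 followers down: 3 of 5 acks is still a majority
leader = Leader([up, up, down, down])
print(leader.append("x=1"))  # True
```

The quorum arithmetic is the whole point: a 5-node cluster tolerates 2 failures, a 3-node cluster tolerates 1.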
### Paxos
Theoretical foundation for consensus. More complex than Raft but equivalent in capability. Raft was designed to be "understandable Paxos."
Used by: Google Chubby, Spanner (Multi-Paxos)
## Distributed Transactions
### Two-Phase Commit (2PC)
A coordinator ensures all participants commit or all abort:
```
Phase 1 (Prepare): Coordinator → "Can you commit?"
                   Participant A → "Yes"
                   Participant B → "Yes"
Phase 2 (Commit):  Coordinator → "Commit!"
                   Participant A → committed
                   Participant B → committed
```
Problem: If coordinator fails after Phase 1, participants are stuck (blocking protocol).
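A minimal sketch of the two phases (class and function names are invented; a real coordinator must also durably log its decision, which is exactly the part that blocks participants when it crashes mid-protocol):

```python
class Participant:
    """Toy resource manager for the sketch."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"
    def prepare(self) -> bool:    # Phase 1: vote yes/no, hold locks
        return self.can_commit
    def commit(self):             # Phase 2: make the change durable
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants) -> str:
    """Toy 2PC coordinator: commit everywhere only if everyone votes yes."""
    if all(p.prepare() for p in participants):  # Phase 1
        for p in participants:
            p.commit()                          # Phase 2
        return "committed"
    for p in participants:                      # any "no" vote aborts everywhere
        p.abort()
    return "aborted"

print(two_phase_commit([Participant(), Participant()]))                  # committed
print(two_phase_commit([Participant(), Participant(can_commit=False)]))  # aborted
```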
### Saga Pattern
Break the transaction into local transactions with compensating actions:
1. Order Service: create order ✓
2. Payment Service: charge card ✓
3. Inventory Service: reserve stock ✗ (failed!)
4. Compensate: Payment Service: refund card
5. Compensate: Order Service: cancel order
Better for microservices — no distributed locks, no coordinator bottleneck.
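The compensation flow above can be sketched as a small executor (names are invented; a production saga also needs to persist its progress and retry the compensations themselves):

```python
def run_saga(steps):
    """Toy saga executor: each step is (action, compensation). On failure,
    run the compensations for completed steps in reverse order."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):   # roll back what already happened
                undo()
            return "compensated"
    return "completed"

log = []

def reserve_stock():
    raise RuntimeError("out of stock")    # step 3 fails

steps = [
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (lambda: log.append("card charged"),  lambda: log.append("card refunded")),
    (reserve_stock,                       lambda: None),
]
print(run_saga(steps))  # compensated
print(log)  # ['order created', 'card charged', 'card refunded', 'order cancelled']
```

Note the reverse order: the refund runs before the cancellation, mirroring steps 4 and 5 above.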
## Failure Modes
### Crash Failure
Server stops completely. Easiest to handle — health checks detect it, load balancer routes around it.
### Byzantine Failure
Server sends wrong/contradictory information. Hardest to handle — need Byzantine fault tolerance (BFT). Rare in controlled environments, common in blockchain.
### Partial Failure
Some components fail while others work. The defining characteristic of distributed systems.
Design principle: Always handle partial failure. Never assume all-or-nothing.
## Key Patterns
### Circuit Breaker
Stop calling a failing service to prevent cascade failures:
```
Closed (normal) → errors exceed threshold → Open (fail fast)
Open            → timer expires           → Half-Open (try one request)
Half-Open       → success                 → Closed
Half-Open       → failure                 → Open
```
Tools: Resilience4j, Polly, Istio
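A minimal state-machine sketch of the transitions above (thresholds, timeouts, and names are all invented here; libraries like Resilience4j provide hardened versions):

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: Closed -> Open after `threshold` consecutive
    failures; Open -> Half-Open after `reset_after` seconds; one trial
    request then closes on success or re-opens on failure."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold, self.reset_after, self.clock = threshold, reset_after, clock
        self.failures, self.state, self.opened_at = 0, "closed", 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"          # timer expired: allow one trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state, self.opened_at = "open", self.clock()
            raise
        self.state, self.failures = "closed", 0
        return result

t = [0.0]  # fake clock so the sketch is deterministic
cb = CircuitBreaker(threshold=2, reset_after=5.0, clock=lambda: t[0])

def flaky():
    raise IOError("service down")

for _ in range(2):                # two consecutive failures trip the breaker
    try:
        cb.call(flaky)
    except IOError:
        pass
print(cb.state)                   # open
t[0] = 6.0                        # reset timer expires
print(cb.call(lambda: "ok"))      # ok (half-open trial succeeded)
print(cb.state)                   # closed
```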
### Retry with Exponential Backoff
```
Attempt 1: wait 100ms
Attempt 2: wait 200ms
Attempt 3: wait 400ms
Attempt 4: wait 800ms (+ jitter)
```
Always add jitter — without it, all clients retry at the same time (thundering herd).
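A sketch of the schedule above using "full jitter", where each wait is drawn uniformly between zero and the exponential ceiling (function name and defaults are mine; the 100ms/200ms/400ms schedule above is the no-jitter ceiling):

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0):
    """Exponential backoff with full jitter: attempt n waits a uniformly
    random time in [0, min(cap, base * 2**n)], so retrying clients spread
    out instead of arriving in synchronized waves."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

for n, delay in enumerate(backoff_delays(4), start=1):
    print(f"Attempt {n}: sleeping {delay * 1000:.0f}ms")
```

The `cap` matters in practice: without it, attempt 10 would wait nearly two minutes.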
### Idempotency
Same request sent multiple times produces the same result.
```
POST /payments { idempotency_key: "abc123", amount: 100 }
→ First call:  charges $100, returns payment_id
→ Second call: returns same payment_id (no double charge)
```
Critical for: Payments, order creation, any operation with side effects.
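The flow above can be sketched with an in-memory key-to-result table (names are invented; a real service stores keys durably and typically rejects a retry whose payload differs from the original):

```python
class PaymentService:
    """Toy idempotency-key deduplication: the first request with a given
    key performs the side effect and caches the result; retries replay
    the cached result instead of charging again."""

    def __init__(self):
        self._results = {}   # idempotency_key -> payment_id
        self._next_id = 0

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self._results:       # retry: replay stored result
            return self._results[idempotency_key]
        self._next_id += 1                         # first call: do the side effect
        payment_id = f"pay_{self._next_id}"
        self._results[idempotency_key] = payment_id
        return payment_id

svc = PaymentService()
print(svc.charge("abc123", 100))  # pay_1
print(svc.charge("abc123", 100))  # pay_1 (same id, no double charge)
```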
### Bulkhead
Isolate components so one failure doesn't take down everything:
```
Thread Pool A (20 threads) → Payment Service
Thread Pool B (10 threads) → Email Service
Thread Pool C (5 threads)  → Analytics Service
```
If Analytics is slow, only Pool C is affected.
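A bulkhead can also be built from a counting semaphore that caps in-flight calls to one dependency (names are invented; the thread-pool-per-dependency layout above is the other common form):

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency so a slow service cannot
    exhaust shared resources."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Reject immediately instead of queueing when the pool is full:
        # queued callers would just pile up behind the slow dependency.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn()
        finally:
            self._slots.release()

analytics = Bulkhead(1)

def while_one_call_is_in_flight():
    # We are inside analytics.call here, so the single slot is taken
    try:
        analytics.call(lambda: "inner")
    except RuntimeError as exc:
        return str(exc)

print(analytics.call(while_one_call_is_in_flight))  # bulkhead full: rejecting call
```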
## Clocks and Ordering
Distributed systems have no global clock. Two approaches:
### Logical Clocks (Lamport Timestamps)
Increment a counter with every event. If A→B (A happened before B), then timestamp(A) < timestamp(B).
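A minimal Lamport clock following the two rules: increment on every local event (including sends), and on receive jump past the sender's timestamp:

```python
class LamportClock:
    """Lamport timestamp. Guarantees A -> B implies ts(A) < ts(B),
    but not the converse: equal or ordered timestamps alone don't
    prove causality."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or message send."""
        self.time += 1
        return self.time

    def receive(self, sender_ts: int) -> int:
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, sender_ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
send_ts = a.tick()            # A sends a message at ts=1
recv_ts = b.receive(send_ts)  # B receives it at ts=2
print(send_ts < recv_ts)      # True: the send is ordered before the receive
```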
### Vector Clocks
Each node maintains a vector of counters. Can detect concurrent events (neither happened before the other).
Used by: DynamoDB (conflict detection), Riak
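The comparison rule can be sketched directly: one clock happened before another iff every component is less than or equal and at least one is strictly less; if neither dominates, the events are concurrent (function name is mine):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (node -> counter). Returns 'before',
    'after', 'equal', or 'concurrent'; 'concurrent' means a conflict
    the application must resolve."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

print(compare({"n1": 1}, {"n1": 2}))                     # before
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))   # concurrent
```

The last case is what a Lamport timestamp cannot express: neither write saw the other, so the store must keep both versions or merge them.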
### Hybrid Logical Clocks (HLC)
Combine physical time with logical counters for causally-ordered timestamps.
Used by: CockroachDB, YugabyteDB
## Architecture Example
### Distributed E-Commerce
```
Client → API Gateway (L7 LB)
         → Order Service        → PostgreSQL (primary + replica)
         → Payment Service      → Stripe API
         → Inventory Service    → Redis + PostgreSQL
         → Notification Service → Kafka → Email/Push workers

Cross-cutting:
         → Service Mesh (Envoy) for circuit breaking, retries, mTLS
         → Distributed Tracing (Jaeger) for request tracking
         → Centralized Logging (ELK) for debugging
```
## Summary
| Concept | Key Takeaway |
|---|---|
| CAP Theorem | Choose CP or AP per use case, not globally |
| Consistency | Strong for money, eventual for feeds |
| Consensus | Raft is the default — use etcd or similar |
| Transactions | Sagas over 2PC for microservices |
| Failures | Design for partial failure, always |
| Retries | Exponential backoff + jitter + idempotency |
| Circuit Breaker | Fail fast to prevent cascades |
| Clocks | No global time — use logical or vector clocks |
Design distributed systems at codelit.io — generate interactive architecture diagrams, simulate failures with Chaos Mode, and export infrastructure code.