# Distributed Systems Fundamentals Every Developer Should Know
The moment your system runs on more than one machine, you're building a distributed system. And distributed systems have rules that can't be ignored.
This guide covers the fundamentals that every engineer needs — from CAP theorem to consensus algorithms, explained with practical examples.
## The Eight Fallacies of Distributed Computing
In 1994, Peter Deutsch listed assumptions that developers mistakenly make about networks (James Gosling added the eighth in 1997). They're all wrong:
- The network is reliable — packets get dropped, connections timeout
- Latency is zero — cross-region calls add 50-200ms
- Bandwidth is infinite — large payloads cause congestion
- The network is secure — encrypt everything, trust nothing
- Topology doesn't change — servers scale up, down, and fail
- There is one administrator — teams manage different parts
- Transport cost is zero — data transfer has real cost (especially cloud egress)
- The network is homogeneous — different protocols, versions, and configurations
Implication: Design for failure from day one.
## CAP Theorem
A distributed system can guarantee at most two of the following three properties:
- Consistency — every read returns the most recent write
- Availability — every request gets a response (even if stale)
- Partition tolerance — system works despite network splits
Since network partitions are unavoidable, the real choice is:
| System | Choice | Example |
|---|---|---|
| CP | Consistency + Partition tolerance | ZooKeeper, etcd, Spanner |
| AP | Availability + Partition tolerance | Cassandra, DynamoDB, CouchDB |
CA (Consistency + Availability) only works on a single machine — not distributed.
### In Practice
Most systems aren't purely CP or AP. They're tunable:
- DynamoDB: configurable `ConsistentRead` per query
- Cassandra: tunable consistency level per query (ONE, QUORUM, ALL)
- PostgreSQL replicas: sync (CP) or async (AP) replication
## Consistency Models
From strongest to weakest:
### 1. Strong Consistency (Linearizability)
Every read returns the most recent write, as if there were only one copy of the data.
Cost: Higher latency (consensus required), lower throughput.
Used by: Google Spanner, CockroachDB, etcd
### 2. Sequential Consistency
All operations appear in some sequential order. Operations from each client appear in the order they were issued.
### 3. Causal Consistency
Operations that are causally related are seen in order. Concurrent operations may be seen in any order.
Used by: MongoDB (causal sessions)
### 4. Eventual Consistency
All replicas converge to the same state... eventually. Reads may return stale data.
Cost: Lowest latency, highest throughput.
Used by: DynamoDB (default), Cassandra, DNS, CDN caches
### Which to Choose?
| Use Case | Consistency Level |
|---|---|
| Bank account balance | Strong |
| User profile updates | Causal or Eventual |
| Social media feed | Eventual |
| Inventory count (e-commerce) | Strong (or compensate) |
| Analytics counters | Eventual |
| Chat message ordering | Causal |
## Consensus Algorithms
How do multiple nodes agree on a value?
### Raft (Most Popular)
- Leader election — one node is the leader, others are followers
- Log replication — leader receives writes, replicates to followers
- Commitment — once majority (quorum) acknowledges, write is committed
```
Client → Leader → Follower 1 (ACK)
                → Follower 2 (ACK)   ← majority reached, committed!
                → Follower 3 (ACK)
```
Used by: etcd (Kubernetes), CockroachDB, Consul, TiKV
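The majority-commit rule above can be sketched in a few lines of Python. This is a toy model; real Raft also handles terms, leader elections, and log matching, and all names here are invented:

```python
class Leader:
    """Toy model of Raft's majority-commit rule."""

    def __init__(self, followers):
        # each follower is a callable: entry -> bool (True means it ACKed)
        self.followers = followers
        self.log = []
        self.commit_index = -1

    def append(self, entry) -> bool:
        """Replicate one entry; commit once a strict majority has it."""
        self.log.append(entry)
        acks = 1 + sum(1 for f in self.followers if f(entry))  # leader counts itself
        if acks > (1 + len(self.followers)) // 2:              # strict majority
            self.commit_index = len(self.log) - 1
            return True
        return False

up = lambda entry: True     # healthy follower
down = lambda entry: False  # crashed follower

# 5-node cluster with 2 followers down: 3 of 5 acks is still a majority
leader = Leader([up, up, down, down])
print(leader.append("x=1"))  # True
```

The quorum arithmetic is the whole point: a 5-node cluster tolerates 2 failures, a 3-node cluster tolerates 1.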
### Paxos
Theoretical foundation for consensus. More complex than Raft but equivalent in capability. Raft was designed to be "understandable Paxos."
Used by: Google Chubby, Spanner (Multi-Paxos)
## Distributed Transactions
### Two-Phase Commit (2PC)
A coordinator ensures all participants commit or all abort:
```
Phase 1 (Prepare): Coordinator → "Can you commit?"
                   Participant A → "Yes"
                   Participant B → "Yes"
Phase 2 (Commit):  Coordinator → "Commit!"
                   Participant A → committed
                   Participant B → committed
```
Problem: If coordinator fails after Phase 1, participants are stuck (blocking protocol).
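A minimal sketch of the two phases (class and function names are invented; a real coordinator must also durably log its decision, which is exactly the part that blocks participants when it crashes mid-protocol):

```python
class Participant:
    """Toy resource manager for the sketch."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "init"
    def prepare(self) -> bool:    # Phase 1: vote yes/no, hold locks
        return self.can_commit
    def commit(self):             # Phase 2: make the change durable
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants) -> str:
    """Toy 2PC coordinator: commit everywhere only if everyone votes yes."""
    if all(p.prepare() for p in participants):  # Phase 1
        for p in participants:
            p.commit()                          # Phase 2
        return "committed"
    for p in participants:                      # any "no" vote aborts everywhere
        p.abort()
    return "aborted"

print(two_phase_commit([Participant(), Participant()]))                  # committed
print(two_phase_commit([Participant(), Participant(can_commit=False)]))  # aborted
```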
### Saga Pattern
Break the transaction into local transactions with compensating actions:
1. Order Service: create order ✓
2. Payment Service: charge card ✓
3. Inventory Service: reserve stock ✗ (failed!)
4. Compensate: Payment Service: refund card
5. Compensate: Order Service: cancel order
Better for microservices — no distributed locks, no coordinator bottleneck.
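The compensation flow above can be sketched as a small executor (names are invented; a production saga also needs to persist its progress and retry the compensations themselves):

```python
def run_saga(steps):
    """Toy saga executor: each step is (action, compensation). On failure,
    run the compensations for completed steps in reverse order."""
    done = []
    for action, compensate in steps:
        try:
            action()
            done.append(compensate)
        except Exception:
            for undo in reversed(done):   # roll back what already happened
                undo()
            return "compensated"
    return "completed"

log = []

def reserve_stock():
    raise RuntimeError("out of stock")    # step 3 fails

steps = [
    (lambda: log.append("order created"), lambda: log.append("order cancelled")),
    (lambda: log.append("card charged"),  lambda: log.append("card refunded")),
    (reserve_stock,                       lambda: None),
]
print(run_saga(steps))  # compensated
print(log)  # ['order created', 'card charged', 'card refunded', 'order cancelled']
```

Note the reverse order: the refund runs before the cancellation, mirroring steps 4 and 5 above.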
## Failure Modes
### Crash Failure
Server stops completely. Easiest to handle — health checks detect it, load balancer routes around it.
### Byzantine Failure
Server sends wrong/contradictory information. Hardest to handle — need Byzantine fault tolerance (BFT). Rare in controlled environments, common in blockchain.
### Partial Failure
Some components fail while others work. The defining characteristic of distributed systems.
Design principle: Always handle partial failure. Never assume all-or-nothing.
## Key Patterns
### Circuit Breaker
Stop calling a failing service to prevent cascade failures:
```
Closed (normal) → errors exceed threshold → Open (fail fast)
Open            → timer expires           → Half-Open (try one request)
Half-Open       → success                 → Closed
Half-Open       → failure                 → Open
```
Tools: Resilience4j, Polly, Istio
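A minimal state-machine sketch of the transitions above (thresholds, timeouts, and names are all invented here; libraries like Resilience4j provide hardened versions):

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: Closed -> Open after `threshold` consecutive
    failures; Open -> Half-Open after `reset_after` seconds; one trial
    request then closes on success or re-opens on failure."""

    def __init__(self, threshold=3, reset_after=30.0, clock=time.monotonic):
        self.threshold, self.reset_after, self.clock = threshold, reset_after, clock
        self.failures, self.state, self.opened_at = 0, "closed", 0.0

    def call(self, fn):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.state = "half-open"          # timer expired: allow one trial
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.threshold:
                self.state, self.opened_at = "open", self.clock()
            raise
        self.state, self.failures = "closed", 0
        return result

t = [0.0]  # fake clock so the sketch is deterministic
cb = CircuitBreaker(threshold=2, reset_after=5.0, clock=lambda: t[0])

def flaky():
    raise IOError("service down")

for _ in range(2):                # two consecutive failures trip the breaker
    try:
        cb.call(flaky)
    except IOError:
        pass
print(cb.state)                   # open
t[0] = 6.0                        # reset timer expires
print(cb.call(lambda: "ok"))      # ok (half-open trial succeeded)
print(cb.state)                   # closed
```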
### Retry with Exponential Backoff
```
Attempt 1: wait 100ms
Attempt 2: wait 200ms
Attempt 3: wait 400ms
Attempt 4: wait 800ms (+ jitter)
```
Always add jitter — without it, all clients retry at the same time (thundering herd).
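A sketch of the schedule above using "full jitter", where each wait is drawn uniformly between zero and the exponential ceiling (function name and defaults are mine; the 100ms/200ms/400ms schedule above is the no-jitter ceiling):

```python
import random

def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0):
    """Exponential backoff with full jitter: attempt n waits a uniformly
    random time in [0, min(cap, base * 2**n)], so retrying clients spread
    out instead of arriving in synchronized waves."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]

for n, delay in enumerate(backoff_delays(4), start=1):
    print(f"Attempt {n}: sleeping {delay * 1000:.0f}ms")
```

The `cap` matters in practice: without it, attempt 10 would wait nearly two minutes.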
### Idempotency
Same request sent multiple times produces the same result.
```
POST /payments { idempotency_key: "abc123", amount: 100 }
→ First call:  charges $100, returns payment_id
→ Second call: returns same payment_id (no double charge)
```
Critical for: Payments, order creation, any operation with side effects.
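The flow above can be sketched with an in-memory key-to-result table (names are invented; a real service stores keys durably and typically rejects a retry whose payload differs from the original):

```python
class PaymentService:
    """Toy idempotency-key deduplication: the first request with a given
    key performs the side effect and caches the result; retries replay
    the cached result instead of charging again."""

    def __init__(self):
        self._results = {}   # idempotency_key -> payment_id
        self._next_id = 0

    def charge(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key in self._results:       # retry: replay stored result
            return self._results[idempotency_key]
        self._next_id += 1                         # first call: do the side effect
        payment_id = f"pay_{self._next_id}"
        self._results[idempotency_key] = payment_id
        return payment_id

svc = PaymentService()
print(svc.charge("abc123", 100))  # pay_1
print(svc.charge("abc123", 100))  # pay_1 (same id, no double charge)
```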
### Bulkhead
Isolate components so one failure doesn't take down everything:
```
Thread Pool A (20 threads) → Payment Service
Thread Pool B (10 threads) → Email Service
Thread Pool C (5 threads)  → Analytics Service
```
If Analytics is slow, only Pool C is affected.
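A bulkhead can also be built from a counting semaphore that caps in-flight calls to one dependency (names are invented; the thread-pool-per-dependency layout above is the other common form):

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency so a slow service cannot
    exhaust shared resources."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Reject immediately instead of queueing when the pool is full:
        # queued callers would just pile up behind the slow dependency.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")
        try:
            return fn()
        finally:
            self._slots.release()

analytics = Bulkhead(1)

def while_one_call_is_in_flight():
    # We are inside analytics.call here, so the single slot is taken
    try:
        analytics.call(lambda: "inner")
    except RuntimeError as exc:
        return str(exc)

print(analytics.call(while_one_call_is_in_flight))  # bulkhead full: rejecting call
```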
## Clocks and Ordering
Distributed systems have no global clock. Two approaches:
### Logical Clocks (Lamport Timestamps)
Increment a counter with every event. If A→B (A happened before B), then timestamp(A) < timestamp(B).
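A minimal Lamport clock following the two rules: increment on every local event (including sends), and on receive jump past the sender's timestamp:

```python
class LamportClock:
    """Lamport timestamp. Guarantees A -> B implies ts(A) < ts(B),
    but not the converse: equal or ordered timestamps alone don't
    prove causality."""

    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or message send."""
        self.time += 1
        return self.time

    def receive(self, sender_ts: int) -> int:
        """Message receipt: jump past the sender's timestamp."""
        self.time = max(self.time, sender_ts) + 1
        return self.time

a, b = LamportClock(), LamportClock()
send_ts = a.tick()            # A sends a message at ts=1
recv_ts = b.receive(send_ts)  # B receives it at ts=2
print(send_ts < recv_ts)      # True: the send is ordered before the receive
```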
### Vector Clocks
Each node maintains a vector of counters. Can detect concurrent events (neither happened before the other).
Used by: DynamoDB (conflict detection), Riak
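The comparison rule can be sketched directly: one clock happened before another iff every component is less than or equal and at least one is strictly less; if neither dominates, the events are concurrent (function name is mine):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks (node -> counter). Returns 'before',
    'after', 'equal', or 'concurrent'; 'concurrent' means a conflict
    the application must resolve."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

print(compare({"n1": 1}, {"n1": 2}))                     # before
print(compare({"n1": 2, "n2": 0}, {"n1": 1, "n2": 1}))   # concurrent
```

The last case is what a Lamport timestamp cannot express: neither write saw the other, so the store must keep both versions or merge them.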
### Hybrid Logical Clocks (HLC)
Combine physical time with logical counters for causally-ordered timestamps.
Used by: CockroachDB, YugabyteDB
## Architecture Example
### Distributed E-Commerce
```
Client → API Gateway (L7 LB)
         → Order Service        → PostgreSQL (primary + replica)
         → Payment Service      → Stripe API
         → Inventory Service    → Redis + PostgreSQL
         → Notification Service → Kafka → Email/Push workers

Cross-cutting:
         → Service Mesh (Envoy) for circuit breaking, retries, mTLS
         → Distributed Tracing (Jaeger) for request tracking
         → Centralized Logging (ELK) for debugging
```
## Summary
| Concept | Key Takeaway |
|---|---|
| CAP Theorem | Choose CP or AP per use case, not globally |
| Consistency | Strong for money, eventual for feeds |
| Consensus | Raft is the default — use etcd or similar |
| Transactions | Sagas over 2PC for microservices |
| Failures | Design for partial failure, always |
| Retries | Exponential backoff + jitter + idempotency |
| Circuit Breaker | Fail fast to prevent cascades |
| Clocks | No global time — use logical or vector clocks |
Design distributed systems at codelit.io — generate interactive architecture diagrams, simulate failures with Chaos Mode, and export infrastructure code.