# System Design Tradeoffs: The Complete Guide to Engineering Decisions
Every architecture decision is a tradeoff. Senior engineers are not defined by knowing the "right" answer — they are defined by knowing what they are giving up with every choice. This guide covers the fundamental tradeoffs you will face in system design and interviews.
## Consistency vs Availability (CAP Theorem)
The CAP theorem states that in the presence of a network partition, a distributed system must choose between consistency (every read returns the most recent write) and availability (every request gets a response).
| Choose Consistency | Choose Availability |
|---|---|
| Banking, payments | Social media feeds |
| Inventory counts | User sessions |
| Leader election | DNS |
| Distributed locks | Shopping carts |
In practice: Network partitions are rare but real. Most systems choose availability by default and use eventual consistency, accepting a small window where reads may be stale.
Strong consistency:

```
Write → replicate to all nodes → acknowledge
```

Tradeoff: higher latency, lower availability.

Eventual consistency:

```
Write → acknowledge → replicate asynchronously
```

Tradeoff: stale reads are possible, and conflict resolution is needed.
Interview tip: Never say "I'd use CP" or "I'd use AP" without explaining why for that specific use case. The answer always depends on the business requirement.
## Latency vs Throughput
You can optimize for fast individual requests (latency) or maximum total requests per second (throughput) — rarely both.
Optimizing for latency:
- In-memory caches (Redis)
- Connection pooling
- Edge computing / CDNs
- Synchronous processing
- Fewer network hops
Optimizing for throughput:
- Batch processing
- Async message queues
- Horizontal scaling
- Buffer and flush patterns
- Larger batch sizes
Example tradeoff: A payment API needs low latency (users are waiting). A report generator needs high throughput (process millions of records). Same company, different optimizations.
Low latency path:

```
User → API → cache hit → response (5 ms)
```

High throughput path:

```
Queue → batch worker → process 1,000 records → write batch (50 ms total, 0.05 ms per record)
```
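The buffer-and-flush pattern behind that high-throughput path fits in a few lines. This is a sketch (class name and batch size are illustrative): per-record latency gets worse, because records sit in the buffer, but the fixed per-call cost of a write is amortized across the whole batch:

```python
class BatchWriter:
    """Buffer-and-flush: trade per-record latency for total throughput.
    Records wait in a buffer until batch_size is reached, then a single
    write call handles the whole batch, amortizing per-call overhead."""
    def __init__(self, batch_size=1000, write_fn=None):
        self.batch_size = batch_size
        self.buffer = []
        self.write_fn = write_fn or (lambda batch: None)
        self.flushes = 0   # how many write calls we actually made

    def add(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.write_fn(self.buffer)   # one call for many records
            self.flushes += 1
            self.buffer = []

written = []
writer = BatchWriter(batch_size=1000, write_fn=written.extend)
for i in range(2500):
    writer.add(i)
writer.flush()           # drain the partial final batch
print(writer.flushes)    # 3 write calls instead of 2,500
```

Production versions also flush on a timer, so a half-full buffer never waits forever; that timer is itself a latency/throughput knob.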
## SQL vs NoSQL
This is not about technology preference — it is about data access patterns.
| Factor | SQL (PostgreSQL, MySQL) | NoSQL (MongoDB, DynamoDB) |
|---|---|---|
| Schema | Rigid, enforced | Flexible, schema-on-read |
| Relationships | Joins are natural | Denormalization required |
| Scaling | Vertical (primarily) | Horizontal (designed for it) |
| Consistency | ACID by default | Tunable, often eventual |
| Query flexibility | Ad-hoc queries, aggregations | Optimized for known access patterns |
| Best for | Complex relationships, transactions | High write volume, simple lookups |
The real question: How will you query this data?
- If you need flexible queries across relationships → SQL
- If you have massive scale with simple key-value lookups → NoSQL
- If you need both → use both (polyglot persistence)
Common mistake: Choosing NoSQL because "it scales" when your data is deeply relational. You end up reimplementing joins in application code.
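Here is what reimplementing a join in application code looks like, using toy in-memory "tables" (all data hypothetical) standing in for documents in a key-value store. One SQL join becomes a loop with a lookup per row, the classic N+1 pattern, with no referential integrity to catch dangling references:

```python
# Toy key-value "tables", as a document store might hold them.
users = {1: {"name": "Ada"}, 2: {"name": "Grace"}}
orders = [
    {"id": 10, "user_id": 1, "total": 40},
    {"id": 11, "user_id": 2, "total": 25},
    {"id": 12, "user_id": 1, "total": 15},
]

def orders_with_names(orders, users):
    """Application-level 'join'. In SQL this is a single query:
    SELECT o.id, u.name, o.total
    FROM orders o JOIN users u ON u.id = o.user_id;"""
    result = []
    for o in orders:
        user = users[o["user_id"]]   # one lookup per order; KeyError if
        result.append({              # the reference dangles
            "order": o["id"],
            "name": user["name"],
            "total": o["total"],
        })
    return result

print(orders_with_names(orders, users))
```

Three rows are harmless; three million rows fanned out over a network per lookup is the scaling problem you thought you were avoiding.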
## Monolith vs Microservices
| Factor | Monolith | Microservices |
|---|---|---|
| Complexity | Low (one codebase) | High (distributed system) |
| Deployment | All-or-nothing | Independent per service |
| Scaling | Scale everything together | Scale individual services |
| Data consistency | Transactions are easy | Saga pattern, eventual consistency |
| Team scaling | Harder past ~20 devs | Independent team ownership |
| Debugging | Stack traces | Distributed tracing |
| Latency | Function calls (nanoseconds) | Network calls (milliseconds) |
The progression most successful companies follow:
Monolith → modular monolith → extract high-value services → selective microservices
Do NOT start with microservices unless you have a large team with well-defined domain boundaries. The operational overhead is enormous.
Interview tip: If the interviewer asks you to design a system from scratch, start with a monolith and explain which parts you would extract as services and why.
## Sync vs Async
Synchronous: Caller waits for the response. Simple, predictable, easy to debug.
Asynchronous: Caller sends message and moves on. Decoupled, resilient, higher throughput.
```
Sync:  User → API → process → database → response to user (500 ms)
Async: User → API → enqueue → response "accepted" (50 ms)
       Worker → dequeue → process → database (in background)
```
Use sync when:
- The user needs the result immediately
- The operation is fast (under 200ms)
- Failure must be communicated instantly
Use async when:
- The operation is slow (sending email, generating reports)
- You need to absorb traffic spikes (queue acts as buffer)
- Services need to be decoupled
- Retries are needed (dead-letter queues)
The hybrid approach: Accept the request synchronously, process it asynchronously, notify via webhook or polling when done.
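A minimal sketch of the hybrid approach, with an in-process `queue.Queue` standing in for a real message broker and a dict standing in for a status store (all names hypothetical). The submit path is synchronous and fast; the work happens on a background worker; the client polls by job id:

```python
import queue
import threading
import uuid

jobs = queue.Queue()
status = {}  # job_id -> "pending" | "done"

def submit(payload):
    """Sync part: accept quickly, return a job id the client can poll.
    The HTTP analogue is responding 202 Accepted."""
    job_id = str(uuid.uuid4())
    status[job_id] = "pending"
    jobs.put((job_id, payload))
    return {"status": "accepted", "job_id": job_id}

def worker():
    """Async part: drain the queue in the background."""
    while True:
        job_id, payload = jobs.get()
        # ... slow work here: send the email, generate the report ...
        status[job_id] = "done"
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

resp = submit({"report": "monthly"})
print(resp["status"])           # "accepted", returned immediately
jobs.join()                     # demo only: wait for the background work
print(status[resp["job_id"]])   # "done"
```

In production the webhook replaces polling: the worker calls back to the client when it flips the status, and a dead-letter queue catches jobs that keep failing.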
## Simplicity vs Flexibility
The most underrated tradeoff. Every abstraction layer adds flexibility but also complexity.
```
Simple:       hardcoded config → works now, painful to change
Flexible:     plugin system    → works for everything, painful to understand
Right amount: config file      → covers 90% of cases, readable
```
YAGNI (You Ain't Gonna Need It): Build for today's requirements. Refactor when new requirements actually arrive. Premature flexibility is a form of technical debt.
Example: A feature flag system.
- Simple: `if (userId in betaUsers)` checks, which work for 2 flags
- Over-engineered: a custom DSL with a rule engine, which works for 2,000 flags
- Right-sized: a key-value store with percentage rollout, which covers most real use cases
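The right-sized percentage rollout is small enough to show in full. This sketch (function name is illustrative) hashes the flag and user id into a bucket from 0 to 99, so a given user gets a stable answer across requests, and raising the percentage only ever adds users to the rollout:

```python
import hashlib

def in_rollout(flag: str, user_id: str, percentage: int) -> bool:
    """Deterministic percentage rollout: hash (flag, user) into a 0-99
    bucket. Stable per user, and monotonic as percentage increases."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percentage

# Same user, same flag, same answer every time:
assert in_rollout("new-checkout", "user-42", 50) == in_rollout("new-checkout", "user-42", 50)
# Boundary behavior: 0% excludes everyone, 100% includes everyone.
assert not in_rollout("new-checkout", "user-42", 0)
assert in_rollout("new-checkout", "user-42", 100)
```

Including the flag name in the hash keeps rollouts independent: the same user can be in the 30% for one flag and out of the 30% for another.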
## Cost vs Performance
Cloud costs rise steeply as latency targets tighten; each improvement costs more than the last:
```
 10 ms response → $500/month (in-memory cache, beefy instances)
 50 ms response → $100/month (standard instances, disk-based)
200 ms response →  $30/month (minimal resources, cold starts OK)
```
Questions to ask:
- What latency does the user actually perceive? (Below 100ms feels instant.)
- What is the cost of an outage vs the cost of over-provisioning?
- Can you use spot/preemptible instances for batch workloads?
- Is caching cheaper than scaling compute?
## Consistency vs Performance (Caching)
Caches make systems fast but introduce stale data:
| Strategy | Consistency | Performance | Complexity |
|---|---|---|---|
| No cache | Perfect | Worst | None |
| Cache-aside (TTL) | Eventual | Good | Low |
| Write-through | Strong | Good | Medium |
| Write-behind | Eventual | Best | High |
| Cache invalidation | Strong | Good | High |
The two hard problems in computer science: cache invalidation and naming things. If strong consistency matters, prefer write-through or explicit invalidation over TTL.
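A minimal cache-aside sketch with TTL and explicit invalidation (class and names are illustrative, with a plain dict standing in for both the cache and the database). It makes the staleness window concrete: within the TTL, reads return whatever was cached, regardless of what the database now says, until something invalidates the key:

```python
import time

class CacheAside:
    """Cache-aside with TTL: read through the cache, fall back to the
    source of truth on a miss, and accept staleness up to ttl seconds."""
    def __init__(self, load_fn, ttl=60.0):
        self.load_fn = load_fn    # loads from the database on a miss
        self.ttl = ttl
        self.store = {}           # key -> (value, expires_at)

    def get(self, key):
        entry = self.store.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                       # fresh hit
        value = self.load_fn(key)                 # miss or expired: reload
        self.store[key] = (value, time.monotonic() + self.ttl)
        return value

    def invalidate(self, key):
        """Explicit invalidation: call on writes when staleness within
        the TTL window is not acceptable."""
        self.store.pop(key, None)

db = {"user:1": "Ada"}
cache = CacheAside(load_fn=db.get, ttl=60.0)
print(cache.get("user:1"))   # "Ada", loaded from the database
db["user:1"] = "Grace"
print(cache.get("user:1"))   # still "Ada": stale until TTL or invalidation
cache.invalidate("user:1")
print(cache.get("user:1"))   # "Grace"
```

Write-through removes the staleness window by updating cache and database together, at the cost of putting the cache on the write path.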
## How to Discuss Tradeoffs in Interviews
A framework for any system design question:
1. State the tradeoff explicitly: "We could use a relational database for strong consistency, or a document store for simpler horizontal scaling. Let me evaluate both."
2. Connect to requirements: "Since the requirements mention high write throughput with simple lookups, a document store fits better here."
3. Acknowledge what you are giving up: "The tradeoff is that cross-entity queries become harder. We'd handle that with a separate read model or search index."
4. Propose a migration path: "We can start with PostgreSQL and move hot paths to DynamoDB if we hit scaling limits."
Never say: "We should use X because it is the best." Always say: "X fits here because of Y, and we accept the tradeoff of Z."
## The Meta-Tradeoff
Every tradeoff comes down to one question: what is the cost of being wrong?
- If wrong about consistency → data corruption, financial loss
- If wrong about availability → users see errors, revenue loss
- If wrong about complexity → slow development, bugs
- If wrong about performance → user churn, scaling crisis
Reversible decisions (cache strategy, queue provider) — move fast, optimize later.
Irreversible decisions (database choice, service boundaries) — invest time upfront.
System design is not about memorizing solutions. It is about developing judgment for tradeoffs. The best engineers can articulate what they are giving up — and why it is worth it.
Article #274 of the Codelit engineering series. Browse all articles at codelit.io