# 10 System Design Mistakes That Sink Your Interview
System design interviews reward clear thinking and practical trade-off analysis. Yet candidates repeatedly fall into the same traps — not because they lack knowledge, but because they skip fundamentals while chasing complexity. Here are the ten most common mistakes and how to avoid every one of them.
## 1. Over-Engineering
The most frequent mistake is designing for a scale the system will never reach. The prompt calls for 1,000 daily active users, and the candidate immediately proposes Kafka, Kubernetes, and a microservice mesh.
### The Problem
Over-engineering adds operational complexity without delivering value. Every additional component is another failure surface, another dependency to version, and another concept the team must understand.
### The Fix
- Start simple. A monolith with a single database handles most MVPs.
- Identify bottlenecks first. Only add complexity where the numbers demand it.
- State your assumptions. "At 1K DAU we don't need message queues, but at 1M DAU we would add Kafka here."
1K DAU → Monolith + PostgreSQL + Redis cache
100K DAU → Load balancer + read replicas + CDN
1M DAU → Microservices + message queue + sharding
Interviewers reward candidates who scale incrementally rather than starting at planet scale.
## 2. Premature Optimization
Premature optimization is over-engineering's cousin. It shows up when candidates spend interview minutes optimizing a query path that handles 10 requests per second.
### The Problem
Time spent optimizing non-bottlenecks is time not spent on architecture, failure modes, and trade-offs — the things interviewers actually score.
### The Fix
- Profile before optimizing. Use back-of-the-envelope math to identify the actual bottleneck.
- Optimize the critical path. If 80% of traffic hits one endpoint, optimize that endpoint.
- Name the trade-off. "I could add a bloom filter here, but at our current scale a simple index suffices."
## 3. Ignoring the CAP Theorem
Candidates design distributed systems that implicitly assume strong consistency, high availability, and partition tolerance — all three simultaneously.
### The Problem
The CAP theorem says that during a network partition you must choose between consistency and availability; you cannot preserve both. Ignoring this leads to architectures that silently lose data or become unavailable in production.
### The Trade-offs
| Choice | Behavior During Partition | Example Systems |
|---|---|---|
| CP | Available nodes reject writes to preserve consistency | ZooKeeper, HBase, MongoDB (default) |
| AP | All nodes accept writes; conflicts resolved later | Cassandra, DynamoDB, CouchDB |
### The Fix
- State your consistency model explicitly. "This system is AP — we accept eventual consistency for availability."
- Design conflict resolution. Last-write-wins, vector clocks, or application-level merge.
- Separate concerns. Payment processing needs CP. Social feed can be AP.
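Last-write-wins, the simplest of the conflict-resolution strategies mentioned above, can be sketched in a few lines. This toy version assumes reasonably synchronized timestamps; real AP stores such as Cassandra have to contend with clock skew, which is what vector clocks and hybrid logical clocks address:

```python
# Last-write-wins (LWW) merge of two diverged replicas.
# Each replica maps key -> (value, timestamp).

def resolve_lww(replica_a: dict, replica_b: dict) -> dict:
    """For each key, keep the version with the newer timestamp."""
    merged = dict(replica_a)
    for key, (value, ts) in replica_b.items():
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return merged

a = {"bio": ("hello", 100)}
b = {"bio": ("hi there", 105), "avatar": ("cat.png", 90)}
# "bio" takes b's newer write; "avatar" only exists on b
print(resolve_lww(a, b))
```

Note the trade-off LWW makes: the older concurrent write to `bio` is silently discarded, which is acceptable for a profile field but not for, say, a shopping cart.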
## 4. Single Points of Failure
A system with one database server, one load balancer, or one DNS provider has a single point of failure (SPOF) that takes down everything when it fails.
### The Problem
Every component fails eventually. Hardware dies, processes crash, networks partition. A SPOF means one failure cascades into total downtime.
### The Fix
BEFORE (SPOF):
Client ──▶ Single LB ──▶ Single Server ──▶ Single DB

AFTER (redundant):
Client ──▶ DNS (multi-provider)
       ──▶ LB pair (active/passive)
       ──▶ Server pool (N instances)
       ──▶ DB cluster (primary + replicas)
- Replicate every layer. Load balancers, application servers, databases, caches.
- Use health checks. Automatically route around failed instances.
- Test failover. A replica that has never been promoted is not a real backup.
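The health-check idea above can be sketched as a toy routing loop: the balancer only hands traffic to instances that passed their last check. The `check` predicate here is a stub; a real load balancer would probe an HTTP health endpoint with timeouts and consecutive-failure thresholds:

```python
# Toy health-check-based routing for a pool of app instances.

import itertools

class Pool:
    def __init__(self, instances):
        self.instances = instances
        self.healthy = set(instances)           # assume healthy at start
        self._rr = itertools.cycle(instances)   # round-robin order

    def run_health_checks(self, check):
        """Re-evaluate every instance with the given predicate."""
        self.healthy = {i for i in self.instances if check(i)}

    def route(self):
        """Return the next healthy instance, skipping failed ones."""
        for _ in range(len(self.instances)):
            candidate = next(self._rr)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances: total outage")

pool = Pool(["app-1", "app-2", "app-3"])
pool.run_health_checks(lambda i: i != "app-2")  # simulate app-2 dying
print(pool.route())  # never routes to app-2
```

The `RuntimeError` branch is the SPOF lesson in miniature: with one instance, any failure is a total outage; with a pool, the router simply skips the dead node.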
## 5. Missing Rate Limits
Candidates design APIs without any rate limiting, leaving the system vulnerable to abuse, accidental loops, and denial of service.
### The Problem
A single misbehaving client can exhaust database connections, fill queues, and starve legitimate users. Without rate limits, one bad actor takes down the entire service.
### The Fix
Implement rate limiting at multiple layers:
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌───────────┐
│  CDN /   │────▶│   API    │────▶│ Per-User │────▶│   Per-    │
│   Edge   │     │ Gateway  │     │ Limiter  │     │ Resource  │
│  (DDoS)  │     │ (global) │     │  (token  │     │  Limiter  │
└──────────┘     └──────────┘     │  bucket) │     └───────────┘
                                  └──────────┘
- Token bucket for per-user limits (smooth, allows bursts).
- Fixed window for global limits (simple, predictable).
- Return 429 with a Retry-After header so clients back off gracefully.
- Rate limit by identity, not just IP, since IPs are shared behind NATs.
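A token bucket, as used for the per-user limiter above, fits in a dozen lines. A minimal sketch (the capacity and refill rate are arbitrary example values):

```python
# Token-bucket rate limiter: tokens refill at a steady rate and
# bursts up to the bucket capacity are allowed.

import time

class TokenBucket:
    def __init__(self, capacity: float, refill_per_sec: float):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity          # start full: allow an initial burst
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should respond 429 with a Retry-After header

bucket = TokenBucket(capacity=5, refill_per_sec=1)
results = [bucket.allow() for _ in range(7)]
print(results)  # first 5 allowed (the burst), the rest denied
```

In production this state lives in a shared store such as Redis so all gateway instances see the same counts; the in-memory version above only limits a single process.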
## 6. No Monitoring or Observability
Designing a system without mentioning monitoring is like building a car without a dashboard. You have no idea how fast you are going or when you are about to crash.
### The Problem
Without observability, failures are detected by users, not engineers. Mean time to detection (MTTD) skyrockets, and debugging requires guesswork.
### The Fix
Cover the three pillars of observability:
| Pillar | Purpose | Tools |
|---|---|---|
| Metrics | Track throughput, latency, error rates, saturation | Prometheus, Datadog, CloudWatch |
| Logs | Record discrete events for debugging | ELK stack, Loki, CloudWatch Logs |
| Traces | Follow a request across services | Jaeger, Zipkin, OpenTelemetry |
Mention alerting too — dashboards that nobody watches are useless. Set alerts on SLO thresholds: "If p99 latency exceeds 500ms for 5 minutes, page the on-call engineer."
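The quoted alert rule reduces to simple logic. In practice it would live in Prometheus or Datadog, not application code, but a toy version makes the "for 5 minutes" part concrete (the percentile calculation here is deliberately simplistic):

```python
# Toy SLO alert check: page if p99 latency exceeds the threshold
# for `window_min` consecutive minutes.

def p99(samples_ms):
    """Crude p99: the value at the 99th-percentile index."""
    ordered = sorted(samples_ms)
    return ordered[int(len(ordered) * 0.99)]

def should_page(per_minute_samples, threshold_ms=500, window_min=5):
    breaches = [p99(minute) > threshold_ms for minute in per_minute_samples]
    # Page only when the most recent `window_min` minutes all breached,
    # so a single slow minute doesn't wake anyone up.
    return len(breaches) >= window_min and all(breaches[-window_min:])

slow = [[600] * 100] * 5   # five straight minutes of 600ms latencies
fast = [[100] * 100] * 5   # five healthy minutes
print(should_page(slow), should_page(fast))  # True False
```

The consecutive-window requirement is the important design choice: alerting on any single breach produces pager fatigue, while too long a window inflates time to detection.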
## 7. Tight Coupling
Tight coupling means one service cannot be deployed, scaled, or understood without another. It turns a distributed system into a distributed monolith.
### The Problem
- Deploying service A requires coordinated deployment of service B.
- A failure in service B cascades into service A.
- Teams cannot work independently.
### The Fix
- Communicate through contracts, not shared databases. Use APIs, events, or message queues.
- Apply the dependency inversion principle. Services depend on abstractions (interfaces), not concrete implementations.
- Use asynchronous messaging for non-critical paths. If the notification service is down, the order service should still process orders.
TIGHT COUPLING:
Order Service ──▶ direct DB read ──▶ Inventory DB
LOOSE COUPLING:
Order Service ──▶ Inventory API ──▶ Inventory Service ──▶ Inventory DB
      │
      └──▶ Event Bus ──▶ Notification Service
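The event-bus branch in the loose-coupling diagram can be sketched with a toy in-process dispatcher. A real system would use Kafka, RabbitMQ, or SNS/SQS and deliver asynchronously; the point here is only the shape of the decoupling, and all names (`order.placed`, `place_order`) are illustrative:

```python
# Minimal in-process event bus: publishers don't know who subscribes.

from collections import defaultdict

class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self.subscribers[topic]:
            handler(payload)  # a real bus delivers asynchronously

bus = EventBus()
sent = []
bus.subscribe("order.placed", lambda e: sent.append(f"notify {e['user']}"))

def place_order(user):
    # ...write the order to the order service's own database...
    bus.publish("order.placed", {"user": user})  # fire-and-forget

place_order("alice")
print(sent)  # ['notify alice']
```

Because the order service only emits an event, removing the notification subscriber entirely leaves `place_order` working unchanged, which is exactly the deploy-independently property tight coupling destroys.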
## 8. Skipping the Cache
Candidates go straight to database scaling (sharding, read replicas) without first adding a caching layer that could eliminate 80% of database reads.
### The Problem
Databases are optimized for durability, not read latency. A cache hit returns in sub-millisecond time; a database query takes 5–50ms. At scale, that gap separates a responsive app from a sluggish one.
### The Fix
Apply caching at every layer:
| Layer | What to Cache | TTL |
|---|---|---|
| CDN | Static assets, public API responses | Minutes to hours |
| Application | Session data, feature flags | Seconds to minutes |
| Database | Query results, computed aggregates | Seconds |
| Client | API responses, images | Varies |
Use cache-aside (lazy loading) for most cases. Use write-through when you need strong consistency between cache and database. Always plan for cache invalidation — it is genuinely one of the hardest problems in computer science.
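Cache-aside is short enough to sketch end to end. Here a plain dict stands in for Redis and `db_query` stands in for a real database call; both are stand-ins, not real APIs:

```python
# Cache-aside (lazy loading): check the cache, fall back to the DB
# on a miss, then populate the cache with a TTL.

import time

cache = {}        # key -> (value, expires_at); stand-in for Redis
TTL_SECONDS = 60  # illustrative TTL

def db_query(user_id):
    return {"id": user_id, "name": f"user-{user_id}"}  # pretend DB hit

def get_user(user_id):
    entry = cache.get(user_id)
    if entry and entry[1] > time.monotonic():
        return entry[0]                       # cache hit
    value = db_query(user_id)                 # cache miss: go to the DB
    cache[user_id] = (value, time.monotonic() + TTL_SECONDS)
    return value

print(get_user(42))  # miss: reads the DB and fills the cache
print(get_user(42))  # hit: served from the cache
```

The TTL is the crude form of invalidation: stale data expires on its own. Explicit invalidation (deleting the key when the underlying row changes) is where the famous difficulty lives.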
## 9. Wrong Database Choice
Choosing a database without analyzing the access patterns leads to performance problems that no amount of optimization can fix.
### The Problem
| Anti-Pattern | Consequence |
|---|---|
| Relational DB for social graph traversals | Expensive recursive joins |
| Document DB for complex transactions | Limited or costly cross-document transactions |
| Key-value store for ad-hoc queries | No secondary indexes |
| Single database for all workloads | Read/write contention, schema bloat |
### The Fix
Match the database to the workload:
Structured data + transactions → PostgreSQL / MySQL
Document-oriented + flexible schema → MongoDB / DynamoDB
Graph traversals → Neo4j / Neptune
Time-series data → TimescaleDB / InfluxDB
Full-text search → Elasticsearch / OpenSearch
Caching + ephemeral data → Redis / Memcached
In interviews, it is perfectly valid — and often optimal — to use multiple databases. "We use PostgreSQL for orders (ACID), Redis for sessions (speed), and Elasticsearch for product search (full-text)" demonstrates mature thinking.
## 10. No Disaster Recovery Plan
Candidates design systems that work perfectly under normal conditions but have no plan for regional outages, data corruption, or catastrophic failures.
### The Problem
Without disaster recovery (DR), a single region outage means total downtime. Data corruption without backups means permanent data loss.
### The Fix
- Define RPO and RTO. Recovery Point Objective (how much data loss is acceptable) and Recovery Time Objective (how long downtime is acceptable).
- Automate backups. Daily snapshots, continuous replication, and cross-region copies.
- Test recovery regularly. A backup that has never been restored is not a backup.
- Design for multi-region when the business requires high availability.
RPO = 0 (no data loss) → Synchronous cross-region replication
RPO = 1 hour → Asynchronous replication + hourly snapshots
RPO = 24 hours → Daily backups to a separate region
RTO = minutes → Active-active multi-region
RTO = hours → Warm standby in secondary region
RTO = days → Cold backups + manual restoration
## Quick Reference Checklist
Before you finish any system design answer, scan this list:
- Did I start simple and scale incrementally?
- Did I identify the actual bottleneck before optimizing?
- Did I state my consistency model (CP vs AP)?
- Is every component redundant?
- Did I add rate limiting?
- Did I mention monitoring, logging, and alerting?
- Are my services loosely coupled?
- Did I add caching before scaling the database?
- Did I choose the right database for each workload?
- Did I address disaster recovery (RPO/RTO)?
## Key Takeaways
These ten mistakes share a common root: skipping trade-off analysis. System design interviews are not about building the most complex architecture. They are about demonstrating that you understand the trade-offs at every layer and can make deliberate, justified decisions.
Start simple. Add complexity only where the requirements demand it. Explain every choice. That is what separates a senior engineer from someone who memorized architecture diagrams.
This is article #380 in the Codelit system design series. Want to level up your system design skills? Explore the full collection at codelit.io.