# Latency Optimization Techniques: From P50 to P99
Latency is the time between a request and its response. In distributed systems, small latency improvements compound across millions of requests and directly affect user experience, conversion rates, and revenue. Every 100 ms of added latency carries a measurable business cost.
## Understanding Percentiles
Averages hide the worst experiences. Percentiles tell the real story:
- P50 (median) — Half of requests are faster than this. Represents the typical experience.
- P95 — 95% of requests are faster. The remaining 5% are your frustrated users.
- P99 — 99% of requests are faster. At scale, 1% means thousands of slow requests per minute.
- P99.9 — The long tail. Often 10x the median and caused by garbage collection, disk I/O, or network retries.
A service handling 10,000 requests per second with a P99 of 500ms means 100 requests every second take over half a second. If those requests belong to paying customers, the impact is real.
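Percentiles are easy to compute directly from a sample of request latencies. The sketch below uses synthetic, made-up latency values (a fast bulk plus a slow tail) purely for illustration:

```python
# Computing percentiles from a sample of request latencies (milliseconds).
# The latency distribution here is synthetic, for illustration only.
import random

random.seed(42)
# Simulate 10,000 requests: most are fast, 1% form a slow tail.
latencies = [random.gauss(80, 15) for _ in range(9900)] + \
            [random.gauss(600, 100) for _ in range(100)]

def percentile(samples, p):
    """Return the value below which roughly p percent of samples fall."""
    ordered = sorted(samples)
    index = int(len(ordered) * p / 100)
    return ordered[min(index, len(ordered) - 1)]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"P50: {p50:.0f} ms, P99: {p99:.0f} ms")
```

Note how the mean of this sample would sit near the median and completely hide the tail; only the P99 exposes it.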
## Tail Latency Amplification
In microservice architectures, a single user request fans out to multiple backend services. If any one of those calls hits its P99, the entire request is slow.
For a request that calls N independent services in parallel, the probability that at least one call lands in its slowest 1% is:

P(at least one slow) = 1 - (0.99)^N

- N = 5 → ≈5% of requests hit a tail
- N = 10 → ≈10% of requests hit a tail
- N = 50 → ≈39% of requests hit a tail
This is why tail latency matters more than median latency in distributed systems.
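The formula above is a one-liner to verify:

```python
# Probability that at least one of N parallel calls exceeds its own P99,
# assuming each call is independently slow with probability 1%.
def tail_probability(n, per_call_slow=0.01):
    return 1 - (1 - per_call_slow) ** n

for n in (5, 10, 50):
    print(f"N = {n:2d} -> {tail_probability(n):.1%} of requests hit a tail")
```

The independence assumption is optimistic: in practice, slow calls often correlate (shared hosts, GC pauses, hot shards), which makes the real tail worse.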
## Caching
Caching is the single most effective latency optimization. The fastest request is the one you never make.
Cache layers (ordered by proximity to the user):
- Browser cache — Static assets, API responses with Cache-Control headers.
- CDN cache — HTML, images, API responses at edge locations.
- Application cache — In-memory (local HashMap/LRU) or distributed (Redis, Memcached).
- Database query cache — Materialized views, query result caching.
Cache strategies:
- Cache-aside — Application checks cache first, falls back to database, then populates cache. Simple and flexible.
- Write-through — Writes go to cache and database simultaneously. Ensures cache consistency at the cost of write latency.
- Write-behind — Writes go to cache immediately, asynchronously flushed to database. Low write latency but risk of data loss.
- Read-through — Cache itself fetches from the database on a miss. Simplifies application logic.
Cache invalidation is the hard part. Common approaches:
- TTL-based expiration — simple but allows stale reads.
- Event-driven invalidation — publish cache-bust events on writes.
- Version keys — append a version number to cache keys; increment on write.
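A cache-aside read path with TTL-based expiration can be sketched in a few lines. This is a minimal single-process illustration over a plain dict; `fetch_from_db` is a hypothetical stand-in for a real database query, and a production version would use Redis or Memcached with its built-in TTL support:

```python
# Cache-aside with TTL expiration, sketched over an in-memory dict.
import time

cache = {}           # key -> (value, expires_at)
TTL_SECONDS = 60

def fetch_from_db(key):
    return f"row-for-{key}"   # placeholder for a real query (hypothetical)

def get(key):
    entry = cache.get(key)
    if entry is not None:
        value, expires_at = entry
        if time.monotonic() < expires_at:
            return value                       # cache hit
    value = fetch_from_db(key)                 # miss: fall back to the database
    cache[key] = (value, time.monotonic() + TTL_SECONDS)   # then populate
    return value
```

The TTL bounds staleness: a write that bypasses `get` becomes visible to readers within `TTL_SECONDS` at the latest, which is exactly the trade-off TTL expiration makes.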
## Pre-Computation
Move work from the request path to background jobs:
- Materialized views — Pre-join and aggregate data so reads hit a single denormalized table.
- Pre-computed feeds — Fan-out-on-write: when a user posts, push the content into followers' pre-built timelines.
- Warming caches — Populate caches before traffic arrives (after deployments or predicted traffic spikes).
- Static site generation — Render pages at build time instead of on every request.
The trade-off is always freshness vs. speed. Pre-computed data may be stale, so define acceptable staleness windows for each use case.
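Fan-out-on-write is worth seeing concretely. Below is an in-memory sketch with made-up users: the write path pays one append per follower so the read path is a single precomputed lookup. A real feed system would persist the timelines and cap fan-out for accounts with millions of followers:

```python
# Fan-out-on-write: push a new post into each follower's pre-built
# timeline at write time, so reads need no joins or sorting.
from collections import defaultdict, deque

followers = {"alice": ["bob", "carol"]}              # who follows whom (made up)
timelines = defaultdict(lambda: deque(maxlen=100))   # follower -> recent posts

def publish(author, post):
    # Write-time work: one append per follower.
    for follower in followers.get(author, []):
        timelines[follower].appendleft(post)

def read_timeline(user):
    # Read path: a single lookup into precomputed data.
    return list(timelines[user])

publish("alice", "hello world")
```

The `maxlen` bound keeps each precomputed timeline small, another deliberate freshness-for-speed trade: old posts fall off rather than being recomputed on read.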
## CDN and Edge Computing
A CDN places content closer to users, reducing round-trip time:
- Static assets — Images, CSS, JavaScript. Cache at the edge with long TTLs and content-hash filenames for cache busting.
- Dynamic content at the edge — Edge functions (Cloudflare Workers, Vercel Edge Functions) run compute in the CDN PoP closest to the user.
- API response caching — Cache GET responses with appropriate Vary headers and short TTLs.
Key metrics and levers:
- Cache hit ratio — Aim above 90% for static content.
- Time-to-first-byte (TTFB) — How quickly the first byte reaches the client.
- Origin shield — A lever rather than a metric: an intermediate cache tier between edge PoPs and the origin that absorbs misses and reduces origin load.
## Connection Reuse
Establishing connections is expensive. Reuse them:
- HTTP keep-alive — Reuse TCP connections across multiple HTTP requests. Default in HTTP/1.1.
- HTTP/2 multiplexing — Multiple requests share a single TCP connection, eliminating head-of-line blocking at the HTTP layer.
- Connection pooling — Database connection pools (PgBouncer, HikariCP) avoid the overhead of TCP handshake + TLS negotiation + authentication per query.
- gRPC persistent connections — gRPC uses HTTP/2 under the hood; a single connection handles thousands of concurrent RPCs.
DNS resolution caching is often overlooked. Cache DNS lookups at the application level, honoring record TTLs, so every new connection does not pay for a fresh resolution.
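The pooling idea above can be sketched with a thread-safe queue. `Connection` here is a hypothetical stand-in for a real driver connection whose constructor would pay the TCP + TLS + auth cost; the point is that the expensive setup happens a fixed number of times, not once per query:

```python
# A minimal connection pool: pay connection setup cost once, up front,
# then check connections out and back in per request.
import queue

class Connection:
    opened = 0
    def __init__(self):
        Connection.opened += 1   # stands in for handshake + TLS + auth cost

class ConnectionPool:
    def __init__(self, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(Connection())   # create all connections eagerly

    def acquire(self):
        return self._pool.get()            # blocks if the pool is exhausted

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(size=2)
for _ in range(100):                       # 100 "queries"...
    conn = pool.acquire()
    pool.release(conn)
print(Connection.opened)                   # ...but only 2 connections opened
```

Real pools (PgBouncer, HikariCP) add health checks, idle timeouts, and maximum lifetimes on top of this core check-out/check-in loop.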
## Async Processing
Not everything needs to happen in the request path. Move non-critical work to background queues:
- Message queues — Kafka, SQS, RabbitMQ. Decouple producers from consumers.
- Event-driven processing — Emit an event ("order.placed") and let downstream services handle email, analytics, and inventory updates asynchronously.
- Deferred writes — Buffer writes in memory or a queue and flush to the database in batches.
- Webhooks over polling — Push notifications to clients instead of letting them poll repeatedly.
The user sees a fast response ("Your order is confirmed"), and the heavy lifting happens in the background.
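The pattern fits in a few lines with a queue and a background worker. This single-process sketch uses Python's standard library in place of a real broker like Kafka or SQS; the slow work is a made-up stand-in:

```python
# Deferring non-critical work: the request path enqueues and returns
# immediately; a background thread drains the queue.
import queue
import threading

tasks = queue.Queue()
results = []

def worker():
    while True:
        job = tasks.get()
        if job is None:                    # sentinel: shut the worker down
            break
        results.append(f"emailed:{job}")   # stand-in for slow work (email, etc.)
        tasks.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

def place_order(order_id):
    tasks.put(order_id)                    # fast: just enqueue
    return "Your order is confirmed"       # respond before the email is sent

place_order("order-42")
tasks.join()    # for the demo only; a real service never waits here
tasks.put(None)
t.join()
```

With a real broker, the queue also survives process restarts, which is what makes deferring durable-critical work (like order emails) safe.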
## Measurement and Profiling
You cannot optimize what you do not measure:
- Distributed tracing (Jaeger, Zipkin, Datadog APM) — See where time is spent across service boundaries.
- Flame graphs — Visualize CPU time by call stack to find hot paths.
- Continuous profiling (Pyroscope, Parca) — Profile in production without significant overhead.
- Synthetic monitoring — Simulate user requests from multiple regions to track latency trends.
Set latency budgets per endpoint. If your SLO is P99 under 200ms, break that budget across each downstream call and enforce it with timeouts.
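One way to enforce such a budget is deadline propagation: compute an absolute deadline at the edge and give each downstream call only whatever time remains. The sketch below assumes sequential calls and a hypothetical `call` helper standing in for an RPC client that honors a timeout:

```python
# Deadline propagation: each downstream call gets the remaining slice
# of an overall latency budget.
import time

BUDGET_SECONDS = 0.200   # overall target: respond within 200 ms

def call(service, timeout):
    # Stand-in for an RPC that would enforce `timeout` (hypothetical).
    return f"{service} (timeout={timeout * 1000:.0f} ms)"

def handle_request(services):
    deadline = time.monotonic() + BUDGET_SECONDS
    responses = []
    for svc in services:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("latency budget exhausted")
        responses.append(call(svc, timeout=remaining))
    return responses

handle_request(["auth", "profile", "pricing"])
```

gRPC propagates deadlines across hops natively; for plain HTTP, the remaining budget is typically forwarded in a request header so downstream services can shrink their own timeouts to match.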
## Practical Checklist
- Measure percentiles (P50, P95, P99) — not averages.
- Add caching at the layer closest to the user that still meets freshness requirements.
- Pre-compute expensive reads and denormalize hot paths.
- Use a CDN for static and cacheable dynamic content.
- Enable connection pooling and HTTP/2 everywhere.
- Move non-critical work off the request path with async queues.
- Set per-service timeout budgets to contain tail latency.
- Profile continuously — latency regressions creep in with every deploy.
## Key Takeaways
- Percentiles reveal what averages hide. Optimize for P99, not the median.
- Tail latency amplification means a single slow dependency can degrade every request in a fanout architecture.
- Caching is the highest-leverage optimization. Layer caches from browser to database.
- Pre-computation trades write-time work and freshness for read-time speed.
- Connection reuse eliminates repeated handshake overhead that adds up at scale.
- Async processing keeps the request path lean by deferring non-critical work.
- Measurement is not optional. Without distributed tracing and percentile tracking, you are optimizing blind.
Fast systems are not built by accident. They are the result of deliberate measurement, targeted caching, and disciplined separation of critical-path work from background processing.