DNS-Based Load Balancing — Distributing Traffic at the Edge
What is DNS-based load balancing?#
DNS-based load balancing distributes traffic across multiple servers by returning different IP addresses in response to DNS queries. Instead of a single A record pointing to one server, the authoritative DNS server answers with one or more addresses from a pool — steering clients to different endpoints before they even open a TCP connection.
This happens at the very edge of the request path, before any application-layer load balancer gets involved.
Why use DNS for load balancing?#
- Global distribution — route users to the nearest data centre without a central proxy
- No single point of failure — DNS is inherently distributed
- Layer independence — works with any protocol (HTTP, gRPC, TCP, UDP)
- Cost efficiency — no dedicated load balancer hardware for cross-region routing
- Massive scale — DNS handles billions of queries per day with minimal overhead
Round-robin DNS#
The simplest form of DNS load balancing. The authoritative DNS server rotates through a list of IP addresses, returning them in a different order for each query.
How it works#
; Zone file
api.example.com. 300 IN A 203.0.113.1
api.example.com. 300 IN A 203.0.113.2
api.example.com. 300 IN A 203.0.113.3
Query 1 returns: 203.0.113.1, 203.0.113.2, 203.0.113.3
Query 2 returns: 203.0.113.2, 203.0.113.3, 203.0.113.1
Query 3 returns: 203.0.113.3, 203.0.113.1, 203.0.113.2
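The rotation above can be sketched in a few lines of Python — a minimal model of an authoritative server that shifts the record set one position per query (the IPs are the illustrative pool from the zone file):

```python
from itertools import cycle

# Record pool from the example zone file above.
POOL = ["203.0.113.1", "203.0.113.2", "203.0.113.3"]

def rotated_responses(pool):
    """Yield the full record set, rotated one position per query."""
    n = len(pool)
    for start in cycle(range(n)):
        yield [pool[(start + i) % n] for i in range(n)]

responses = rotated_responses(POOL)
print(next(responses))  # ['203.0.113.1', '203.0.113.2', '203.0.113.3']
print(next(responses))  # ['203.0.113.2', '203.0.113.3', '203.0.113.1']
```

Note that the server still returns the *whole* set each time; which address a client actually uses depends on its own selection behaviour.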
Limitations#
- No health awareness — if a server goes down, DNS keeps returning its IP
- Uneven distribution — DNS caching means many clients share the same resolved IP
- No session affinity — consecutive requests from the same client may hit different servers
- Client behaviour varies — some clients always use the first IP, others pick randomly
Weighted DNS#
Weighted DNS assigns a weight to each record, controlling the probability that a particular IP is returned. This allows gradual traffic shifting — useful for canary deployments, capacity-proportional routing, and migrations.
Configuration example (Route 53)#
{
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "primary",
"Weight": 80,
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.1"}]
}
{
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "canary",
"Weight": 20,
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.2"}]
}
Roughly 80% of DNS responses return the primary IP and 20% the canary (caching skews the exact split seen by clients). Adjusting the weights over time enables zero-downtime migrations.
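The selection logic behind weighted records can be sketched as a probability-proportional pick — a toy model, not Route 53's actual implementation, using the IPs and weights from the config above:

```python
import random

# Weighted record set mirroring the Route 53 example above: (IP, weight).
RECORDS = [("203.0.113.1", 80), ("203.0.113.2", 20)]

def weighted_answer(records, rng=random):
    """Pick one record with probability proportional to its weight."""
    ips, weights = zip(*records)
    return rng.choices(ips, weights=weights, k=1)[0]

# Over many queries, roughly 80% of answers are the primary IP.
hits = sum(weighted_answer(RECORDS) == "203.0.113.1" for _ in range(10_000))
print(hits / 10_000)  # ~0.8
```

Shifting traffic for a canary rollout is then just editing the weight values, with no client-side changes.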
GeoDNS#
GeoDNS returns different IP addresses based on the geographic location of the DNS resolver (or the client, if EDNS Client Subnet is supported).
Use cases#
- Latency reduction — route European users to Frankfurt, US users to Virginia
- Data sovereignty — keep EU data in EU regions to comply with GDPR
- Content localisation — serve region-specific content from nearby servers
How location is determined#
- Resolver IP — the DNS server maps the recursive resolver's IP to a geographic location using a GeoIP database
- EDNS Client Subnet (ECS) — the resolver forwards a truncated version of the client's IP, giving the authoritative server a more accurate location
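The lookup step can be sketched as a longest-prefix-style match against a GeoIP table — the CIDR blocks, regions, and IPs below are illustrative, not real geography:

```python
import ipaddress

# Toy GeoIP table: CIDR prefix -> region (illustrative mappings only).
GEO_TABLE = {
    ipaddress.ip_network("198.51.100.0/24"): "eu-west",
    ipaddress.ip_network("203.0.113.0/24"): "us-east",
}
REGION_IPS = {"eu-west": "192.0.2.10", "us-east": "192.0.2.20"}
DEFAULT_REGION = "us-east"

def geo_answer(source_ip: str) -> str:
    """Map the resolver's IP (or the ECS subnet) to a regional endpoint."""
    addr = ipaddress.ip_address(source_ip)
    for net, region in GEO_TABLE.items():
        if addr in net:
            return REGION_IPS[region]
    return REGION_IPS[DEFAULT_REGION]  # fall back when no prefix matches

print(geo_answer("198.51.100.7"))  # 192.0.2.10 (eu-west)
```

With ECS, `source_ip` is the truncated client subnet rather than the resolver's address, which is what improves accuracy for users behind public resolvers.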
Limitations#
- GeoIP databases are imperfect — corporate VPNs and public DNS resolvers (8.8.8.8) can mislocate users
- ECS adoption is not universal
- Geographic proximity does not always equal network proximity
Latency-based routing#
Latency-based routing goes beyond geography. Instead of mapping IPs to locations, it measures actual network latency between clients and endpoints, then returns the IP with the lowest latency.
How Route 53 implements it#
- AWS continuously measures latency between its edge locations and each AWS region
- When a query arrives, Route 53 identifies the edge location closest to the resolver
- It returns the record associated with the lowest-latency region from that edge location
{
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "us-east-1",
"Region": "us-east-1",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.1"}]
}
{
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "eu-west-1",
"Region": "eu-west-1",
"TTL": 60,
"ResourceRecords": [{"Value": "203.0.113.2"}]
}
Latency-based vs GeoDNS#
| Aspect | GeoDNS | Latency-based |
|---|---|---|
| Routing signal | Geographic location | Measured network latency |
| Accuracy | Good for most cases | Better for edge cases (VPNs, anycast) |
| Setup complexity | Moderate | Lower (cloud-managed) |
| Provider examples | NS1, Cloudflare | Route 53, Azure Traffic Manager |
Health-check integration#
DNS load balancing without health checks is dangerous — it sends traffic to dead servers. Modern DNS providers integrate health checks directly into the routing decision.
Health check flow#
- The DNS provider periodically sends health-check probes to each endpoint (HTTP, HTTPS, TCP, or custom)
- If an endpoint fails consecutive checks, it is marked unhealthy
- The DNS server stops returning the unhealthy endpoint's IP
- When the endpoint recovers, it is gradually re-added
Configuration considerations#
- Check interval — 10–30 seconds is typical. Faster detection means more probe traffic.
- Failure threshold — require 2–3 consecutive failures before marking unhealthy to avoid flapping
- Recovery threshold — require 2–3 consecutive successes before marking healthy again
- Check path — use a dedicated health endpoint that verifies downstream dependencies
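The failure and recovery thresholds above amount to a small state machine: an endpoint flips state only after enough consecutive results in the opposite direction. A minimal sketch (parameter defaults are illustrative):

```python
class HealthChecker:
    """Flip healthy/unhealthy only after N consecutive contrary probe results."""

    def __init__(self, failure_threshold=3, recovery_threshold=3):
        self.failure_threshold = failure_threshold
        self.recovery_threshold = recovery_threshold
        self.healthy = True
        self._streak = 0  # consecutive probes contradicting the current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            self._streak = 0  # current state confirmed; reset the streak
        else:
            self._streak += 1
            threshold = (self.failure_threshold if self.healthy
                         else self.recovery_threshold)
            if self._streak >= threshold:
                self.healthy = probe_ok  # enough evidence: flip the state
                self._streak = 0
        return self.healthy

hc = HealthChecker(failure_threshold=2)
print([hc.record(ok) for ok in [True, False, False, False]])
# [True, True, False, False] — two consecutive failures flip the state
```

A single failed probe never removes an endpoint, which is exactly the anti-flapping behaviour the thresholds exist to provide.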
The TTL problem#
Even after DNS removes an unhealthy IP, clients with cached responses continue sending traffic to the dead server until their cached TTL expires. This is the fundamental limitation of DNS-based health checks.
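The worst case is easy to put numbers on. A back-of-envelope calculation, assuming the illustrative parameters below (not tied to any specific provider):

```python
check_interval_s = 30     # seconds between health-check probes
failure_threshold = 3     # consecutive failures before marking unhealthy
ttl_s = 60                # TTL on the cached record

# Detection alone can take interval * threshold seconds...
detection_s = check_interval_s * failure_threshold   # up to 90 s
# ...and clients that cached the answer just before removal lag by a full TTL.
worst_case_s = detection_s + ttl_s
print(worst_case_s)  # 150 s before the last client stops hitting the dead IP
```

This is why failover time budgets must include the TTL, not just the health-check detection window.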
Route 53 failover#
Route 53 offers a dedicated failover routing policy for active-passive setups:
Primary-secondary failover#
Primary record -> 203.0.113.1 (active, health-checked)
Secondary record -> 203.0.113.2 (standby, returned only when primary is unhealthy)
Multi-level failover#
Combine failover with other routing policies:
- Top level — latency-based routing to the nearest region
- Second level — failover within each region (primary AZ to secondary AZ)
- Third level — weighted routing within each AZ for canary deployments
This creates a routing tree where each level handles a different concern.
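Resolution through such a tree can be sketched as a walk from coarse to fine decisions — the tree below is a toy with made-up AZ names, weights, and IPs:

```python
import random

# Toy routing tree: region (latency) -> AZ (failover) -> weighted records.
TREE = {
    "eu-west-1": {
        "primary_az":   {"stable": ("203.0.113.1", 90),
                         "canary": ("203.0.113.3", 10)},
        "secondary_az": {"stable": ("203.0.113.2", 100)},
    },
}

def resolve(region: str, primary_healthy: bool, rng=random) -> str:
    """Walk the tree: region -> AZ failover -> weighted pick within the AZ."""
    az = "primary_az" if primary_healthy else "secondary_az"
    ips, weights = zip(*TREE[region][az].values())
    return rng.choices(ips, weights=weights, k=1)[0]

print(resolve("eu-west-1", primary_healthy=False))  # 203.0.113.2
```

Each level stays independently configurable: weights change for canaries, health checks drive the AZ failover, and latency measurements pick the region.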
DNS TTL strategies#
TTL (Time To Live) controls how long resolvers and clients cache a DNS response. It is the single most important parameter in DNS-based load balancing.
Low TTL (30–60 seconds)#
- Pro — fast failover, traffic shifts take effect quickly
- Con — higher query volume to authoritative servers, slightly higher latency for first request after cache expiry
High TTL (300–3600 seconds)#
- Pro — fewer DNS queries, faster resolution for cached clients
- Con — slow failover, stale records served during outages
Recommended strategy#
| Scenario | Recommended TTL |
|---|---|
| Active-passive failover | 30–60 seconds |
| Latency-based routing | 60 seconds |
| Stable multi-region | 300 seconds |
| Static content CDN | 3600 seconds |
| During a migration | 30 seconds (lower before, raise after) |
TTL floor reality#
Many recursive resolvers enforce a minimum TTL (often 30 seconds). Some corporate resolvers cache for much longer regardless of the TTL you set. Plan for worst-case cache staleness, not just the TTL you configure.
Combining DNS with application-layer load balancing#
DNS load balancing works best as the first layer in a multi-tier strategy:
- DNS layer — GeoDNS or latency-based routing directs users to the nearest region
- Edge layer — a regional load balancer (ALB, Envoy) handles TLS termination and HTTP routing
- Service layer — service mesh or internal load balancer distributes traffic across pods
DNS handles coarse-grained, cross-region routing. Application-layer load balancers handle fine-grained, request-level decisions.
Common pitfalls#
- Ignoring DNS caching — always account for cached TTLs in failover time calculations
- No health checks — round-robin DNS without health checks is a ticking time bomb
- Over-relying on DNS — DNS cannot do connection draining, rate limiting, or request-level routing
- Forgetting EDNS Client Subnet — without ECS, GeoDNS accuracy drops for users behind public resolvers
- TTL too high during migrations — lower TTL well before the migration, not during it