DNS Architecture: How Domain Resolution Works at Scale
Every request starts with DNS. Before your browser connects to any server, it must resolve a domain name (codelit.io) to an IP address (104.21.32.15). This lookup happens billions of times per day across the internet.
Understanding DNS architecture is essential for system design — it's where latency starts, where failover happens, and where global routing decisions are made.
How DNS Resolution Works#
1. Browser: "What's the IP for codelit.io?"
2. Browser cache → miss
3. OS resolver cache → miss
4. Recursive resolver (ISP or 1.1.1.1) → miss
5. Root nameserver → "Ask .io TLD server"
6. .io TLD nameserver → "Ask ns1.cloudflare.com"
7. Authoritative nameserver → "codelit.io = 104.21.32.15"
8. Recursive resolver caches result (TTL: 300s)
9. Browser connects to 104.21.32.15
Typical total time: roughly 20-100ms for an uncached lookup, under 1ms when the answer comes from a local cache.
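The lookup chain above can be sketched as a toy recursive resolver with a TTL cache. This is a minimal illustration, not a real resolver: the `ROOT`, `TLD`, and `AUTHORITATIVE` dictionaries and the `RecursiveResolver` class are hypothetical stand-ins for the nameservers in the walkthrough, and no network traffic is involved.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

# Toy delegation data standing in for real nameservers (illustrative only).
ROOT = {"io": "tld-io"}                                  # root: TLD -> TLD server
TLD = {"tld-io": {"codelit.io": "ns1.cloudflare.com"}}   # TLD: domain -> authoritative NS
AUTHORITATIVE = {"ns1.cloudflare.com": {"codelit.io": ("104.21.32.15", 300)}}


@dataclass
class RecursiveResolver:
    cache: dict = field(default_factory=dict)  # name -> (ip, expires_at)
    upstream_queries: int = 0                  # root + TLD + authoritative lookups

    def resolve(self, name: str, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        hit = self.cache.get(name)
        if hit and hit[1] > now:
            return hit[0]                      # cache hit: no upstream traffic
        # Cache miss: walk root -> TLD -> authoritative, as in steps 5-7.
        tld_server = ROOT[name.rsplit(".", 1)[-1]]
        auth_server = TLD[tld_server][name]
        ip, ttl = AUTHORITATIVE[auth_server][name]
        self.upstream_queries += 3
        self.cache[name] = (ip, now + ttl)     # step 8: cache for TTL seconds
        return ip
```

Calling `resolve("codelit.io")` twice within 300 seconds triggers the three upstream lookups only once; after the TTL expires, the next call walks the chain again.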
DNS Record Types#
| Record | Purpose | Example |
|---|---|---|
| A | Domain → IPv4 address | codelit.io → 104.21.32.15 |
| AAAA | Domain → IPv6 address | codelit.io → 2606:4700::6812:200f |
| CNAME | Domain → another domain (alias) | www.codelit.io → codelit.io |
| MX | Mail server for domain | codelit.io → smtp.google.com (priority 10) |
| TXT | Arbitrary text (verification, SPF) | v=spf1 include:_spf.google.com ~all |
| NS | Nameserver for domain | codelit.io → ns1.cloudflare.com |
| SOA | Start of authority (zone metadata) | Serial, refresh, retry, expire |
| SRV | Service location (port + host) | _sip._tcp.example.com → sip.example.com:5060 |
| CAA | Certificate authority authorization | codelit.io → letsencrypt.org |
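To make the relationship between A, AAAA, and CNAME records concrete, here is a small sketch that models a zone as a list of records and follows aliases during lookup. The `Record` type, the `ZONE` list, and the `lookup` helper are hypothetical names for illustration; real resolvers handle many more cases.

```python
from typing import NamedTuple


class Record(NamedTuple):
    name: str
    rtype: str
    value: str
    ttl: int = 300


# Sample data taken from the table above.
ZONE = [
    Record("codelit.io", "A", "104.21.32.15"),
    Record("codelit.io", "AAAA", "2606:4700::6812:200f"),
    Record("www.codelit.io", "CNAME", "codelit.io"),
]


def lookup(name, rtype, zone=ZONE, max_hops=8):
    """Return all values of type `rtype` for `name`, following CNAME aliases."""
    for _ in range(max_hops):
        exact = [r.value for r in zone if r.name == name and r.rtype == rtype]
        if exact:
            return exact
        alias = [r.value for r in zone if r.name == name and r.rtype == "CNAME"]
        if not alias:
            return []          # no record of that type and no alias to follow
        name = alias[0]        # restart the lookup at the alias target
    raise RuntimeError("CNAME chain too long or looping")
```

Here `lookup("www.codelit.io", "A")` follows the alias to `codelit.io` and returns its A record.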
TTL Strategies#
TTL (Time To Live) controls how long DNS responses are cached:
| TTL | Use Case | Trade-off |
|---|---|---|
| 60s | Failover, dynamic routing | More DNS queries, faster propagation |
| 300s (5 min) | Default for most sites | Good balance |
| 3600s (1 hour) | Stable services | Fewer queries, slower changes |
| 86400s (1 day) | Rarely-changing records (MX, TXT) | Minimal queries, slow propagation |
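Before a migration: lower the TTL to 60s at least one full old-TTL period ahead of the cutover (24 hours covers TTLs up to 86400s), so that cached answers carrying the old TTL expire in time. After the migration, restore the normal TTL.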
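The timing rule can be written as simple arithmetic: a resolver that cached the record just before the TTL was lowered keeps it for the full old TTL. The sketch below assumes resolvers honor TTLs; the function names and the optional `margin_s` parameter are illustrative.

```python
from datetime import datetime, timedelta


def ttl_lowering_deadline(cutover: datetime, old_ttl_s: int, margin_s: int = 0) -> datetime:
    """Latest moment to publish the lowered TTL so every cached copy
    carrying the old TTL has expired by the cutover."""
    return cutover - timedelta(seconds=old_ttl_s + margin_s)


def worst_case_stale_window(current_ttl_s: int) -> timedelta:
    """How long a resolver may keep serving the old answer after a change."""
    return timedelta(seconds=current_ttl_s)
```

For a record with an 86400s TTL and a cutover at noon on 2 January, the lowered TTL has to be published by noon on 1 January. Once the 60s TTL is in effect, the stale window after the change shrinks to about a minute.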
DNS for High Availability#
GeoDNS (Geographic Routing)#
Route users to the nearest data center:
User (Tokyo) → DNS → 35.194.100.1 (Asia PoP)
User (NYC) → DNS → 34.102.200.1 (US East PoP)
User (London) → DNS → 35.227.150.1 (EU PoP)
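How it works: the authoritative DNS server estimates the user's location from the recursive resolver's IP address (or from the EDNS Client Subnet option, when the resolver sends it) and returns the IP of the nearest server.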
Tools: Cloudflare, Route 53, NS1
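A minimal sketch of the "nearest server" decision, assuming the provider has already mapped the resolver's IP to coordinates. The PoP IPs come from the example above; the coordinates and the `nearest_pop` helper are assumptions for illustration. Real providers typically also use latency measurements and network topology, not just distance.

```python
import math

# PoP IPs from the example above; coordinates are rough assumptions.
POPS = {
    "35.194.100.1": (35.68, 139.69),   # Asia PoP, assumed near Tokyo
    "34.102.200.1": (39.04, -77.49),   # US East PoP, assumed near N. Virginia
    "35.227.150.1": (51.51, -0.13),    # EU PoP, assumed near London
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))


def nearest_pop(client_location, pops=POPS):
    """Return the IP of the PoP closest to the client's (lat, lon)."""
    return min(pops, key=lambda ip: haversine_km(client_location, pops[ip]))
```

With these assumed coordinates, a client located in New York resolves to the US East PoP and a client in Paris to the EU PoP.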
DNS Failover#
Automatically remove unhealthy servers:
Normal: api.example.com → [server-1 (healthy), server-2 (healthy)]
Failure: api.example.com → [server-2 (healthy)]
(server-1 removed after health check fails)
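Health checks: the DNS provider probes each server every 30s and removes unhealthy IPs from responses. Clients keep using a cached answer until its TTL expires, so failover is only as fast as the TTL allows.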
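The removal logic might look like the sketch below. The `HealthTracker` class and the threshold of three consecutive failures are assumptions, not any provider's documented behavior. Requiring several failures in a row avoids dropping a server on a single lost probe.

```python
from collections import defaultdict


class HealthTracker:
    def __init__(self, servers, fail_threshold=3):
        self.servers = list(servers)
        self.fail_threshold = fail_threshold   # assumed value, tune per service
        self.failures = defaultdict(int)       # server -> consecutive failures

    def record(self, server, ok):
        """Store one probe result; a success resets the failure streak."""
        self.failures[server] = 0 if ok else self.failures[server] + 1

    def answer(self):
        """Servers to include in the DNS response."""
        healthy = [s for s in self.servers
                   if self.failures[s] < self.fail_threshold]
        # Fail open: if every server looks down, return all of them rather
        # than an empty answer (the checker itself may be the broken part).
        return healthy or self.servers
```

After three failed probes for `server-1`, `answer()` returns only `server-2`; a single successful probe puts `server-1` back in rotation.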
Weighted DNS#
Distribute traffic by percentage:
api.example.com:
70% → server-1 (primary)
20% → server-2 (secondary)
10% → server-3 (canary)
Use case: Canary deploys, gradual migrations, A/B testing.
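A weighted answer can be sketched with a weighted random choice per query. The `WEIGHTS` mapping mirrors the split above; `pick_weighted` is a hypothetical helper. Because resolvers cache answers, the traffic split observed in practice only approximates the configured weights.

```python
import random

WEIGHTS = {"server-1": 70, "server-2": 20, "server-3": 10}


def pick_weighted(weights=WEIGHTS, rng=random):
    """Pick one server with probability proportional to its weight."""
    servers = list(weights)
    return rng.choices(servers, weights=[weights[s] for s in servers], k=1)[0]
```

Over many queries the picks converge on roughly 70/20/10. Passing a seeded `random.Random` instance as `rng` makes the selection reproducible in tests.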
Multi-value Routing#
Return multiple IPs and let the client choose:
api.example.com → [10.0.1.1, 10.0.2.1, 10.0.3.1]
Client picks one randomly → basic load balancing
Simpler than a load balancer, but no health awareness at the client.
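Since DNS itself does not tell the client which of the returned IPs is up, a client can compensate by trying them in random order. The sketch below assumes a caller-supplied `try_connect` callable that returns `True` on success; both it and `connect_any` are illustrative names.

```python
import random


def connect_any(ips, try_connect, rng=random):
    """Try the returned IPs in random order; return the first that connects,
    or None if all attempts fail."""
    candidates = list(ips)
    rng.shuffle(candidates)        # random order spreads load across servers
    for ip in candidates:
        if try_connect(ip):
            return ip
    return None
```

For example, `connect_any(["10.0.1.1", "10.0.2.1", "10.0.3.1"], try_connect)` skips over any address where `try_connect` fails, at the cost of one connection timeout per dead server.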
DNS Security#
DNSSEC#
Cryptographically sign DNS records to prevent spoofing:
Resolver → asks for codelit.io A record
Authoritative → returns record + RRSIG (signature)
Resolver → verifies signature with public key → trusted!
Without DNSSEC, attackers can poison DNS caches and redirect traffic.
DNS over HTTPS (DoH) / DNS over TLS (DoT)#
Encrypt DNS queries to prevent ISP snooping:
Traditional: DNS query in plaintext → ISP can see every domain you visit
DoH: DNS query over HTTPS → encrypted, ISP cannot read the query (it can still see the IP addresses you connect to)
Resolvers with DoH: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9)
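Under the hood, a DoH GET request carries an ordinary DNS wire-format query, base64url-encoded without padding, in a `dns` query parameter (per RFC 8484). The sketch below only builds that parameter and sends nothing; `build_query` and `doh_get_param` are hypothetical helper names.

```python
import base64
import struct


def build_query(name: str, qtype: int = 1, query_id: int = 0) -> bytes:
    """Encode a single-question DNS query in wire format (qtype 1 = A)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 = IN


def doh_get_param(name: str) -> str:
    """base64url-encode the query without padding, for use as ?dns=<value>."""
    return base64.urlsafe_b64encode(build_query(name)).rstrip(b"=").decode("ascii")


print(doh_get_param("www.example.com"))
# AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
```

The resulting string is appended to the resolver's DoH endpoint as `?dns=...` and sent with an `Accept: application/dns-message` header. The ID is set to 0 here, which RFC 8484 recommends to keep responses cache-friendly.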
CAA Records#
Control which CAs can issue certificates for your domain:
codelit.io CAA 0 issue "letsencrypt.org"
codelit.io CAA 0 issue "cloudflare.com"
Prevents unauthorized certificate issuance.
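The check a CA performs before issuing can be approximated as follows. This is a simplified sketch: it ignores `issuewild`, the critical flag, and the climb to parent domains when a name has no CAA records of its own. The `ca_may_issue` helper and the tuple layout are assumptions.

```python
def ca_may_issue(caa_records, ca_domain):
    """caa_records: list of (flags, tag, value) tuples for the domain.
    Returns True if `ca_domain` is allowed to issue under these records."""
    issuers = [
        value.split(";")[0].strip().lower()    # drop optional parameters
        for _flags, tag, value in caa_records
        if tag == "issue"
    ]
    if not issuers:
        return True        # no issue records: CAA does not restrict issuance
    return ca_domain.lower() in issuers
```

With the two records above, `letsencrypt.org` is permitted and any other CA is refused. A record with the value `";"` names no issuer at all, so it forbids issuance by every CA.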
DNS at Scale — Real Examples#
Netflix#
- Route 53 for GeoDNS routing to nearest CDN
- TTL: 60s for fast failover between regions
- Health checks every 10s on all endpoints
- Weighted routing for gradual region migrations
Cloudflare#
- Anycast DNS — same IP announced from 300+ PoPs worldwide
- User's BGP route determines which PoP handles the query
- < 10ms DNS resolution globally
GitHub#
- Multiple A records for load distribution
- CNAME for Pages: username.github.io → GitHub's CDN
- CAA records for Let's Encrypt and DigiCert only
Architecture Patterns#
Simple Web App DNS#
example.com A → Load Balancer IP
www.example.com CNAME → example.com
api.example.com A → API Load Balancer IP
example.com MX → Google Workspace
Multi-Region with Failover#
app.example.com → GeoDNS
US users → us-east LB (primary)
→ us-west LB (failover)
EU users → eu-west LB (primary)
→ us-east LB (failover)
Health checks: every 30s
Failover TTL: 60s
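The routing table above reduces to an ordered preference list per region, plus a rough bound on how long a failover takes to reach clients. In this sketch the `ROUTES` mapping mirrors the diagram, while `pick_lb`, `worst_case_failover_s`, and the threshold of three failed checks are assumptions.

```python
# Ordered preferences per region: primary first, then failover.
ROUTES = {
    "US": ["us-east", "us-west"],
    "EU": ["eu-west", "us-east"],
}


def pick_lb(region, healthy, routes=ROUTES):
    """Return the first healthy load balancer for the region."""
    for lb in routes[region]:
        if lb in healthy:
            return lb
    return routes[region][0]   # fail open when nothing is marked healthy


def worst_case_failover_s(check_interval_s=30, fail_threshold=3, ttl_s=60):
    """Detection time (consecutive failed checks) plus cache expiry."""
    return check_interval_s * fail_threshold + ttl_s
```

Under these assumptions, a client could keep hitting a dead region for up to 150 seconds: 90 seconds until the health checks trip, then up to 60 seconds of cached answers.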
Microservices Internal DNS#
Kubernetes internal:
user-service.default.svc.cluster.local → ClusterIP
order-service.default.svc.cluster.local → ClusterIP
CoreDNS resolves service names to the Service's ClusterIP (pod IPs only for headless Services)
No external DNS needed for internal traffic
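The naming scheme is regular enough to generate. The sketch below builds a Service FQDN and lists the names a pod's resolver would try for a short name, assuming the usual pod `resolv.conf` defaults (a three-entry search list and `ndots:5`). The helper names are hypothetical, and clusters can override both the cluster domain and these defaults.

```python
def service_fqdn(service, namespace="default", cluster_domain="cluster.local"):
    """Fully qualified in-cluster name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def search_candidates(name, namespace="default",
                      cluster_domain="cluster.local", ndots=5):
    """Names tried in order for `name`, given the assumed search list."""
    if name.endswith("."):
        return [name.rstrip(".")]              # absolute name: no search list
    search = [f"{namespace}.svc.{cluster_domain}",
              f"svc.{cluster_domain}",
              cluster_domain]
    expanded = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        return [name] + expanded               # enough dots: try as-is first
    return expanded + [name]
```

One consequence: looking up an external name such as `codelit.io` from a pod first tries three in-cluster names that all miss. A trailing dot (`codelit.io.`) skips the search list.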
Common Mistakes#
- Forgetting to lower TTL before migrations — changes take hours to propagate
- CNAME at apex: example.com CNAME doesn't work (use ALIAS/ANAME or flattening)
- No CAA records: any CA can issue certificates for your domain
- Single nameserver — always use at least 2 NS providers
- Ignoring DNSSEC — DNS spoofing is a real attack vector
Summary#
- DNS is the first thing that happens — optimize it
- GeoDNS for global routing to nearest data center
- Health check + failover for automatic recovery
- Low TTL (60s) before migrations, normal TTL (300s) after
- DNSSEC + CAA for security
- Internal DNS (CoreDNS/Consul) for service discovery in K8s
Design your DNS and networking architecture at codelit.io — generate interactive diagrams with infrastructure exports.