Design a Web Crawler — Architecture, Politeness, and Scale
What makes crawling hard#
A web crawler sounds simple: fetch pages, extract links, repeat. But at scale, you face rate limiting, duplicate URLs, infinite loops, dynamic content, politeness rules, and petabytes of storage.
Google crawls billions of pages. Even a modest crawler for a company's internal tools needs careful architecture.
Core components#
URL Frontier#
A priority queue of URLs to crawl. Not a simple FIFO — priorities matter:
- Freshness priority — News sites get crawled every few minutes, static pages monthly
- Importance priority — High PageRank pages first
- Politeness queues — Separate queue per domain to avoid overwhelming servers
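These ideas can be combined into one structure: a heap ordered by each domain's next-allowed fetch time, plus a FIFO of URLs per domain. A minimal sketch (the default 1-second delay and the single priority field are illustrative simplifications):

```python
import heapq
import time
from collections import defaultdict, deque

class URLFrontier:
    """Sketch of a frontier with per-domain politeness queues.
    A real frontier would also track freshness and importance scores."""

    def __init__(self, crawl_delay=1.0):
        self.domain_queues = defaultdict(deque)  # domain -> FIFO of URLs
        self.ready_heap = []                     # (next_fetch_time, priority, domain)
        self.crawl_delay = crawl_delay

    def add(self, domain, url, priority=0):
        if not self.domain_queues[domain]:
            # Domain not scheduled yet: it is ready to fetch immediately
            heapq.heappush(self.ready_heap, (time.monotonic(), priority, domain))
        self.domain_queues[domain].append(url)

    def next_url(self):
        """Return a URL whose domain is past its politeness delay, or None."""
        while self.ready_heap:
            ready_at, priority, domain = self.ready_heap[0]
            if ready_at > time.monotonic():
                return None  # no domain is ready yet
            heapq.heappop(self.ready_heap)
            queue = self.domain_queues[domain]
            if not queue:
                continue  # stale heap entry
            url = queue.popleft()
            if queue:  # re-schedule the domain after its crawl delay
                heapq.heappush(self.ready_heap,
                               (time.monotonic() + self.crawl_delay,
                                priority, domain))
            return url
        return None
```

Because a domain re-enters the heap only after its delay elapses, no amount of queued URLs can make the crawler hammer a single server.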
Fetcher#
HTTP client that downloads pages. Key considerations:
- Robots.txt — Check and respect rules before crawling any domain
- Crawl delay — Honor the delay specified in robots.txt
- DNS resolution — Cache DNS lookups to avoid bottlenecks
- Timeout handling — Don't wait forever for slow servers
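The robots.txt check and a timeout-bounded download can be sketched with the standard library alone (the User-Agent string is a placeholder; a real fetcher caches one parsed ruleset and one DNS result per domain rather than refetching them):

```python
import urllib.request
import urllib.robotparser

def allowed(robots_txt, user_agent, url):
    """Check a URL against already-downloaded robots.txt text.
    In practice, cache one parsed ruleset per domain."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

def fetch(url, user_agent="MyCrawler/1.0 (+https://example.com/bot-info)",
          timeout=10):
    """Download a page with an identifying User-Agent and a hard timeout."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

`RobotFileParser` also exposes `crawl_delay()`, which feeds directly into the frontier's per-domain delay.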
Content Parser#
Extract useful data from downloaded pages:
- Link extraction — Find all URLs in the page (href, src, srcset)
- Content extraction — Strip HTML, extract text, metadata, images
- Duplicate detection — SimHash or MinHash to detect near-duplicate content
- Content type — Handle HTML, PDF, XML, JSON differently
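Link extraction with the standard-library HTML parser might look like this sketch; production crawlers usually prefer a fault-tolerant parser such as lxml, since real pages are rarely well-formed:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href/src/srcset attributes, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))
            elif name == "srcset" and value:
                # srcset holds comma-separated "url descriptor" candidates
                for candidate in value.split(","):
                    self.links.append(
                        urljoin(self.base_url, candidate.strip().split()[0]))

p = LinkExtractor("https://example.com/page")
p.feed('<a href="/about">About</a> <img src="logo.png">')
print(p.links)  # ['https://example.com/about', 'https://example.com/logo.png']
```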
URL Deduplication#
The same page can be reached via many URLs. Normalize and deduplicate:
https://example.com/page
https://example.com/page/
https://example.com/page?ref=twitter
https://www.example.com/page
→ All normalize to: https://example.com/page
Use a hash set for exact O(1) "have I seen this URL before?" checks, or a Bloom filter when memory is tight. A Bloom filter's small false-positive rate means a few unseen URLs get wrongly skipped, which is an acceptable trade-off for crawling.
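The normalization rules implied by the example above can be sketched like this (the tracking-parameter list is illustrative; real canonicalizers apply many more rules, such as resolving `.`/`..` segments and sorting query parameters):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list; real crawlers maintain a much larger one
TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}

def normalize(url):
    """Canonicalize a URL: lowercase scheme/host, strip 'www.',
    drop tracking parameters, remove the trailing slash, drop fragments."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))
```

All four example URLs above collapse to `https://example.com/page` under these rules.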
Architecture at scale#
URL Frontier (Redis/Kafka)
↓
Fetcher Workers (100+ parallel)
↓
Content Parser
↓
Link Extractor → URL Dedup → URL Frontier (loop)
↓
Content Store (S3/HDFS)
↓
Indexer
Distributed crawling#
Split work across machines:
- Partition by domain — Each worker handles specific domains (consistent hashing)
- Centralized frontier — Kafka topic or Redis queue shared by all workers
- Dedup service — Shared Bloom filter or distributed hash table
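Consistent hashing keeps most domain-to-worker assignments stable when a worker joins or leaves. A toy ring, assuming MD5 as the hash and 100 virtual nodes per worker (both arbitrary choices):

```python
import bisect
import hashlib

class DomainPartitioner:
    """Sketch of a consistent-hash ring: each worker owns several virtual
    points; a domain maps to the first point clockwise from its hash."""

    def __init__(self, workers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w)
            for w in workers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def worker_for(self, domain):
        i = bisect.bisect(self.keys, self._hash(domain)) % len(self.ring)
        return self.ring[i][1]
```

Removing a worker only reassigns the domains that mapped to its points; everything else stays put, so in-flight politeness state does not reshuffle across the whole fleet.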
Politeness policies#
Critical. An impolite crawler gets blocked, rate-limited, or causes legal problems.
- Respect robots.txt — Always check before crawling
- Crawl delay — Wait between requests to same domain (1-10 seconds)
- Identify yourself — Set a descriptive User-Agent with contact info
- Back off on errors — Exponential backoff on 429/5xx responses
- Limit concurrent connections — Max 1-2 per domain at a time
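The back-off rule can be sketched as a retry loop with full jitter; the `fetch` callable returning a `(status, body)` pair is a hypothetical interface standing in for your HTTP client:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_attempts=5, base=1.0, cap=60.0):
    """Retry on 429/5xx with full-jitter exponential backoff:
    sleep a random amount up to min(cap, base * 2**attempt)."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 429 or 500 <= status < 600:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        else:
            return None  # other 4xx: retrying won't help
    return None  # give up; re-queue the URL for a later crawl cycle
```

Jitter matters: without it, many workers that got rate-limited together retry together and hit the server in synchronized waves.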
Handling traps#
Spider traps — Infinite URLs generated dynamically:
/calendar/2026/03/24
/calendar/2026/03/25
/calendar/2026/03/26
... infinite dates
Solutions:
- Maximum URL depth limit (e.g., 15 path segments)
- Maximum pages per domain per crawl
- URL pattern detection (detect repeating structures)
- Domain-level timeout
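The first two limits can be enforced before a URL ever enters the frontier; both thresholds below are the illustrative values from the list above:

```python
from urllib.parse import urlsplit

MAX_DEPTH = 15                   # path segments, per the suggestion above
MAX_PAGES_PER_DOMAIN = 100_000   # illustrative per-crawl cap

def looks_like_trap(url, pages_seen_for_domain):
    """Reject URLs that are too deep or belong to an exhausted domain."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    if pages_seen_for_domain >= MAX_PAGES_PER_DOMAIN:
        return True
    return False
```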
Storage estimation#
Crawling 1 billion pages:
- Average page: 100KB compressed
- Total storage: 100TB
- URLs to track: ~10 billion — roughly 10× as many URLs as fetched pages, because of duplicate URLs and links that are discovered but never crawled
- URL storage: ~500GB (50 bytes per URL hash)
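The arithmetic behind these numbers:

```python
pages = 1_000_000_000
avg_page_kb = 100        # compressed
urls = 10 * pages        # discovered URLs, per the estimate above
bytes_per_url = 50       # URL hash plus minimal metadata

content_tb = pages * avg_page_kb / 1e9  # KB -> TB
url_gb = urls * bytes_per_url / 1e9     # bytes -> GB
print(content_tb, url_gb)  # 100.0 TB of content, 500.0 GB of URL state
```

Note the asymmetry: page content dominates storage by two orders of magnitude, but the URL set is what must live in fast, random-access storage for deduplication.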
Visualize your crawler architecture#
See how the URL frontier, fetchers, parsers, and storage connect — try Codelit to generate an interactive diagram of a web crawler system.
Key takeaways#
- URL Frontier is the brain — priority queue with politeness partitioning
- Bloom filters for O(1) URL deduplication with minimal memory
- Robots.txt is not optional — always respect it
- Partition by domain for distributed crawling
- SimHash for near-duplicate detection — catch pages with minor differences
- Spider traps will break naive crawlers — depth limits are essential