Web Crawler Architecture: Designing a Scalable Internet Crawler
Search engines, price comparison tools, and AI training pipelines all depend on web crawlers. Designing a crawler that can fetch billions of pages while respecting site policies and handling failures is a foundational system design problem. This guide walks through the complete architecture — from URL frontier to distributed crawling at scale.
High-Level Architecture#
A web crawler consists of four core components operating in a loop:
URL Frontier → Fetcher → Parser → Content Store
     ↑                               │
     └─────────── New URLs ──────────┘
- URL Frontier — a priority queue of URLs to crawl.
- Fetcher — downloads pages via HTTP.
- Parser — extracts content and discovers new URLs.
- Content Store — persists raw HTML and extracted data.
The parser feeds discovered URLs back into the frontier, creating the crawl loop.
The URL Frontier#
The frontier is far more than a simple queue. It enforces politeness, priority, and freshness.
Priority Queue#
Not all URLs are equally important. Assign priority based on:
- PageRank or domain authority — crawl high-value sites first.
- Update frequency — news sites change hourly; corporate about pages change yearly.
- Depth — pages closer to the root tend to be more important.
Use a multi-level priority queue where each level drains at a different rate.
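As a sketch in Python (the level weights are illustrative assumptions), a multi-level frontier can drain its levels by weighted random selection, so every level makes progress but higher-priority levels drain faster:

```python
import random
from collections import deque

class MultiLevelFrontier:
    """Sketch of a multi-level priority queue. Level 0 is highest
    priority; each level is drained in proportion to its weight."""

    def __init__(self, weights=(0.6, 0.3, 0.1)):
        self.levels = [deque() for _ in weights]
        self.weights = weights

    def push(self, url, level):
        self.levels[level].append(url)

    def pop(self):
        # Consider only non-empty levels, weighted by their drain rate.
        candidates = [(q, w) for q, w in zip(self.levels, self.weights) if q]
        if not candidates:
            return None
        queues, weights = zip(*candidates)
        chosen = random.choices(queues, weights=weights)[0]
        return chosen.popleft()
```

Because selection is probabilistic rather than strict, low-priority levels cannot be starved indefinitely.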
Politeness Enforcement#
A single-threaded per-host queue ensures the crawler never hammers one server. The frontier maps each host to its own FIFO queue and enforces a minimum delay between requests to the same host.
frontier/
  priority_queue → host_router → per_host_queues
                                  ├── example.com  [url1, url2]
                                  ├── news.org     [url3]
                                  └── shop.io      [url4, url5]
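A minimal sketch of the per-host politeness logic in Python; the one-second default delay is an assumption (a real crawler would honor each host's Crawl-delay where present):

```python
import time
from collections import defaultdict, deque

class PoliteHostQueues:
    """Sketch: one FIFO queue per host, plus a minimum delay between
    consecutive requests to the same host."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.queues = defaultdict(deque)        # host -> FIFO of URLs
        self.next_allowed = defaultdict(float)  # host -> earliest fetch time

    def enqueue(self, host, url):
        self.queues[host].append(url)

    def next_url(self, now=None):
        """Return (host, url) for some host whose delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for host, q in self.queues.items():
            if q and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.min_delay
                return host, q.popleft()
        return None
```

The clock is injectable via `now` so the scheduling behavior can be tested without sleeping.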
Back Queue vs Front Queue#
A common design splits the frontier into:
- Front queues — multiple queues bucketed by priority.
- Back queues — one queue per host, enforcing politeness.
A selector pulls from front queues by priority, then routes to the appropriate back queue.
Respecting robots.txt#
Before crawling any host, fetch and cache its /robots.txt. This file specifies:
- Which paths are disallowed for your user-agent.
- A Crawl-delay directive (seconds between requests).
- Sitemap locations for discovery.
Cache robots.txt with a TTL of 24 hours. If the file is unavailable (HTTP 5xx), back off and retry. If it returns 404, assume full access.
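Python's standard-library urllib.robotparser handles the parsing and matching; the robots.txt body below is illustrative:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body we have already fetched and cached.
robots_body = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_body.splitlines())

rp.can_fetch("CodelitBot/1.0", "https://example.com/page")       # allowed
rp.can_fetch("CodelitBot/1.0", "https://example.com/private/x")  # disallowed
rp.crawl_delay("CodelitBot/1.0")                                 # 2 seconds
```

Keeping the parsed RobotFileParser object in the per-host cache avoids re-parsing on every URL check.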
The Fetcher#
The fetcher downloads pages and handles the messy realities of the web.
HTTP Client Configuration#
- Timeout — 30 seconds for connect, 60 seconds for read.
- Retries — exponential backoff with jitter, max 3 attempts.
- User-Agent — identify your crawler honestly (e.g., CodelitBot/1.0).
- Redirect handling — follow up to 5 redirects, then abort.
- Compression — request gzip/brotli to reduce bandwidth.
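The retry policy above can be sketched as a transport-agnostic helper in Python; `fetch` is any callable that raises on transient failure, and the delay parameters are illustrative:

```python
import random
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0,
                       sleep=time.sleep):
    """Sketch of retries with exponential backoff and full jitter.
    `sleep` is injectable so the logic is testable without waiting."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random amount up to base * 2^attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Jitter matters at crawler scale: without it, thousands of workers retrying a flaky host back off in lockstep and hammer it again simultaneously.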
Handling JavaScript-Rendered Pages#
Modern SPAs render content client-side. A basic HTTP fetch returns an empty shell.
Option 1: Headless Browser Use Puppeteer or Playwright to render pages. Expensive — each render takes 2–5 seconds and significant memory. Reserve for high-value domains.
Option 2: Dynamic Rendering Service A service like Rendertron sits between the fetcher and the web. It detects JS-heavy pages and renders them on demand, caching the output.
Option 3: Hybrid Approach First attempt a static fetch. If the HTML body is suspiciously small (< 1KB) or lacks expected content markers, re-fetch with a headless browser.
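The hybrid heuristic might look like this in Python; the 1 KB threshold and the content markers are assumptions to tune per corpus:

```python
def needs_js_render(html: str, min_bytes: int = 1024,
                    markers=("<article", "<h1")) -> bool:
    """Heuristic sketch: flag a page for headless re-rendering when the
    static HTML is suspiciously small or lacks expected content markers."""
    if len(html.encode("utf-8")) < min_bytes:
        return True
    return not any(m in html for m in markers)
```

Pages flagged by this check go to the (expensive) headless-browser pool; everything else stays on the cheap static path.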
The Parser#
The parser extracts two things: content and links.
Content Extraction#
- Strip boilerplate (nav, footer, ads) using readability algorithms or DOM-based heuristics.
- Extract structured data (JSON-LD, microdata, Open Graph tags).
- Detect content type and language.
- Compute a content fingerprint (SimHash or MinHash) for near-duplicate detection.
Link Extraction#
- Resolve relative URLs to absolute.
- Normalize URLs: lowercase the scheme and host, remove default ports, sort query parameters, strip fragments.
- Filter out non-HTML resources (images, PDFs) unless explicitly wanted.
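A Python sketch of these normalization rules using the standard-library urllib.parse:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin

def normalize_url(base: str, href: str) -> str:
    """Resolve a (possibly relative) href against its page URL, then
    lowercase scheme/host, drop default ports, sort query params, and
    strip the fragment."""
    absolute = urljoin(base, href)
    parts = urlsplit(absolute)
    host = parts.hostname or ""  # .hostname is already lowercased
    port = parts.port
    # Keep the port only if it is non-default for the scheme.
    if port and not ((parts.scheme == "http" and port == 80) or
                     (parts.scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme.lower(), host, parts.path, query, ""))
```

Canonicalizing before deduplication is what makes "same page, different URL spelling" collapse to a single frontier entry.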
URL Deduplication with Bloom Filters#
A crawler at scale encounters the same URL millions of times. Storing every seen URL in a hash set consumes enormous memory.
Bloom Filters#
A Bloom filter is a space-efficient probabilistic data structure that answers: "Have I seen this URL before?"
- False positives — occasionally says "yes" when the answer is "no" (causes a missed crawl — acceptable).
- False negatives — never happen (no duplicate crawls).
A Bloom filter sized for 10 billion entries at a 1% false positive rate requires roughly 12 GB of memory (about 9.6 bits per entry), far less than storing the URLs themselves.
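A minimal Bloom filter sketch in Python, sized with the standard formula m = -n·ln(p)/(ln 2)²; double hashing is used to derive the k probe positions (a common trick, not the only option):

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch for URL deduplication."""

    def __init__(self, n_items: int, fp_rate: float):
        # Optimal bit-array size and hash count for the target FP rate.
        self.m = max(1, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: k positions from two 64-bit base hashes.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

Note the asymmetry the section describes: membership tests can wrongly say "seen" (skipping a URL), but never wrongly say "unseen".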
Partitioned Bloom Filters#
Shard the Bloom filter across crawler nodes by hashing the URL and routing to the responsible shard. Each shard maintains its own filter.
Distributed Crawling#
A single machine can fetch roughly 100–200 pages per second. To crawl billions of pages, distribute the work.
Architecture#
Coordinator
├── Crawler Node 1 (hosts a–f)
├── Crawler Node 2 (hosts g–m)
├── Crawler Node 3 (hosts n–s)
└── Crawler Node 4 (hosts t–z)
Host-Based Partitioning#
Assign each crawler node a range of hostnames (by consistent hashing). This ensures:
- Only one node crawls a given host — no duplicate politeness tracking.
- Per-host state (robots.txt cache, crawl delay timers) lives on one machine.
- Rebalancing on node failure uses the consistent hash ring.
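A sketch of the hash ring in Python; the virtual-node count of 100 is an assumption, and the node names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of host -> crawler-node assignment via consistent hashing
    with virtual nodes for smoother balance."""

    def __init__(self, nodes, replicas=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(replicas):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, host: str) -> str:
        # Clockwise successor on the ring, wrapping at the end.
        idx = bisect.bisect(self.keys, self._hash(host)) % len(self.keys)
        return self.ring[idx][1]
```

The payoff is the failure property in the list above: removing a node only remaps the hosts that lived on it, so the surviving nodes keep their robots.txt caches and delay timers intact.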
Coordination#
Use a distributed message queue (Kafka) for URL distribution. Discovered URLs are published to a topic, partitioned by host hash. Each crawler node consumes its assigned partitions.
Priority Scheduling#
Static Priority#
Assign scores at URL discovery time based on domain authority, path depth, and content type.
Dynamic Priority#
Adjust priority based on crawl history:
- Pages that changed since last crawl get higher priority.
- Pages that returned errors get deprioritized.
- High-engagement pages (from analytics data) get boosted.
Freshness Scheduling#
Estimate each page's change frequency. Use a Poisson process model:
next_crawl = last_crawl + 1 / estimated_change_rate
Pages that change daily get recrawled daily. Pages that never change get recrawled monthly.
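A sketch of the scheduling rule in Python; the monthly floor on the change rate is an assumption matching the "recrawled monthly" fallback above:

```python
def estimate_change_rate(changes_observed: int, crawls: int,
                         interval_days: float) -> float:
    """Naive estimator sketch: fraction of crawls that found a change,
    converted to changes per day, floored so pages are never abandoned."""
    rate = changes_observed / max(crawls * interval_days, 1e-9)
    return max(rate, 1 / 30)  # recrawl at least monthly

def next_crawl_day(last_crawl_day: float, change_rate: float) -> float:
    # For a Poisson process, the expected gap between changes is 1/rate.
    return last_crawl_day + 1 / change_rate
```

A page that changed on 9 of 10 daily crawls gets rechecked roughly daily; a page that never changed across a year of monthly crawls bottoms out at the monthly floor.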
Content Storage#
Raw HTML Store#
Store raw HTML in an object store (S3) keyed by URL hash + crawl timestamp. This provides a historical archive and allows reprocessing.
Extracted Content Store#
Store parsed content in a columnar database (BigQuery, ClickHouse) for analytics, or in Elasticsearch for full-text search.
Metadata Store#
A key-value store (DynamoDB, Cassandra) tracks per-URL metadata: last crawl time, HTTP status, content hash, change frequency estimate.
Scale Estimation#
Target: 1 Billion Pages Per Month#
| Metric | Value |
|---|---|
| Pages per month | 1,000,000,000 |
| Pages per second | ~385 |
| Avg page size | 500 KB |
| Bandwidth | ~190 MB/s |
| Storage per month | ~500 TB |
| Crawler nodes (200 pages/s each) | ~2 |
At 10 billion pages per month, scale to 20 nodes and proportionally more storage and bandwidth.
DNS Resolution#
At 385 fetches per second, DNS lookups become a bottleneck. Run a local DNS cache (e.g., dnsmasq) on each crawler node and pre-resolve hostnames in batches.
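An in-process cache in front of the resolver can be sketched as follows in Python; the 300-second TTL is an assumption, and the resolver and clock are injectable for testing:

```python
import socket
import time

class CachingResolver:
    """Sketch of an in-process DNS cache with TTL expiry. In production
    this would sit in front of a local dnsmasq instance."""

    def __init__(self, ttl=300.0, resolve=None, clock=time.monotonic):
        self.ttl = ttl
        self.resolve = resolve or (lambda host: socket.gethostbyname(host))
        self.clock = clock
        self.cache = {}  # host -> (ip, expires_at)

    def lookup(self, host):
        entry = self.cache.get(host)
        now = self.clock()
        if entry and now < entry[1]:
            return entry[0]  # cache hit, not yet expired
        ip = self.resolve(host)
        self.cache[host] = (ip, now + self.ttl)
        return ip
```

Because host-based partitioning sends all URLs for a host to one node, this cache's hit rate is very high: each node resolves its assigned hosts once per TTL window.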
Failure Handling#
- HTTP errors — most 4xx errors are permanent (remove the URL), though 429 Too Many Requests is transient and warrants backoff; 5xx errors are transient (retry with backoff).
- Timeouts — retry once, then deprioritize.
- Spider traps — detect infinite URL patterns (e.g., calendar pages generating endless dates) by limiting crawl depth per host and detecting repetitive URL structures.
- Node failure — the consistent hash ring reassigns partitions to surviving nodes.
Key Takeaways#
- The URL frontier is the brain of the crawler — it manages priority, politeness, and freshness.
- Bloom filters provide memory-efficient URL deduplication at scale.
- Host-based partitioning simplifies distributed crawling by isolating per-host state.
- JavaScript rendering is expensive — use it selectively.
- Scale estimation grounds your design in reality and impresses interviewers.
A web crawler touches nearly every systems concept: networking, storage, distributed coordination, data structures, and scheduling. Mastering its design prepares you for a wide range of system design interviews.
Build and deploy full-stack projects with guided system design exercises at codelit.io.
This is article #184 in the Codelit engineering blog series.