Web Crawler Architecture: Designing a Scalable Internet Crawler
Search engines, price comparison tools, and AI training pipelines all depend on web crawlers. Designing a crawler that can fetch billions of pages while respecting site policies and handling failures is a foundational system design problem. This guide walks through the complete architecture — from URL frontier to distributed crawling at scale.
High-Level Architecture#
A web crawler consists of four core components operating in a loop:
URL Frontier → Fetcher → Parser → Content Store
     ↑                               │
     └─────────── New URLs ──────────┘
- URL Frontier — a priority queue of URLs to crawl.
- Fetcher — downloads pages via HTTP.
- Parser — extracts content and discovers new URLs.
- Content Store — persists raw HTML and extracted data.
The parser feeds discovered URLs back into the frontier, creating the crawl loop.
The URL Frontier#
The frontier is far more than a simple queue. It enforces politeness, priority, and freshness.
Priority Queue#
Not all URLs are equally important. Assign priority based on:
- PageRank or domain authority — crawl high-value sites first.
- Update frequency — news sites change hourly; corporate about pages change yearly.
- Depth — pages closer to the root tend to be more important.
Use a multi-level priority queue where each level drains at a different rate.
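As a sketch in Python (the level weights are illustrative assumptions), a multi-level frontier can drain its levels by weighted random selection, so every level makes progress but higher-priority levels drain faster:

```python
import random
from collections import deque

class MultiLevelFrontier:
    """Sketch of a multi-level priority queue. Level 0 is highest
    priority; each level is drained in proportion to its weight."""

    def __init__(self, weights=(0.6, 0.3, 0.1)):
        self.levels = [deque() for _ in weights]
        self.weights = weights

    def push(self, url, level):
        self.levels[level].append(url)

    def pop(self):
        # Consider only non-empty levels, weighted by their drain rate.
        candidates = [(q, w) for q, w in zip(self.levels, self.weights) if q]
        if not candidates:
            return None
        queues, weights = zip(*candidates)
        chosen = random.choices(queues, weights=weights)[0]
        return chosen.popleft()
```

Because selection is probabilistic rather than strict, low-priority levels cannot be starved indefinitely.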
Politeness Enforcement#
A single-threaded per-host queue ensures the crawler never hammers one server. The frontier maps each host to its own FIFO queue and enforces a minimum delay between requests to the same host.
frontier/
  priority_queue → host_router → per_host_queues
                                  ├── example.com  [url1, url2]
                                  ├── news.org     [url3]
                                  └── shop.io      [url4, url5]
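A minimal sketch of the per-host politeness logic in Python; the one-second default delay is an assumption (a real crawler would honor each host's Crawl-delay where present):

```python
import time
from collections import defaultdict, deque

class PoliteHostQueues:
    """Sketch: one FIFO queue per host, plus a minimum delay between
    consecutive requests to the same host."""

    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay
        self.queues = defaultdict(deque)        # host -> FIFO of URLs
        self.next_allowed = defaultdict(float)  # host -> earliest fetch time

    def enqueue(self, host, url):
        self.queues[host].append(url)

    def next_url(self, now=None):
        """Return (host, url) for some host whose delay has elapsed, else None."""
        now = time.monotonic() if now is None else now
        for host, q in self.queues.items():
            if q and now >= self.next_allowed[host]:
                self.next_allowed[host] = now + self.min_delay
                return host, q.popleft()
        return None
```

The clock is injectable via `now` so the scheduling behavior can be tested without sleeping.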
Back Queue vs Front Queue#
A common design splits the frontier into:
- Front queues — multiple queues bucketed by priority.
- Back queues — one queue per host, enforcing politeness.
A selector pulls from front queues by priority, then routes to the appropriate back queue.
Respecting robots.txt#
Before crawling any host, fetch and cache its /robots.txt. This file specifies:
- Which paths are disallowed for your user-agent.
- A Crawl-delay directive (seconds between requests).
- Sitemap locations for discovery.
Cache robots.txt with a TTL of 24 hours. If the file is unavailable (HTTP 5xx), back off and retry. If it returns 404, assume full access.
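Python's standard-library urllib.robotparser handles the parsing and matching; the robots.txt body below is illustrative:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body we have already fetched and cached.
robots_body = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_body.splitlines())

rp.can_fetch("CodelitBot/1.0", "https://example.com/page")       # allowed
rp.can_fetch("CodelitBot/1.0", "https://example.com/private/x")  # disallowed
rp.crawl_delay("CodelitBot/1.0")                                 # 2 seconds
```

Keeping the parsed RobotFileParser object in the per-host cache avoids re-parsing on every URL check.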
The Fetcher#
The fetcher downloads pages and handles the messy realities of the web.
HTTP Client Configuration#
- Timeout — 30 seconds for connect, 60 seconds for read.
- Retries — exponential backoff with jitter, max 3 attempts.
- User-Agent — identify your crawler honestly (e.g., CodelitBot/1.0).
- Redirect handling — follow up to 5 redirects, then abort.
- Compression — request gzip/brotli to reduce bandwidth.
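The retry policy above can be sketched as a transport-agnostic helper in Python; `fetch` is any callable that raises on transient failure, and the delay parameters are illustrative:

```python
import random
import time

def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0,
                       sleep=time.sleep):
    """Sketch of retries with exponential backoff and full jitter.
    `sleep` is injectable so the logic is testable without waiting."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            # Full jitter: sleep a random amount up to base * 2^attempt.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
```

Jitter matters at crawler scale: without it, thousands of workers retrying a flaky host back off in lockstep and hammer it again simultaneously.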
Handling JavaScript-Rendered Pages#
Modern SPAs render content client-side. A basic HTTP fetch returns an empty shell.
Option 1: Headless Browser Use Puppeteer or Playwright to render pages. Expensive — each render takes 2–5 seconds and significant memory. Reserve for high-value domains.
Option 2: Dynamic Rendering Service A service like Rendertron sits between the fetcher and the web. It detects JS-heavy pages and renders them on demand, caching the output.
Option 3: Hybrid Approach First attempt a static fetch. If the HTML body is suspiciously small (< 1KB) or lacks expected content markers, re-fetch with a headless browser.
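The hybrid heuristic might look like this in Python; the 1 KB threshold and the content markers are assumptions to tune per corpus:

```python
def needs_js_render(html: str, min_bytes: int = 1024,
                    markers=("<article", "<h1")) -> bool:
    """Heuristic sketch: flag a page for headless re-rendering when the
    static HTML is suspiciously small or lacks expected content markers."""
    if len(html.encode("utf-8")) < min_bytes:
        return True
    return not any(m in html for m in markers)
```

Pages flagged by this check go to the (expensive) headless-browser pool; everything else stays on the cheap static path.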
The Parser#
The parser extracts two things: content and links.
Content Extraction#
- Strip boilerplate (nav, footer, ads) using readability algorithms or DOM-based heuristics.
- Extract structured data (JSON-LD, microdata, Open Graph tags).
- Detect content type and language.
- Compute a content fingerprint (SimHash or MinHash) for near-duplicate detection.
Link Extraction#
- Resolve relative URLs to absolute.
- Normalize URLs: lowercase the scheme and host, remove default ports, sort query parameters, strip fragments.
- Filter out non-HTML resources (images, PDFs) unless explicitly wanted.
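A Python sketch of these normalization rules using the standard-library urllib.parse:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode, urljoin

def normalize_url(base: str, href: str) -> str:
    """Resolve a (possibly relative) href against its page URL, then
    lowercase scheme/host, drop default ports, sort query params, and
    strip the fragment."""
    absolute = urljoin(base, href)
    parts = urlsplit(absolute)
    host = parts.hostname or ""  # .hostname is already lowercased
    port = parts.port
    # Keep the port only if it is non-default for the scheme.
    if port and not ((parts.scheme == "http" and port == 80) or
                     (parts.scheme == "https" and port == 443)):
        host = f"{host}:{port}"
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme.lower(), host, parts.path, query, ""))
```

Canonicalizing before deduplication is what makes "same page, different URL spelling" collapse to a single frontier entry.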
URL Deduplication with Bloom Filters#
A crawler at scale encounters the same URL millions of times. Storing every seen URL in a hash set consumes enormous memory.
Bloom Filters#
A Bloom filter is a space-efficient probabilistic data structure that answers: "Have I seen this URL before?"
- False positives — occasionally says "yes" when the answer is "no" (causes a missed crawl — acceptable).
- False negatives — never happen (no duplicate crawls).
A Bloom filter sized for 10 billion entries at a 1% false positive rate requires roughly 12 GB of memory (about 9.6 bits per entry), far less than storing the URLs themselves.
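A minimal Bloom filter sketch in Python, sized with the standard formula m = -n·ln(p)/(ln 2)²; double hashing is used to derive the k probe positions (a common trick, not the only option):

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch for URL deduplication."""

    def __init__(self, n_items: int, fp_rate: float):
        # Optimal bit-array size and hash count for the target FP rate.
        self.m = max(1, int(-n_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: k positions from two 64-bit base hashes.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] >> (pos % 8) & 1
                   for pos in self._positions(item))
```

Note the asymmetry the section describes: membership tests can wrongly say "seen" (skipping a URL), but never wrongly say "unseen".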
Partitioned Bloom Filters#
Shard the Bloom filter across crawler nodes by hashing the URL and routing to the responsible shard. Each shard maintains its own filter.
Distributed Crawling#
A single machine can fetch roughly 100–200 pages per second. To crawl billions of pages, distribute the work.
Architecture#
Coordinator
├── Crawler Node 1 (hosts a–f)
├── Crawler Node 2 (hosts g–m)
├── Crawler Node 3 (hosts n–s)
└── Crawler Node 4 (hosts t–z)
Host-Based Partitioning#
Assign each crawler node a range of hostnames (by consistent hashing). This ensures:
- Only one node crawls a given host — no duplicate politeness tracking.
- Per-host state (robots.txt cache, crawl delay timers) lives on one machine.
- Rebalancing on node failure uses the consistent hash ring.
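A sketch of the hash ring in Python; the virtual-node count of 100 is an assumption, and the node names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Sketch of host -> crawler-node assignment via consistent hashing
    with virtual nodes for smoother balance."""

    def __init__(self, nodes, replicas=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(replicas):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, host: str) -> str:
        # Clockwise successor on the ring, wrapping at the end.
        idx = bisect.bisect(self.keys, self._hash(host)) % len(self.keys)
        return self.ring[idx][1]
```

The payoff is the failure property in the list above: removing a node only remaps the hosts that lived on it, so the surviving nodes keep their robots.txt caches and delay timers intact.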
Coordination#
Use a distributed message queue (Kafka) for URL distribution. Discovered URLs are published to a topic, partitioned by host hash. Each crawler node consumes its assigned partitions.
Priority Scheduling#
Static Priority#
Assign scores at URL discovery time based on domain authority, path depth, and content type.
Dynamic Priority#
Adjust priority based on crawl history:
- Pages that changed since last crawl get higher priority.
- Pages that returned errors get deprioritized.
- High-engagement pages (from analytics data) get boosted.
Freshness Scheduling#
Estimate each page's change frequency. Use a Poisson process model:
next_crawl = last_crawl + 1 / estimated_change_rate
Pages that change daily get recrawled daily. Pages that never change get recrawled monthly.
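A sketch of the scheduling rule in Python; the monthly floor on the change rate is an assumption matching the "recrawled monthly" fallback above:

```python
def estimate_change_rate(changes_observed: int, crawls: int,
                         interval_days: float) -> float:
    """Naive estimator sketch: fraction of crawls that found a change,
    converted to changes per day, floored so pages are never abandoned."""
    rate = changes_observed / max(crawls * interval_days, 1e-9)
    return max(rate, 1 / 30)  # recrawl at least monthly

def next_crawl_day(last_crawl_day: float, change_rate: float) -> float:
    # For a Poisson process, the expected gap between changes is 1/rate.
    return last_crawl_day + 1 / change_rate
```

A page that changed on 9 of 10 daily crawls gets rechecked roughly daily; a page that never changed across a year of monthly crawls bottoms out at the monthly floor.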
Content Storage#
Raw HTML Store#
Store raw HTML in an object store (S3) keyed by URL hash + crawl timestamp. This provides a historical archive and allows reprocessing.
Extracted Content Store#
Store parsed content in a columnar database (BigQuery, ClickHouse) for analytics, or in Elasticsearch for full-text search.
Metadata Store#
A key-value store (DynamoDB, Cassandra) tracks per-URL metadata: last crawl time, HTTP status, content hash, change frequency estimate.
Scale Estimation#
Target: 1 Billion Pages Per Month#
| Metric | Value |
|---|---|
| Pages per month | 1,000,000,000 |
| Pages per second | ~385 |
| Avg page size | 500 KB |
| Bandwidth | ~190 MB/s |
| Storage per month | ~500 TB |
| Crawler nodes (200 pages/s each) | ~2 |
At 10 billion pages per month, scale to 20 nodes and proportionally more storage and bandwidth.
DNS Resolution#
At 385 fetches per second, DNS lookups become a bottleneck. Run a local DNS cache (e.g., dnsmasq) on each crawler node and pre-resolve hostnames in batches.
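An in-process cache in front of the resolver can be sketched as follows in Python; the 300-second TTL is an assumption, and the resolver and clock are injectable for testing:

```python
import socket
import time

class CachingResolver:
    """Sketch of an in-process DNS cache with TTL expiry. In production
    this would sit in front of a local dnsmasq instance."""

    def __init__(self, ttl=300.0, resolve=None, clock=time.monotonic):
        self.ttl = ttl
        self.resolve = resolve or (lambda host: socket.gethostbyname(host))
        self.clock = clock
        self.cache = {}  # host -> (ip, expires_at)

    def lookup(self, host):
        entry = self.cache.get(host)
        now = self.clock()
        if entry and now < entry[1]:
            return entry[0]  # cache hit, not yet expired
        ip = self.resolve(host)
        self.cache[host] = (ip, now + self.ttl)
        return ip
```

Because host-based partitioning sends all URLs for a host to one node, this cache's hit rate is very high: each node resolves its assigned hosts once per TTL window.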
Failure Handling#
- HTTP errors — most 4xx errors are permanent (remove the URL), though 429 Too Many Requests is transient and warrants backoff; 5xx errors are transient (retry with backoff).
- Timeouts — retry once, then deprioritize.
- Spider traps — detect infinite URL patterns (e.g., calendar pages generating endless dates) by limiting crawl depth per host and detecting repetitive URL structures.
- Node failure — the consistent hash ring reassigns partitions to surviving nodes.
Key Takeaways#
- The URL frontier is the brain of the crawler — it manages priority, politeness, and freshness.
- Bloom filters provide memory-efficient URL deduplication at scale.
- Host-based partitioning simplifies distributed crawling by isolating per-host state.
- JavaScript rendering is expensive — use it selectively.
- Scale estimation grounds your design in reality and impresses interviewers.
A web crawler touches nearly every systems concept: networking, storage, distributed coordination, data structures, and scheduling. Mastering its design prepares you for a wide range of system design interviews.
Build and deploy full-stack projects with guided system design exercises at codelit.io.
This is article #184 in the Codelit engineering blog series.