Design a Web Crawler — Architecture, Politeness, and Scale
What makes crawling hard#
A web crawler sounds simple: fetch pages, extract links, repeat. But at scale, you face rate limiting, duplicate URLs, infinite loops, dynamic content, politeness rules, and petabytes of storage.
Google crawls billions of pages. Even a modest crawler for a company's internal tools needs careful architecture.
Core components#
URL Frontier#
A priority queue of URLs to crawl. Not a simple FIFO — priorities matter:
- Freshness priority — News sites get crawled every few minutes, static pages monthly
- Importance priority — High PageRank pages first
- Politeness queues — Separate queue per domain to avoid overwhelming servers
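These ideas can be combined into one structure: a heap ordered by each domain's next-allowed fetch time, plus a FIFO of URLs per domain. A minimal sketch (the default 1-second delay and the single priority field are illustrative simplifications):

```python
import heapq
import time
from collections import defaultdict, deque

class URLFrontier:
    """Sketch of a frontier with per-domain politeness queues.
    A real frontier would also track freshness and importance scores."""

    def __init__(self, crawl_delay=1.0):
        self.domain_queues = defaultdict(deque)  # domain -> FIFO of URLs
        self.ready_heap = []                     # (next_fetch_time, priority, domain)
        self.crawl_delay = crawl_delay

    def add(self, domain, url, priority=0):
        if not self.domain_queues[domain]:
            # Domain not scheduled yet: it is ready to fetch immediately
            heapq.heappush(self.ready_heap, (time.monotonic(), priority, domain))
        self.domain_queues[domain].append(url)

    def next_url(self):
        """Return a URL whose domain is past its politeness delay, or None."""
        while self.ready_heap:
            ready_at, priority, domain = self.ready_heap[0]
            if ready_at > time.monotonic():
                return None  # no domain is ready yet
            heapq.heappop(self.ready_heap)
            queue = self.domain_queues[domain]
            if not queue:
                continue  # stale heap entry
            url = queue.popleft()
            if queue:  # re-schedule the domain after its crawl delay
                heapq.heappush(self.ready_heap,
                               (time.monotonic() + self.crawl_delay,
                                priority, domain))
            return url
        return None
```

Because a domain re-enters the heap only after its delay elapses, no amount of queued URLs can make the crawler hammer a single server.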
Fetcher#
HTTP client that downloads pages. Key considerations:
- Robots.txt — Check and respect rules before crawling any domain
- Crawl delay — Honor the delay specified in robots.txt
- DNS resolution — Cache DNS lookups to avoid bottlenecks
- Timeout handling — Don't wait forever for slow servers
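The robots.txt check and a timeout-bounded download can be sketched with the standard library alone (the User-Agent string is a placeholder; a real fetcher caches one parsed ruleset and one DNS result per domain rather than refetching them):

```python
import urllib.request
import urllib.robotparser

def allowed(robots_txt, user_agent, url):
    """Check a URL against already-downloaded robots.txt text.
    In practice, cache one parsed ruleset per domain."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

def fetch(url, user_agent="MyCrawler/1.0 (+https://example.com/bot-info)",
          timeout=10):
    """Download a page with an identifying User-Agent and a hard timeout."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

`RobotFileParser` also exposes `crawl_delay()`, which feeds directly into the frontier's per-domain delay.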
Content Parser#
Extract useful data from downloaded pages:
- Link extraction — Find all URLs in the page (href, src, srcset)
- Content extraction — Strip HTML, extract text, metadata, images
- Duplicate detection — SimHash or MinHash to detect near-duplicate content
- Content type — Handle HTML, PDF, XML, JSON differently
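Link extraction with the standard-library HTML parser might look like this sketch; production crawlers usually prefer a fault-tolerant parser such as lxml, since real pages are rarely well-formed:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href/src/srcset attributes, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))
            elif name == "srcset" and value:
                # srcset holds comma-separated "url descriptor" candidates
                for candidate in value.split(","):
                    self.links.append(
                        urljoin(self.base_url, candidate.strip().split()[0]))

p = LinkExtractor("https://example.com/page")
p.feed('<a href="/about">About</a> <img src="logo.png">')
print(p.links)  # ['https://example.com/about', 'https://example.com/logo.png']
```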
URL Deduplication#
The same page can be reached via many URLs. Normalize and deduplicate:
https://example.com/page
https://example.com/page/
https://example.com/page?ref=twitter
https://www.example.com/page
→ All normalize to: https://example.com/page
Use a hash set for exact O(1) "have I seen this URL before?" checks, or a Bloom filter when memory is tight. A Bloom filter's small false-positive rate means a few unseen URLs get wrongly skipped, which is an acceptable trade-off for crawling.
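The normalization rules implied by the example above can be sketched like this (the tracking-parameter list is illustrative; real canonicalizers apply many more rules, such as resolving `.`/`..` segments and sorting query parameters):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list; real crawlers maintain a much larger one
TRACKING_PARAMS = {"ref", "utm_source", "utm_medium", "utm_campaign"}

def normalize(url):
    """Canonicalize a URL: lowercase scheme/host, strip 'www.',
    drop tracking parameters, remove the trailing slash, drop fragments."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = urlencode([(k, v) for k, v in parse_qsl(parts.query)
                       if k not in TRACKING_PARAMS])
    return urlunsplit((parts.scheme.lower(), host, path, query, ""))
```

All four example URLs above collapse to `https://example.com/page` under these rules.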
Architecture at scale#
URL Frontier (Redis/Kafka)
↓
Fetcher Workers (100+ parallel)
↓
Content Parser
↓
Link Extractor → URL Dedup → URL Frontier (loop)
↓
Content Store (S3/HDFS)
↓
Indexer
Distributed crawling#
Split work across machines:
- Partition by domain — Each worker handles specific domains (consistent hashing)
- Centralized frontier — Kafka topic or Redis queue shared by all workers
- Dedup service — Shared Bloom filter or distributed hash table
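Consistent hashing keeps most domain-to-worker assignments stable when a worker joins or leaves. A toy ring, assuming MD5 as the hash and 100 virtual nodes per worker (both arbitrary choices):

```python
import bisect
import hashlib

class DomainPartitioner:
    """Sketch of a consistent-hash ring: each worker owns several virtual
    points; a domain maps to the first point clockwise from its hash."""

    def __init__(self, workers, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{w}#{i}"), w)
            for w in workers for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def worker_for(self, domain):
        i = bisect.bisect(self.keys, self._hash(domain)) % len(self.ring)
        return self.ring[i][1]
```

Removing a worker only reassigns the domains that mapped to its points; everything else stays put, so in-flight politeness state does not reshuffle across the whole fleet.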
Politeness policies#
Critical. An impolite crawler gets blocked, rate-limited, or causes legal problems.
- Respect robots.txt — Always check before crawling
- Crawl delay — Wait between requests to same domain (1-10 seconds)
- Identify yourself — Set a descriptive User-Agent with contact info
- Back off on errors — Exponential backoff on 429/5xx responses
- Limit concurrent connections — Max 1-2 per domain at a time
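The back-off rule can be sketched as a retry loop with full jitter; the `fetch` callable returning a `(status, body)` pair is a hypothetical interface standing in for your HTTP client:

```python
import random
import time

def fetch_with_backoff(fetch, url, max_attempts=5, base=1.0, cap=60.0):
    """Retry on 429/5xx with full-jitter exponential backoff:
    sleep a random amount up to min(cap, base * 2**attempt)."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status == 200:
            return body
        if status == 429 or 500 <= status < 600:
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        else:
            return None  # other 4xx: retrying won't help
    return None  # give up; re-queue the URL for a later crawl cycle
```

Jitter matters: without it, many workers that got rate-limited together retry together and hit the server in synchronized waves.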
Handling traps#
Spider traps — Infinite URLs generated dynamically:
/calendar/2026/03/24
/calendar/2026/03/25
/calendar/2026/03/26
... infinite dates
Solutions:
- Maximum URL depth limit (e.g., 15 path segments)
- Maximum pages per domain per crawl
- URL pattern detection (detect repeating structures)
- Domain-level timeout
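The first two limits can be enforced before a URL ever enters the frontier; both thresholds below are the illustrative values from the list above:

```python
from urllib.parse import urlsplit

MAX_DEPTH = 15                   # path segments, per the suggestion above
MAX_PAGES_PER_DOMAIN = 100_000   # illustrative per-crawl cap

def looks_like_trap(url, pages_seen_for_domain):
    """Reject URLs that are too deep or belong to an exhausted domain."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    if pages_seen_for_domain >= MAX_PAGES_PER_DOMAIN:
        return True
    return False
```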
Storage estimation#
Crawling 1 billion pages:
- Average page: 100KB compressed
- Total storage: 100TB
- URLs to track: ~10 billion — roughly 10× as many URLs as fetched pages, because of duplicate URLs and links that are discovered but never crawled
- URL storage: ~500GB (50 bytes per URL hash)
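The arithmetic behind these numbers:

```python
pages = 1_000_000_000
avg_page_kb = 100        # compressed
urls = 10 * pages        # discovered URLs, per the estimate above
bytes_per_url = 50       # URL hash plus minimal metadata

content_tb = pages * avg_page_kb / 1e9  # KB -> TB
url_gb = urls * bytes_per_url / 1e9     # bytes -> GB
print(content_tb, url_gb)  # 100.0 TB of content, 500.0 GB of URL state
```

Note the asymmetry: page content dominates storage by two orders of magnitude, but the URL set is what must live in fast, random-access storage for deduplication.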
Visualize your crawler architecture#
See how the URL frontier, fetchers, parsers, and storage connect — try Codelit to generate an interactive diagram of a web crawler system.
Key takeaways#
- URL Frontier is the brain — priority queue with politeness partitioning
- Bloom filters for O(1) URL deduplication with minimal memory
- Robots.txt is not optional — always respect it
- Partition by domain for distributed crawling
- SimHash for near-duplicate detection — catch pages with minor differences
- Spider traps will break naive crawlers — depth limits are essential