Bloom Filters & Probabilistic Data Structures: Trading Precision for Speed
Sometimes you don't need the exact answer — you need a fast, space-efficient approximation. Probabilistic data structures trade perfect accuracy for dramatically lower memory and computation costs.
The Bloom Filter#
A bloom filter answers one question: "Is this element in the set?" with two possible responses:
- "Definitely not in the set" — always correct
- "Probably in the set" — might be wrong (false positive)
No false negatives. Ever.
How It Works#
A bloom filter is a bit array of size m with k independent hash functions:
Insert("hello"):
h1("hello") = 3 → set bit 3
h2("hello") = 7 → set bit 7
h3("hello") = 12 → set bit 12
Bit array: [0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0]  (bits 3, 7, 12 set)
Query("hello"):
Check bits 3, 7, 12 → all set → "probably yes"
Query("world"):
h1("world") = 3 → set ✓
h2("world") = 9 → NOT set ✗ → "definitely no"
If all k bits are set, the element is probably in the set. If any bit is unset, the element is definitely not in the set.
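The insert and query flow above fits in a few lines of Python. This is a minimal sketch, not a production implementation: the `BloomFilter` class is invented here, and salted SHA-256 digests stand in for k independent hash functions.

```python
# Minimal bloom filter sketch. Salted SHA-256 digests stand in for
# k independent hash functions (an assumption for illustration).
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity over compactness

    def _positions(self, item: str):
        # Derive k indices by salting the hash with the function index
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "probably yes"; any unset bit -> "definitely no"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=7)
bf.add("hello")
print(bf.might_contain("hello"))  # True: no false negatives
print(bf.might_contain("world"))  # False with near certainty at this load
```

Real implementations pack bits tightly and use fast non-cryptographic hashes (e.g. MurmurHash), often deriving all k indices from two hashes via double hashing.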
False Positive Rate#
The probability of a false positive after inserting n elements:
p ≈ (1 - e^(-kn/m))^k
Where:
m = bit array size
k = number of hash functions
n = number of inserted elements
Optimal number of hash functions: k = (m/n) * ln(2)
Practical guideline: 10 bits per element gives ~1% false positive rate with 7 hash functions.
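These formulas are easy to sanity-check numerically; the sketch below plugs in the 10-bits-per-element guideline:

```python
# Numerically check the false-positive formula and the optimal-k rule.
import math

def false_positive_rate(m: int, k: int, n: int) -> float:
    # p ≈ (1 - e^(-kn/m))^k
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> float:
    # k = (m/n) * ln(2)
    return (m / n) * math.log(2)

# 10 bits per element with 7 hash functions -> about 1% false positives
n = 1_000_000
m = 10 * n
print(round(optimal_k(m, n)))        # 7
print(false_positive_rate(m, 7, n))  # ≈ 0.0082
```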
Space Efficiency#
Comparing membership testing for 1 billion elements:
HashSet: ~40 GB (pointers + objects)
Bloom filter: ~1.2 GB (10 bits/element, 1% FPR)
Bloom filter: ~2.4 GB (20 bits/element, 0.01% FPR)
That's a 16-33x reduction in memory.
Limitations#
- No deletion — unsetting a bit might affect other elements
- No enumeration — you can't list what's in the filter
- Fixed capacity — FPR degrades as you add more elements than planned
- No counting — you can't ask "how many times was X inserted?"
Counting Bloom Filters#
Replace each bit with a small counter (typically 4 bits):
Insert("hello"):
counter[3] += 1 → counter[3] = 1
counter[7] += 1 → counter[7] = 1
counter[12] += 1 → counter[12] = 1
Delete("hello"):
counter[3] -= 1 → counter[3] = 0
counter[7] -= 1 → counter[7] = 0
counter[12] -= 1 → counter[12] = 0
Trade-off: 4x more memory than a standard bloom filter, but supports deletions. Counter overflow (beyond 15 with 4-bit counters) is rare in practice.
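A counting filter needs only two changes to a plain bloom filter: counters instead of bits, and a decrement path. The class below is an illustrative sketch under the same salted-hash assumption; note that `remove()` is only safe for items that were actually inserted.

```python
# Counting bloom filter sketch: counters instead of bits enable deletion.
# Counters here are unbounded Python ints; a real implementation would
# use saturating 4-bit counters to keep memory at ~4x a plain filter.
import hashlib

class CountingBloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: str):
        # Only safe for items that were actually inserted
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=1024, k=7)
cbf.add("hello")
cbf.add("world")
cbf.remove("hello")
print(cbf.might_contain("hello"))  # False: deleted
print(cbf.might_contain("world"))  # True: unaffected by the deletion
```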
Cuckoo Filters#
A modern alternative that stores fingerprints instead of setting bits:
Insert("hello"):
fingerprint = fp("hello") = 0xA3
bucket1 = h1("hello") = 5
bucket2 = bucket1 XOR h(fingerprint) = 11
Store 0xA3 in bucket 5 (or 11 if 5 is full)
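The key mechanism is partial-key cuckoo hashing: because the second bucket is derived by XOR-ing with a hash of the fingerprint, either bucket can be recomputed from the other plus the stored fingerprint alone, which is what makes relocating (and deleting) entries possible without the original key. A sketch of just this bucket arithmetic, with hash choices assumed for illustration:

```python
# Partial-key cuckoo hashing: each item's two candidate buckets can be
# derived from one another using only the fingerprint, never the key.
import hashlib

NUM_BUCKETS = 16  # power of two, so XOR results stay in range

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def fingerprint(item: str) -> int:
    return _h(item.encode()) % 255 + 1  # 1..255; 0 is reserved for "empty slot"

def buckets(item: str) -> tuple[int, int]:
    fp = fingerprint(item)
    b1 = _h(b"bucket:" + item.encode()) % NUM_BUCKETS
    b2 = (b1 ^ _h(fp.to_bytes(1, "big"))) % NUM_BUCKETS
    return b1, b2

b1, b2 = buckets("hello")
fp = fingerprint("hello")
# The XOR trick is symmetric: from either bucket plus the fingerprint,
# we recover the other bucket. This is what makes eviction possible.
alt_of_b1 = (b1 ^ _h(fp.to_bytes(1, "big"))) % NUM_BUCKETS
alt_of_b2 = (b2 ^ _h(fp.to_bytes(1, "big"))) % NUM_BUCKETS
print(alt_of_b1 == b2 and alt_of_b2 == b1)  # True
```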
Advantages Over Bloom Filters#
- Supports deletion without the overhead of counting filters
- Better space efficiency at low FPR (below ~3%)
- Faster lookups — only 2 memory accesses vs k for bloom filters
- Cache-friendly — data locality within buckets
When to Prefer Bloom Filters#
- Higher false positive rates are acceptable (above ~3%)
- Simpler implementation needed
- No deletion required
HyperLogLog (HLL)#
Estimates cardinality — the number of distinct elements in a set — using remarkably little memory.
The Intuition#
Hash each element and look at leading zeros in the binary representation:
hash("user_1") = 0001... → 3 leading zeros
hash("user_2") = 1010... → 0 leading zeros
hash("user_42") = 0000001... → 6 leading zeros
Max leading zeros observed: 6
Estimated cardinality ≈ 2^6 = 64
More elements → higher chance of seeing more leading zeros.
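A quick experiment shows why this works, and also why a single estimator is not enough on its own:

```python
# The maximum run of leading zeros among n distinct hashes grows
# roughly like log2(n), which is the whole trick behind HLL.
import hashlib

def leading_zeros(item: str, bits: int = 32) -> int:
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")
    return bits - h.bit_length()  # zeros before the first 1 bit

max_lz = max(leading_zeros(f"user_{i}") for i in range(100_000))
print(max_lz, 2 ** max_lz)
# log2(100_000) is about 16.6, so max_lz lands in that neighborhood,
# but a single estimator has huge variance: one lucky hash can double
# or quadruple the estimate. HLL tames this by averaging many registers.
```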
Practical HyperLogLog#
Uses 2^p registers (typically p=14, so 16,384 registers), each storing the maximum leading zeros seen:
Memory: 16,384 registers × 6 bits = ~12 KB
Accuracy: ±0.81% standard error
Capacity: billions of unique elements
12 KB to count billions of unique items with sub-1% error. Used by Redis (PFADD/PFCOUNT), BigQuery, and most analytics platforms.
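The register scheme can be sketched as below. This uses only the raw HLL estimator; production implementations such as Redis add small-range and large-range bias corrections on top.

```python
# Simplified HyperLogLog: 2^p registers, raw estimator only.
import hashlib

P = 14
M = 1 << P  # 16,384 registers

def _hash64(item: str) -> int:
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

registers = [0] * M
for i in range(100_000):
    h = _hash64(f"user_{i}")
    idx = h >> (64 - P)                      # first p bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 50 bits
    rho = (64 - P) - rest.bit_length() + 1   # position of the first 1 bit
    registers[idx] = max(registers[idx], rho)

# Raw HLL estimate: bias-corrected harmonic mean of 2^register values
alpha = 0.7213 / (1 + 1.079 / M)
estimate = alpha * M * M / sum(2.0 ** -r for r in registers)
print(round(estimate))  # close to 100,000 (expected error ~0.81%)
```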
Count-Min Sketch#
Estimates frequency — how many times each element has been seen:
Structure: d rows × w columns (d hash functions)
Increment("hello"):
row 0: h0("hello") = 3 → table[0][3] += 1
row 1: h1("hello") = 7 → table[1][7] += 1
row 2: h2("hello") = 1 → table[2][1] += 1
Query("hello"):
return min(table[0][3], table[1][7], table[2][1])
The minimum across rows is the estimate. It never underestimates but may overestimate due to hash collisions.
Parameters#
Width w = ⌈e/ε⌉ (e = Euler's number, ε = error tolerance)
Depth d = ⌈ln(1/δ)⌉ (δ = failure probability)
Example: ε = 0.01, δ = 0.01
w = 272 columns, d = 5 rows
Memory: 272 × 5 × 4 bytes ≈ 5.4 KB
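Both the structure and the parameter formulas fit in a short sketch; as before, salted SHA-256 stands in for the per-row hash functions.

```python
# Count-min sketch: d rows of w counters; a query returns the row-wise minimum.
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon: float, delta: float):
        self.w = math.ceil(math.e / epsilon)     # width from error tolerance
        self.d = math.ceil(math.log(1 / delta))  # depth from failure probability
        self.table = [[0] * self.w for _ in range(self.d)]

    def _columns(self, item: str):
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def increment(self, item: str, count: int = 1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Never underestimates; collisions can only inflate counts
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

cms = CountMinSketch(epsilon=0.01, delta=0.01)
print(cms.w, cms.d)  # 272 5 (matches the example above)
for _ in range(3):
    cms.increment("hello")
print(cms.estimate("hello"))  # 3 (exact here; estimates never undercount)
```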
Real-World Use Cases#
Cache Filtering#
CDN servers use bloom filters to decide what to cache:
Request arrives for resource X
→ bloom_filter.contains(X)?
→ NO: add X to filter (first request, don't cache)
→ YES: cache X (second+ request, worth caching)
Akamai reported that this approach avoids caching one-hit wonders (objects requested exactly once), which constitute roughly 75% of objects.
Spam and Malware Detection#
known_malicious_urls = BloomFilter(capacity=10_billion, fpr=0.001)
check_url(url):
    if not known_malicious_urls.contains(url):
        return SAFE             # definitely not malicious
    else:
        return full_check(url)  # might be malicious, do the expensive check
Chrome's Safe Browsing uses this pattern — a local bloom filter avoids sending every URL to Google's servers.
Database Query Optimization#
LSM-tree databases (LevelDB, RocksDB, Cassandra) attach a bloom filter to each SSTable:
Query key K:
    For each SSTable (newest to oldest):
        if sstable.bloom_filter.might_contain(K):
            search this SSTable  # expensive disk read
        else:
            skip this SSTable    # free, no I/O
This avoids reading SSTables that definitely don't contain the key — a massive performance win for read-heavy workloads.
Stream Deduplication#
seen = BloomFilter(capacity=100_million, fpr=0.01)
process_event(event):
    if seen.contains(event.id):
        return  # likely duplicate, skip
    seen.add(event.id)
    handle(event)
At 1% FPR, you'll skip ~1% of unique events but guarantee no duplicate processing. For many use cases (analytics, logging), this trade-off is excellent.
Choosing the Right Structure#
| Structure | Answers | Error Type | Memory |
|---|---|---|---|
| Bloom filter | "Is X in set?" | False positives only | ~10 bits/element |
| Counting bloom | "Is X in set?" + delete | False positives only | ~40 bits/element |
| Cuckoo filter | "Is X in set?" + delete | False positives only | ~12 bits/element |
| HyperLogLog | "How many distinct?" | ±0.81% | ~12 KB fixed |
| Count-Min Sketch | "How often X?" | Overestimates only | Configurable |
Key Takeaways#
- Bloom filters provide set membership with zero false negatives and tunable false positives
- 10 bits per element gives ~1% FPR — sufficient for most applications
- Cuckoo filters beat bloom filters when you need deletion or low FPR
- HyperLogLog counts billions of unique items in 12 KB
- Count-Min Sketch tracks frequencies in bounded space
- These structures are everywhere: caches, databases, networking, analytics
This is article #226 in the Codelit system design series. Explore all articles at codelit.io.