Bloom Filters & Probabilistic Data Structures: Trading Precision for Speed
Sometimes you don't need the exact answer — you need a fast, space-efficient approximation. Probabilistic data structures trade perfect accuracy for dramatically lower memory and computation costs.
The Bloom Filter#
A bloom filter answers one question: "Is this element in the set?" with two possible responses:
- "Definitely not in the set" — always correct
- "Probably in the set" — might be wrong (false positive)
No false negatives. Ever.
How It Works#
A bloom filter is a bit array of size m with k independent hash functions:
Insert("hello"):
h1("hello") = 3 → set bit 3
h2("hello") = 7 → set bit 7
h3("hello") = 12 → set bit 12
Bit array: [0,0,0,1,0,0,0,1,0,0,0,0,1,0,0,0]  (bits 3, 7, 12 set)
Query("hello"):
Check bits 3, 7, 12 → all set → "probably yes"
Query("world"):
h1("world") = 3 → set ✓
h2("world") = 9 → NOT set ✗ → "definitely no"
If all k bits are set, the element is probably in the set. If any bit is unset, the element is definitely not in the set.
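The insert and query flow above fits in a few lines of Python. This is a minimal sketch, not a production implementation: the `BloomFilter` class is invented here, and salted SHA-256 digests stand in for k independent hash functions.

```python
# Minimal bloom filter sketch. Salted SHA-256 digests stand in for
# k independent hash functions (an assumption for illustration).
import hashlib

class BloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity over compactness

    def _positions(self, item: str):
        # Derive k indices by salting the hash with the function index
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # All k bits set -> "probably yes"; any unset bit -> "definitely no"
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1024, k=7)
bf.add("hello")
print(bf.might_contain("hello"))  # True: no false negatives
print(bf.might_contain("world"))  # False with near certainty at this load
```

Real implementations pack bits tightly and use fast non-cryptographic hashes (e.g. MurmurHash), often deriving all k indices from two hashes via double hashing.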
False Positive Rate#
The probability of a false positive after inserting n elements:
p ≈ (1 - e^(-kn/m))^k
Where:
m = bit array size
k = number of hash functions
n = number of inserted elements
Optimal number of hash functions: k = (m/n) * ln(2)
Practical guideline: 10 bits per element gives ~1% false positive rate with 7 hash functions.
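These formulas are easy to sanity-check numerically; the sketch below plugs in the 10-bits-per-element guideline:

```python
# Numerically check the false-positive formula and the optimal-k rule.
import math

def false_positive_rate(m: int, k: int, n: int) -> float:
    # p ≈ (1 - e^(-kn/m))^k
    return (1 - math.exp(-k * n / m)) ** k

def optimal_k(m: int, n: int) -> float:
    # k = (m/n) * ln(2)
    return (m / n) * math.log(2)

# 10 bits per element with 7 hash functions -> about 1% false positives
n = 1_000_000
m = 10 * n
print(round(optimal_k(m, n)))        # 7
print(false_positive_rate(m, 7, n))  # ≈ 0.0082
```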
Space Efficiency#
Comparing membership testing for 1 billion elements:
HashSet: ~40 GB (pointers + objects)
Bloom filter: ~1.2 GB (10 bits/element, 1% FPR)
Bloom filter: ~2.4 GB (20 bits/element, 0.01% FPR)
That's a 16-33x reduction in memory.
Limitations#
- No deletion — unsetting a bit might affect other elements
- No enumeration — you can't list what's in the filter
- Fixed capacity — FPR degrades as you add more elements than planned
- No counting — you can't ask "how many times was X inserted?"
Counting Bloom Filters#
Replace each bit with a small counter (typically 4 bits):
Insert("hello"):
counter[3] += 1 → counter[3] = 1
counter[7] += 1 → counter[7] = 1
counter[12] += 1 → counter[12] = 1
Delete("hello"):
counter[3] -= 1 → counter[3] = 0
counter[7] -= 1 → counter[7] = 0
counter[12] -= 1 → counter[12] = 0
Trade-off: 4x more memory than a standard bloom filter, but supports deletions. Counter overflow (beyond 15 with 4-bit counters) is rare in practice.
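A counting filter needs only two changes to a plain bloom filter: counters instead of bits, and a decrement path. The class below is an illustrative sketch under the same salted-hash assumption; note that `remove()` is only safe for items that were actually inserted.

```python
# Counting bloom filter sketch: counters instead of bits enable deletion.
# Counters here are unbounded Python ints; a real implementation would
# use saturating 4-bit counters to keep memory at ~4x a plain filter.
import hashlib

class CountingBloomFilter:
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item: str):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item: str):
        # Only safe for items that were actually inserted
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))

cbf = CountingBloomFilter(m=1024, k=7)
cbf.add("hello")
cbf.add("world")
cbf.remove("hello")
print(cbf.might_contain("hello"))  # False: deleted
print(cbf.might_contain("world"))  # True: unaffected by the deletion
```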
Cuckoo Filters#
A modern alternative that stores fingerprints instead of setting bits:
Insert("hello"):
fingerprint = fp("hello") = 0xA3
bucket1 = h1("hello") = 5
bucket2 = bucket1 XOR h(fingerprint) = 11
Store 0xA3 in bucket 5 (or 11 if 5 is full)
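The key mechanism is partial-key cuckoo hashing: because the second bucket is derived by XOR-ing with a hash of the fingerprint, either bucket can be recomputed from the other plus the stored fingerprint alone, which is what makes relocating (and deleting) entries possible without the original key. A sketch of just this bucket arithmetic, with hash choices assumed for illustration:

```python
# Partial-key cuckoo hashing: each item's two candidate buckets can be
# derived from one another using only the fingerprint, never the key.
import hashlib

NUM_BUCKETS = 16  # power of two, so XOR results stay in range

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def fingerprint(item: str) -> int:
    return _h(item.encode()) % 255 + 1  # 1..255; 0 is reserved for "empty slot"

def buckets(item: str) -> tuple[int, int]:
    fp = fingerprint(item)
    b1 = _h(b"bucket:" + item.encode()) % NUM_BUCKETS
    b2 = (b1 ^ _h(fp.to_bytes(1, "big"))) % NUM_BUCKETS
    return b1, b2

b1, b2 = buckets("hello")
fp = fingerprint("hello")
# The XOR trick is symmetric: from either bucket plus the fingerprint,
# we recover the other bucket. This is what makes eviction possible.
alt_of_b1 = (b1 ^ _h(fp.to_bytes(1, "big"))) % NUM_BUCKETS
alt_of_b2 = (b2 ^ _h(fp.to_bytes(1, "big"))) % NUM_BUCKETS
print(alt_of_b1 == b2 and alt_of_b2 == b1)  # True
```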
Advantages Over Bloom Filters#
- Supports deletion without the overhead of counting filters
- Better space efficiency at low FPR (below ~3%)
- Faster lookups — only 2 memory accesses vs k for bloom filters
- Cache-friendly — data locality within buckets
When to Prefer Bloom Filters#
- Higher false positive rates are acceptable (above ~3%)
- Simpler implementation needed
- No deletion required
HyperLogLog (HLL)#
Estimates cardinality — the number of distinct elements in a set — using remarkably little memory.
The Intuition#
Hash each element and look at leading zeros in the binary representation:
hash("user_1") = 0001... → 3 leading zeros
hash("user_2") = 1010... → 0 leading zeros
hash("user_42") = 0000001... → 6 leading zeros
Max leading zeros observed: 6
Estimated cardinality ≈ 2^6 = 64
More elements → higher chance of seeing more leading zeros.
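A quick experiment shows why this works, and also why a single estimator is not enough on its own:

```python
# The maximum run of leading zeros among n distinct hashes grows
# roughly like log2(n), which is the whole trick behind HLL.
import hashlib

def leading_zeros(item: str, bits: int = 32) -> int:
    h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")
    return bits - h.bit_length()  # zeros before the first 1 bit

max_lz = max(leading_zeros(f"user_{i}") for i in range(100_000))
print(max_lz, 2 ** max_lz)
# log2(100_000) is about 16.6, so max_lz lands in that neighborhood,
# but a single estimator has huge variance: one lucky hash can double
# or quadruple the estimate. HLL tames this by averaging many registers.
```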
Practical HyperLogLog#
Uses 2^p registers (typically p=14, so 16,384 registers), each storing the maximum leading zeros seen:
Memory: 16,384 registers × 6 bits = ~12 KB
Accuracy: ±0.81% standard error
Capacity: billions of unique elements
12 KB to count billions of unique items with sub-1% error. Used by Redis (PFADD/PFCOUNT), BigQuery, and most analytics platforms.
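The register scheme can be sketched as below. This uses only the raw HLL estimator; production implementations such as Redis add small-range and large-range bias corrections on top.

```python
# Simplified HyperLogLog: 2^p registers, raw estimator only.
import hashlib

P = 14
M = 1 << P  # 16,384 registers

def _hash64(item: str) -> int:
    return int.from_bytes(hashlib.sha256(item.encode()).digest()[:8], "big")

registers = [0] * M
for i in range(100_000):
    h = _hash64(f"user_{i}")
    idx = h >> (64 - P)                      # first p bits pick the register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 50 bits
    rho = (64 - P) - rest.bit_length() + 1   # position of the first 1 bit
    registers[idx] = max(registers[idx], rho)

# Raw HLL estimate: bias-corrected harmonic mean of 2^register values
alpha = 0.7213 / (1 + 1.079 / M)
estimate = alpha * M * M / sum(2.0 ** -r for r in registers)
print(round(estimate))  # close to 100,000 (expected error ~0.81%)
```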
Count-Min Sketch#
Estimates frequency — how many times each element has been seen:
Structure: d rows × w columns (d hash functions)
Increment("hello"):
row 0: h0("hello") = 3 → table[0][3] += 1
row 1: h1("hello") = 7 → table[1][7] += 1
row 2: h2("hello") = 1 → table[2][1] += 1
Query("hello"):
return min(table[0][3], table[1][7], table[2][1])
The minimum across rows is the estimate. It never underestimates but may overestimate due to hash collisions.
Parameters#
Width w = ⌈e/ε⌉ (e = Euler's number, ε = error tolerance)
Depth d = ⌈ln(1/δ)⌉ (δ = failure probability)
Example: ε = 0.01, δ = 0.01
w = 272 columns, d = 5 rows
Memory: 272 × 5 × 4 bytes ≈ 5.4 KB
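Both the structure and the parameter formulas fit in a short sketch; as before, salted SHA-256 stands in for the per-row hash functions.

```python
# Count-min sketch: d rows of w counters; a query returns the row-wise minimum.
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon: float, delta: float):
        self.w = math.ceil(math.e / epsilon)     # width from error tolerance
        self.d = math.ceil(math.log(1 / delta))  # depth from failure probability
        self.table = [[0] * self.w for _ in range(self.d)]

    def _columns(self, item: str):
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def increment(self, item: str, count: int = 1):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += count

    def estimate(self, item: str) -> int:
        # Never underestimates; collisions can only inflate counts
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

cms = CountMinSketch(epsilon=0.01, delta=0.01)
print(cms.w, cms.d)  # 272 5 (matches the example above)
for _ in range(3):
    cms.increment("hello")
print(cms.estimate("hello"))  # 3 (exact here; estimates never undercount)
```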
Real-World Use Cases#
Cache Filtering#
CDN servers use bloom filters to decide what to cache:
Request arrives for resource X
→ bloom_filter.contains(X)?
→ NO: add X to filter (first request, don't cache)
→ YES: cache X (second+ request, worth caching)
Akamai reported that this approach avoids caching one-hit wonders (objects requested exactly once), which constitute roughly 75% of objects.
Spam and Malware Detection#
known_malicious_urls = BloomFilter(capacity=10_billion, fpr=0.001)
check_url(url):
    if not known_malicious_urls.contains(url):
        return SAFE             # definitely not malicious
    else:
        return full_check(url)  # might be malicious, do the expensive check
Chrome's Safe Browsing uses this pattern — a local bloom filter avoids sending every URL to Google's servers.
Database Query Optimization#
LSM-tree databases (LevelDB, RocksDB, Cassandra) attach a bloom filter to each SSTable:
Query key K:
    For each SSTable (newest to oldest):
        if sstable.bloom_filter.might_contain(K):
            search this SSTable  # expensive disk read
        else:
            skip this SSTable    # free, no I/O
This avoids reading SSTables that definitely don't contain the key — a massive performance win for read-heavy workloads.
Stream Deduplication#
seen = BloomFilter(capacity=100_million, fpr=0.01)
process_event(event):
    if seen.contains(event.id):
        return  # likely duplicate, skip
    seen.add(event.id)
    handle(event)
At 1% FPR, you'll skip ~1% of unique events but guarantee no duplicate processing. For many use cases (analytics, logging), this trade-off is excellent.
Choosing the Right Structure#
| Structure | Answers | Error Type | Memory |
|---|---|---|---|
| Bloom filter | "Is X in set?" | False positives only | ~10 bits/element |
| Counting bloom | "Is X in set?" + delete | False positives only | ~40 bits/element |
| Cuckoo filter | "Is X in set?" + delete | False positives only | ~12 bits/element |
| HyperLogLog | "How many distinct?" | ±0.81% | ~12 KB fixed |
| Count-Min Sketch | "How often X?" | Overestimates only | Configurable |
Key Takeaways#
- Bloom filters provide set membership with zero false negatives and tunable false positives
- 10 bits per element gives ~1% FPR — sufficient for most applications
- Cuckoo filters beat bloom filters when you need deletion or low FPR
- HyperLogLog counts billions of unique items in 12 KB
- Count-Min Sketch tracks frequencies in bounded space
- These structures are everywhere: caches, databases, networking, analytics
This is article #226 in the Codelit system design series. Explore all articles at codelit.io.