Probabilistic Data Structures in Go: Building and Benchmarking a Bloom Filter

Umang Sinha

Imagine you're building a high-traffic web service that relies heavily on a distributed cache like Redis. For every incoming request, your service first checks the cache to avoid expensive database lookups. Sounds efficient, right?

But here’s a subtle problem at scale: many of the keys you look up may not be in the cache at all!

Each of these negative lookups still involves a network round-trip to Redis, which, while fast, adds up when you're processing hundreds of thousands of misses per second.

Wouldn’t it be great if you could avoid those pointless lookups altogether?

This is where a Bloom filter comes in - a clever probabilistic data structure that lives in your application’s memory. It helps you quickly answer the following question:

"Is this key definitely not in the cache?"

If the Bloom filter says “no” → skip the Redis call entirely.

If it says “maybe” → go ahead and check Redis as usual. (It's important to understand that the 'maybe' case can sometimes lead to a false positive, but we'll delve into that later.)
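In Go, the guard pattern looks roughly like this (a minimal sketch with illustrative interfaces, not tied to any particular Redis client or Bloom filter library):

```go
package main

import "context"

// Filter, Cache, and DB are illustrative interfaces standing in for
// your Bloom filter, Redis client, and database layer.
type Filter interface{ Test(key []byte) bool }

type Cache interface {
	Get(ctx context.Context, key string) (val string, ok bool)
}

type DB interface {
	Load(ctx context.Context, key string) (string, error)
}

// lookup consults the Bloom filter before touching the cache.
func lookup(ctx context.Context, f Filter, c Cache, db DB, key string) (string, error) {
	if !f.Test([]byte(key)) {
		// "Definitely not in the cache": skip the network round-trip.
		return db.Load(ctx, key)
	}
	// "Maybe in the cache": check it as usual.
	// A false positive just costs one wasted cache miss here.
	if val, ok := c.Get(ctx, key); ok {
		return val, nil
	}
	return db.Load(ctx, key)
}
```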

It’s fast, tiny, and wildly effective at reducing unnecessary lookups in high-throughput systems. In fact, services at Google, Facebook, LinkedIn, and CDNs like Cloudflare all use Bloom filters to optimize performance at scale.

In this article, we’ll:

  • Break down how Bloom filters work (and the math behind them)
  • Walk through a full implementation in Go
  • Explore practical use cases, tuning strategies, and limitations

Whether you're optimizing caching logic or reducing database hits - Bloom filters are a tool worth knowing.


What is a Bloom Filter?

A Bloom filter is a probabilistic data structure used to test whether an element is a member of a set. It can return two possible answers:

  • “Definitely not present”
  • “Possibly present”

It’s fast, memory-efficient, and beautifully suited for large-scale systems where precision can be traded for performance.

Why “Probabilistic”?

Bloom filters are “probabilistic” because they allow false positives. That is, they might say an element is present even when it’s not. But the beauty is, they never give false negatives. If the filter says something isn't there, you can trust it.

This simple guarantee unlocks powerful optimizations in real-world systems.

A Bloom filter starts with:

  • A bit array of size m, all initialized to 0
  • k different hash functions

To add an element:

  1. Hash the element k times
  2. Each hash gives you an index into the bit array
  3. Set each of those k bits to 1

To check for an element:

  1. Hash it the same way
  2. If any of those k bits are 0, the element is definitely not in the set
  3. If all are 1, then it might be present

The more elements you add, the more bits get set, which increases the chances of collisions and hence, false positives.

Real-World Analogy:

Imagine you're trying to create a new Gmail account.

As soon as you enter your desired email address, Google instantly tells you whether it’s available or already taken. That check feels instantaneous, but behind the scenes, it needs to search through hundreds of millions of existing email addresses.

Now, Google could query a database directly every time someone enters an email, but:

  • That would be costly at scale
  • It would expose the system to user enumeration risks if someone tries to brute-force existing usernames

So, how do you check availability quickly and securely?

One smart solution is to use a Bloom filter.

  • Google can maintain a Bloom filter that contains all existing Gmail usernames, either fully or partially synced across regions.
  • When you enter a new username, the frontend or a lightweight backend service checks the Bloom filter first: if the filter says “definitely not present”, the username is available; if it says “might be present”, Google can then do a deeper, more secure database lookup to confirm.

This two-tiered approach:

  • Reduces load on internal services
  • Speeds up user feedback
  • Avoids leaking user data through timing differences

It’s a perfect example of how Bloom filters trade a small chance of a false positive for dramatic speed and scale benefits, especially in global systems like Google’s.


How Bloom Filters Work Internally:

Bloom filters are elegant data structures that offer a blend of mathematical simplicity, space efficiency, and speed. In this section, we’ll explore how they actually work under the hood.

How do you add an item to a Bloom filter?

A Bloom filter starts with a fixed-size bit array of length m, with all bits initialized to 0 (illustrated below with m = 8):

Index: 0 1 2 3 4 5 6 7
Bits:  0 0 0 0 0 0 0 0

Each bit in this array records whether any inserted item has hashed to that position.

To insert an item into a Bloom filter:

  1. Hash the item using k different hash functions.
  2. Each hash maps the item to a position i in the bit array.
  3. Set the bits at all k positions to 1.

Example: inserting "bloom" with k = 3 might yield indices 1, 4, and 6. After insertion, the bit array looks like this:

Index: 0 1 2 3 4 5 6 7
Bits:  0 1 0 0 1 0 1 0

How to check membership?

To check if an item might exist in the Bloom filter:

  1. Hash the item with the same k hash functions.
  2. Inspect the k positions in the bit array.
  • If any of the bits is 0 → the item is definitely not in the set.
  • If all are 1 → the item is possibly in the set.

So, Bloom filters give you a guaranteed “no” or a tentative “yes”.

If we query "bloom" again, we’ll get the same indices 1, 4, and 6. All are set to 1, so the Bloom filter returns true, meaning "bloom" might be present. And in this case, it’s correct.

Check "bloom" → positions 1, 4, 6 → all set to 1 → possibly present

Now, let’s insert another word - "filter". Suppose its hash functions return indices 4, 5, and 7. The updated bit array becomes:

Index: 0 1 2 3 4 5 6 7
Bits:  0 1 0 0 1 1 1 1

Notice the collision at index 4? Both "bloom" and "filter" caused that bit to be set. There’s no way to tell whether it was set by "bloom" or "filter". This ambiguity is at the heart of how bloom filters trade precision for speed and space.

False positives:

Since Bloom filters rely on shared bit positions, it’s possible for unrelated items to appear present just by chance.

Imagine querying a word, "maybe", whose hash functions return indices 1, 4, and 7.

In our case, "maybe" would pass the membership check because the bits it checks were already set by other words, even though we never inserted "maybe".

Check "maybe" → positions 1, 4, 7 → all set to 1 → false positive!

This is called a false positive - the filter falsely claims the item might exist.

Bloom filters never produce false negatives, but they can produce false positives.

This tradeoff is exactly what makes Bloom filters powerful: they allow blazing-fast lookups and save memory, in exchange for a small, tunable error rate.

The math behind it:

To understand the false positive rate of a Bloom filter, we need to answer one question: What is the probability that all k hash bits for a queried item are set to 1, even though the item was never inserted?

Let’s break it down step by step.

Step 1: Probability a bit is still 0 after n insertions

Each insertion sets k bits in the array. So, after inserting n items, we’ve made kn bit assignments (some of which might overlap).

Each bit has this probability of remaining unset after a single hash write:

1 - 1/m

After kn independent hash writes, the probability that a specific bit is still 0 becomes:

(1 - 1/m)^{kn}

Using the standard approximation

(1 - 1/m)^{kn} ≈ e^{-kn/m}

which holds for large m, we get:

P(bit is 0) ≈ e^{-kn/m}

Step 2: Probability a bit is 1

So, the probability that a bit is set to 1 is:

P(bit is 1) ≈ 1 - e^{-kn/m}

Step 3: False Positive Probability

Now, for a false positive to occur, all k bits checked during a query must be 1 (even though the item was never inserted). Assuming the hash functions are independent:

P(false positive) ≈ (1 - e^{-kn/m})^k

This is the false positive rate - the core metric in Bloom filter design.

What Affects the False Positive Rate?

  • n → number of items inserted
  • m → size of the bit array
  • k → number of hash functions

The goal is to choose m and k wisely for a given n so that the false positive rate stays acceptably low.

Optimal bit array size m for target false positive rate p:

m = -n * ln(p) / (ln 2)^2

There’s even an optimal value of k for given m and n:

k = (m/n) * ln 2

You can precompute m and k when initializing your Bloom filter if you know n and your acceptable p. This minimizes the false positive rate.
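For example, with n = 10,000 and p = 0.01, this gives m = -10,000 * ln(0.01) / (ln 2)^2 ≈ 95,851 bits (about 11.7 KB), and k = (95,851 / 10,000) * ln 2 ≈ 6.6, which rounds to 7 hash functions.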

Asymptotic Complexity of Bloom Filters:

Despite their probabilistic nature, Bloom filters offer impressive performance guarantees. Let’s break them down:

Time complexity:

Operation | Time complexity
Insert    | O(k)
Lookup    | O(k)

Both operations involve:

  • Running k hash functions
  • Accessing k bits in the bit array

Since k is typically a small constant (e.g., 7-10), both insert and lookup are effectively O(1) in practice.

Space complexity:

A Bloom filter with:

  • m bits of storage
  • n inserted elements
  • k hash functions

Uses a total of O(m) bits of memory.

In summary:

  • A larger bit array (m) reduces the false positive rate but increases memory usage.
  • More hash functions (k) improve precision up to the optimum; beyond it, each operation does more work and more bits get set, which raises the false positive rate.
  • The number of inserted elements (n) directly affects accuracy - a Bloom filter degrades as it gets “full”.

Bloom filters strike a beautiful balance: tiny space, lightning-fast access, and tunable accuracy.


Implementing a Bloom Filter in Go

There’s no better way to understand Bloom filters than to implement one. Instead of building one from scratch here, we’ll walk through the internals of bitbloom - a fast, thread-safe Bloom filter package I wrote in Go with performance and simplicity in mind.

We’ll dissect the core functionality: initialization, hashing, insertion, lookups, and more, covering not just what the code does, but why it's designed that way. The goal is to showcase how a clean implementation can scale while remaining easy to reason about.

Creating a Bloom Filter: bitbloom.New()

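A minimal usage sketch (the constructor's exact signature is an assumption based on the description below - expected items first, target false positive rate second):

```go
package main

import (
	"fmt"

	"github.com/umang-sinha/bitbloom"
)

func main() {
	// 10,000 expected elements, 1% target false positive rate.
	bf := bitbloom.New(10000, 0.01)

	bf.Add([]byte("hello"))
	fmt.Println(bf.Test([]byte("hello"))) // true
	fmt.Println(bf.Test([]byte("nope")))  // false, barring a false positive
}
```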

This initializes the filter for 10,000 expected elements and a 1% false positive rate. The library handles all the math for you.

Internally:

  • Bitset size m and number of hash functions k are computed based on standard formulas.
  • Values are clamped to safe ranges.
  • The hash functions are seeded deterministically.

This ensures accuracy without wasting memory or CPU.

Optimal Parameter Calculations:

Bloom filters rely on mathematical precision for tuning space and accuracy:

m = -n * ln(p) / (ln 2)^2
k = (m/n) * ln 2

In bitbloom, these are implemented as:

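A sketch of what these helpers can look like (names and exact signatures here are assumptions - check the repository for the real exports):

```go
import "math"

// OptimalM returns the bit-array size for n expected items and a
// target false positive rate p: m = -n*ln(p) / (ln 2)^2.
func OptimalM(n uint, p float64) uint {
	return uint(math.Ceil(-float64(n) * math.Log(p) / (math.Ln2 * math.Ln2)))
}

// OptimalK returns the hash-function count for m bits and n items:
// k = (m/n) * ln 2, clamped to at least 1.
func OptimalK(m, n uint) uint {
	k := uint(math.Round(float64(m) / float64(n) * math.Ln2))
	if k < 1 {
		return 1
	}
	return k
}
```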

These help you reason about the internal capacity of your filter and are exported so users can use them manually if needed.

Adding Items: Add(data []byte)

Adding an item to the filter involves hashing the data k times and setting the resulting k positions in the bitset to 1. Here’s the simplified flow:

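In sketch form (field names and the Hashes argument list are illustrative, not the exact bitbloom source):

```go
// Add inserts data by setting the k bit positions it hashes to.
func (bf *BloomFilter) Add(data []byte) {
	bf.mu.Lock() // writers get exclusive access
	defer bf.mu.Unlock()

	for _, pos := range bf.hasher.Hashes(data, bf.k, bf.m) {
		bf.bits[pos/64] |= 1 << (pos % 64) // set the bit
	}
	bf.count++ // track insertions for introspection
}
```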

  • We lock with a write mutex since the bitset will be mutated.
  • The bf.hasher.Hashes() function returns k positions using a MurmurHash-based mechanism.
  • We set those bit positions.
  • The item count is incremented for optional introspection.

Testing Items for membership: Test(data []byte)

Checking if an item might be in the filter is similar - we compute the same k hash positions and verify that all are set:

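Again in sketch form (same illustrative names as above):

```go
// Test reports whether data might have been added to the filter.
func (bf *BloomFilter) Test(data []byte) bool {
	bf.mu.RLock() // readers can proceed in parallel
	defer bf.mu.RUnlock()

	for _, pos := range bf.hasher.Hashes(data, bf.k, bf.m) {
		if bf.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false // any unset bit means definitely absent
		}
	}
	return true // all k bits set: possibly present
}
```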

  • We use a read lock for high concurrency.
  • If any of the positions are unset, we’re certain the item wasn’t added.
  • If all positions are set, we might have seen the item before - a false positive is possible but bounded by p.

Thread Safety and Internals:

The filter is guarded by a sync.RWMutex, allowing multiple reads but exclusive writes - enabling safe concurrent usage even under heavy load. Here's what the struct looks like:

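Roughly (field names are illustrative):

```go
// BloomFilter is the core type, guarded by a sync.RWMutex.
type BloomFilter struct {
	mu     sync.RWMutex // guards bits and count
	bits   []uint64     // m bits packed into 64-bit words
	m      uint64       // bit-array size
	k      uint64       // number of hash functions
	count  uint64       // items added so far, for introspection
	hasher Hasher       // pluggable hashing strategy
}
```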

The hasher interface provides flexibility. The default implementation uses MurmurHash3 for speed and uniformity.

MurmurHash-Based Hashing:

Instead of relying on Go’s built-in hash functions (which aren’t guaranteed to be portable or stable across processes), bitbloom implements a MurmurHash3-based custom hasher built on the excellent murmur3 library by Sébastien Paolacci (https://github.com/spaolacci/murmur3).

The interface:

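In sketch form (the method signature is an assumption):

```go
// Hasher yields k bit positions in [0, m) for the given input.
type Hasher interface {
	Hashes(data []byte, k, m uint64) []uint64
}
```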

This is both:

  • Fast: MurmurHash is extremely performant and produces well-distributed hashes.
  • Deterministic: Same input → same positions → consistent behaviour.

The hash values are derived in a way similar to double hashing, which reduces the cost of generating k distinct hashes from just 2 base hashes.
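A sketch of that scheme, using the murmur3 package's 128-bit hash split into two 64-bit halves (this mirrors the general technique rather than bitbloom's exact code):

```go
import "github.com/spaolacci/murmur3"

// doubleHashes derives k positions from two base hashes using the
// classic g_i(x) = h1 + i*h2 (mod m) construction.
func doubleHashes(data []byte, k, m uint64) []uint64 {
	h1, h2 := murmur3.Sum128(data) // two 64-bit halves from one pass
	positions := make([]uint64, k)
	for i := uint64(0); i < k; i++ {
		positions[i] = (h1 + i*h2) % m
	}
	return positions
}
```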

Concurrency Considerations:

Go encourages writing concurrent programs and bitbloom is designed to embrace that.

Key design points:

  • sync.RWMutex for safe parallel reads.
  • Internal locking means you don’t need to wrap access in your own sync.

This allows thousands of goroutines to hammer the filter simultaneously with minimal contention, as demonstrated in the stress test section later.

Install & Use:

go get github.com/umang-sinha/bitbloom

In the next section, we’ll push this filter to the limits with a real-world stress test, hitting millions of ops per second across thousands of goroutines.


Performance & Accuracy Benchmarking:

A Bloom filter is a probabilistic data structure, but it shouldn't be probabilistically slow.

I built bitbloom with performance and concurrency in mind. But how well does it hold up under serious load?

Let’s benchmark the library across three fronts:

  1. Raw throughput (ops/sec under concurrency)
  2. False positive accuracy (does it match the theoretical bound?)
  3. Memory usage

Stress Testing: Ops per Second

I wrote a Go program that:

  • Spawns 1,000+ goroutines
  • Each performs thousands of Add() and Test() ops
  • Shares a single Bloom filter instance
  • Times the execution and prints throughput

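The harness looks roughly like this (a sketch of the approach - the constructor signature and exact counts are assumptions, not the original benchmark code):

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"github.com/umang-sinha/bitbloom"
)

func main() {
	bf := bitbloom.New(1_000_000, 0.01)

	const goroutines = 1000
	const opsPerGoroutine = 10_000

	var wg sync.WaitGroup
	start := time.Now()

	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			for i := 0; i < opsPerGoroutine; i++ {
				key := []byte(fmt.Sprintf("key-%d-%d", g, i))
				bf.Add(key)  // one write...
				bf.Test(key) // ...and one read per iteration
			}
		}(g)
	}
	wg.Wait()

	elapsed := time.Since(start)
	totalOps := float64(2 * goroutines * opsPerGoroutine)
	fmt.Printf("%.2fM ops/sec\n", totalOps/elapsed.Seconds()/1e6)
}
```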

Across runs on an 8-core machine, the Bloom filter consistently performs in the 2.2M–2.6M ops/sec range under realistic conditions.

Accuracy Test: False Positive Rate

I inserted 1,000,000 unique items, and then tested 1,000,000 unseen keys to measure the false positive rate:

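In sketch form (constructor signature assumed, as before):

```go
// measureFPR inserts 1M known keys, then probes 1M keys that were
// never inserted; any hit on those is by definition a false positive.
func measureFPR() float64 {
	bf := bitbloom.New(1_000_000, 0.01)

	for i := 0; i < 1_000_000; i++ {
		bf.Add([]byte(fmt.Sprintf("present-%d", i)))
	}

	falsePositives := 0
	for i := 0; i < 1_000_000; i++ {
		if bf.Test([]byte(fmt.Sprintf("absent-%d", i))) {
			falsePositives++
		}
	}
	return float64(falsePositives) / 1_000_000
}
```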


Expected result: around 1% (since we set p = 0.01)

Observed result: Matches closely across runs

Benchmarking memory usage:

I ran some benchmarks using runtime.ReadMemStats to compare memory usage just after initialization and after a million insertions.

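A sketch of the measurement (the helper and labels are illustrative):

```go
package main

import (
	"fmt"
	"runtime"

	"github.com/umang-sinha/bitbloom"
)

// printMemStats dumps the heap counters discussed below.
func printMemStats(label string) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("%s: Alloc=%.1fMB TotalAlloc=%.1fMB Sys=%.1fMB NumGC=%d\n",
		label, float64(ms.Alloc)/1e6, float64(ms.TotalAlloc)/1e6,
		float64(ms.Sys)/1e6, ms.NumGC)
}

func main() {
	bf := bitbloom.New(1_000_000, 0.01)
	printMemStats("after init")

	for i := 0; i < 1_000_000; i++ {
		bf.Add([]byte(fmt.Sprintf("key-%d", i)))
	}
	printMemStats("after 1M inserts")
}
```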

The key figures from the run: Alloc stayed around 2.2 MB after a million insertions, TotalAlloc reached roughly 100 MB, and 38 GC cycles ran.

This tells you:

  • How much heap is allocated (Alloc)
  • How much total memory has been used (TotalAlloc)
  • How much has been reserved by Go (Sys)
  • How many GC cycles ran

The actual in-use memory (Alloc) stays extremely low (~2.2 MB), which makes sense since:

  • The Bloom filter’s memory usage depends only on m (size of bitset) and not on number of insertions.
  • For n = 1M and p = 0.01, the compact bitset needs m ≈ 9.6 million bits, which is only about 1.2 MB.

The ~100 MB of TotalAlloc comes from temporary allocations during:

  • Hashing operations
  • Slice copying
  • Other per-insertion allocations

Go’s GC is working well - 38 GCs for 1M insertions is not excessive, and memory use is staying bounded.


Real-World Applications:

Bloom filters aren’t just a theoretical curiosity - they’re used extensively in high-performance systems to reduce latency, avoid unnecessary computation, and optimize memory usage. Here are some real-world use cases where Bloom filters shine:

1. Databases and Storage Engines

Many databases use Bloom filters to minimize costly disk lookups. For example, Apache HBase employs Bloom filters at the block level to quickly determine whether a row or column might exist in a file, avoiding unnecessary reads from disk. Similarly, Cassandra and LevelDB rely on Bloom filters to reduce the number of SSTables accessed during queries.

2. Caching Systems

Bloom filters are often used as a read-through cache guard to prevent cache penetration. Suppose a backend cache (like Redis or Memcached) receives frequent requests for keys that don’t exist. A Bloom filter can sit in front and quickly filter out non-existent keys, reducing load on both the cache and the database behind it.

3. Content Delivery Networks (CDNs)

CDNs use Bloom filters at the edge to keep track of recently served content or known malicious URLs. For instance, Google Chrome's Safe Browsing feature historically used a Bloom filter to store hashes of unsafe URLs, allowing the browser to check for potential threats quickly without querying a server for every URL.

4. Security and Malware Detection

Security systems use Bloom filters to maintain large sets of known bad IPs, malware signatures, or suspicious domains. These can be checked quickly before performing expensive full-pattern matches or fetching threat intelligence data from external services.

5. Distributed Systems and Network Protocols

In distributed architectures, Bloom filters are used to reduce network chatter. A node might send a Bloom filter of its data set to another node so that the receiver can quickly determine which items it’s missing, without transferring full lists. Systems like Apache Hadoop use this pattern to reduce data shuffling during MapReduce jobs.


Limitations and Trade-offs

While Bloom filters are powerful and widely used, they’re not a one-size-fits-all solution. Like any engineering tool, they come with inherent trade-offs that must be carefully considered depending on your use case.

1. No Native Support for Deletion

Standard Bloom filters do not support deletions. Once an item is inserted, its presence cannot be definitively removed without risking the integrity of other items' presence bits. This limitation can be problematic in scenarios where your data set is dynamic and items frequently expire or get removed.

Mitigation:

A common solution is to use a Counting Bloom Filter, which replaces the underlying bit array with an array of counters. Each insertion increments the counters for the k hash positions, and each deletion decrements them. This adds memory overhead but enables safe removals at the cost of increased complexity and potential counter overflows.
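A minimal sketch of the idea (illustrative; hashPositions stands in for any k-position hash helper, such as the doubleHashes sketch from earlier):

```go
// CountingBloom swaps the bit array for small counters so that
// deletions become possible.
type CountingBloom struct {
	counters []uint8
	k, m     uint64
}

func (cb *CountingBloom) Add(data []byte) {
	for _, pos := range hashPositions(data, cb.k, cb.m) {
		if cb.counters[pos] < 255 { // saturate; real code must handle overflow
			cb.counters[pos]++
		}
	}
}

func (cb *CountingBloom) Remove(data []byte) {
	for _, pos := range hashPositions(data, cb.k, cb.m) {
		if cb.counters[pos] > 0 {
			cb.counters[pos]--
		}
	}
}

func (cb *CountingBloom) Test(data []byte) bool {
	for _, pos := range hashPositions(data, cb.k, cb.m) {
		if cb.counters[pos] == 0 {
			return false // definitely absent
		}
	}
	return true // possibly present
}
```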

2. False Positives Are Inevitable

Bloom filters never produce false negatives, but false positives are always a possibility. That means you might occasionally believe an item exists when it does not.

Choosing the right false positive rate (p):

This is critical and highly application-dependent. A lower false positive rate means:

  • More memory consumption
  • More hash functions (slower operations)

Conversely, a higher false positive rate reduces memory usage but increases the chance of incorrect "exists" results. Tuning p involves understanding the cost of false positives in your system. For example:

  • In a CDN cache, a false positive might mean serving stale content - probably tolerable.
  • In a security filter, a false positive might block a legitimate user - not acceptable.

3. Choosing n in a Dynamic World

Bloom filters require you to specify the expected number of insertions (n) up front. Underestimating n leads to higher false positive rates; overestimating wastes memory. This is challenging for dynamic workloads or systems where data volume changes over time.

Workaround:

You can use a scalable Bloom filter, which grows as needed by chaining multiple filters with increasing capacity and decreasing false positive targets. However, this adds design complexity.

Bloom filters offer an impressive blend of speed and space efficiency, but they are not magic. A well-engineered system must balance their strengths with their limitations, especially in high-availability or high-accuracy environments.


Conclusion:

Bloom filters are a powerful tool in an engineer’s toolkit - compact, fast, and surprisingly versatile. From preventing unnecessary database hits to guarding high-latency operations, they offer an elegant solution where approximate answers are good enough.

In this article, we explored the theory behind Bloom filters, walked through their internals, and benchmarked their performance across speed, memory, and accuracy. Along the way, we also walked through a fully concurrent Bloom filter in Go.

If you're looking to integrate a production-ready Bloom filter into your Go projects, take a look at bitbloom, the library we walked through earlier. It's fast, reliable, and designed with performance in mind.

Probabilistic data structures may not give you perfect answers, but in the right contexts, they’ll give you the right answers, fast.

