How to Implement a Bloom Filter in Go

Use the `hash/maphash` package to create a Bloom filter that supports probabilistic set membership with configurable false-positive rates.

How to Implement a Bloom Filter in Go

You're building a web crawler. It hits millions of URLs. You need to know if you've already visited a URL before fetching it. A map works for a few thousand items. It crashes your memory when you hit ten million. You need something smaller. Something that can say "probably yes" or "definitely no" while using a fraction of the RAM.

A Bloom filter is a probabilistic data structure for set membership. It never gives a false negative. If it says an item is absent, it is absent. It can give a false positive. If it says an item is present, it might be wrong. You trade accuracy for space.

Think of a long row of light switches, all off. You have a list of items. Each item corresponds to a unique set of switches. When you add an item, you flip its switches on. To check if an item exists, you look at its switches. If any switch is off, the item was never added. If all switches are on, the item was likely added, or another item happened to flip the same switches.

Go's hash/maphash package makes this fast. It provides high-performance hash functions designed for map lookups. It avoids the overhead of cryptographic hashing. You get speed and low memory usage.

The structure and constructor

Here's the core type and constructor. The filter holds a byte slice for the bits and a slice of seeds for the hash functions. Each seed produces a distinct hash function.

package main

import (
	"hash/maphash"
)

// BloomFilter tracks set membership with configurable false-positive rate.
type BloomFilter[V any] struct {
	hasher maphash.Hasher[V]
	seeds  []maphash.Seed
	bytes  []byte
}

// NewBloomFilter allocates the bit array and generates random seeds.
func NewBloomFilter[V any](hasher maphash.Hasher[V], n int, fpProb float64) *BloomFilter[V] {
	// calibrate computes optimal byte count and hash function count.
	nbytes, nseeds := calibrate(n, fpProb)

	// Each seed creates an independent hash function to minimize collisions.
	seeds := make([]maphash.Seed, nseeds)
	for i := range nseeds {
		seeds[i] = maphash.MakeSeed()
	}

	// bytes stores the bits; each byte holds 8 bits.
	return &BloomFilter[V]{
		hasher: hasher,
		bytes:  make([]byte, nbytes),
		seeds:  seeds,
	}
}

The generic parameter V any allows the filter to work with strings, integers, or structs. The hasher parameter accepts a maphash.Hasher[V]. This interface defines how to hash the value. Go provides maphash.ComparableHasher[T] for comparable types. You pass the hasher when you create the filter.

The calibrate function determines the size. You tell the filter how many items you expect (n) and how much error you can tolerate (fpProb). It calculates the number of bytes and the number of hash functions. More items or a lower error rate means more bytes. The math comes from the standard Bloom filter equations.

Bloom filters trade truth for space. Accept the trade or pick a map.

Insert and check logic

Here's the core logic. Insert marks bits as set. Contains checks if bits are set. The receiver name bf follows Go convention: short names matching the type.

// Insert marks the bits corresponding to v as set.
func (bf *BloomFilter[V]) Insert(v V) {
	for _, seed := range bf.seeds {
		// locate computes the byte index and bit mask for this hash.
		index, bit := bf.locate(seed, v)
		// Set the bit using bitwise OR.
		bf.bytes[index] |= bit
	}
}

// Contains returns true if all bits for v are set.
func (bf *BloomFilter[V]) Contains(v V) bool {
	for _, seed := range bf.seeds {
		index, bit := bf.locate(seed, v)
		// If any bit is zero, the item is definitely absent.
		if bf.bytes[index]&bit == 0 {
			return false
		}
	}
	// All bits are set; the item is likely present.
	return true
}

// locate maps a hash to a byte index and bit mask.
func (bf *BloomFilter[V]) locate(seed maphash.Seed, v V) (uint64, byte) {
	var h maphash.Hash
	h.SetSeed(seed)
	// Hash the value using the specific seed.
	bf.hasher.Hash(&h, v)
	hash := h.Sum64()

	// reduce maps the hash to [0, len(bytes)).
	index := reduce(hash, uint64(len(bf.bytes)))
	// Extract the bit position within the byte.
	mask := byte(1 << (hash % 8))
	return index, mask
}

Insert loops over the seeds. Each seed produces a hash. The hash maps to a byte index and a bit mask. The bit is set using |=. Contains does the same loop. If any bit is zero, it returns false immediately. This is the short-circuit that makes checks fast. If all bits are one, it returns true.

The locate function uses maphash.Hash. You create a hasher, set the seed, hash the value, and get a 64-bit result. The reduce function maps the hash to the byte array size. The bit mask extracts the position within the byte. hash % 8 gives a value from 0 to 7. 1 << position creates the mask.

Seeds generate hashes. Hashes set bits. Bits answer questions.

The math and helpers

Here's the calibration math and a runnable main. The reduce function uses the multiplication method to map hashes. This avoids modulo bias. The calibrate function implements the standard equations.

import (
	"fmt"
	"math"
	"math/bits"
)

// reduce maps hash to a value in [0, n) using the multiplication method.
func reduce(hash, n uint64) uint64 {
	// Use 64-bit multiplication to avoid modulo bias.
	if n > 1<<32-1 {
		hi, _ := bits.Mul64(hash, n)
		return hi
	}
	// Fast path for smaller n.
	return hash >> 32 * n >> 32
}

// calibrate computes optimal size and hash count from the math.
func calibrate(n int, fpProb float64) (bytes, seeds int) {
	if n == 0 {
		return 1, 1
	}

	// m = -(n * ln(p)) / (ln(2))^2
	logFpProb := math.Log(fpProb)
	m := -(float64(n) * logFpProb) / (math.Ln2 * math.Ln2)

	// Convert bits to bytes, rounding up.
	bytes = int(m) / 8
	if float64(bytes*8) < m {
		bytes++
	}

	// k = (m/n) * ln(2)
	k := -logFpProb / math.Ln2
	seeds = max(int(math.Round(k)), 1)
	return bytes, seeds
}

func main() {
	// Create a filter for 1000 items with 1% false positive rate.
	f := NewBloomFilter(maphash.ComparableHasher[string]{}, 1000, 0.01)

	f.Insert("apple")
	f.Insert("banana")

	// Check membership.
	if f.Contains("apple") {
		fmt.Println("apple is present")
	}
	if !f.Contains("cherry") {
		fmt.Println("cherry is absent")
	}
}

The reduce function maps a hash to an index. Modulo arithmetic introduces bias. The multiplication method distributes values more evenly. It multiplies the hash by n and takes the high bits. bits.Mul64 returns the high and low 64 bits of the 128-bit product. The high bits give the index.

The calibrate function uses the standard formulas. m is the number of bits. k is the number of hash functions. The formulas minimize the false positive rate for a given size. math.Log is the natural logarithm. math.Ln2 is ln(2). The function converts bits to bytes and rounds up.

If you forget to import hash/maphash, the compiler rejects the program with undefined: maphash. If you pass a type that doesn't match the hasher, you get a type error like cannot use ... as maphash.Hasher. The compiler catches these mistakes early.

Calibrate once. Insert many. Check often.

Realistic usage

Here's how a Bloom filter fits into a real service. It acts as a fast rejection layer. The filter protects the database, not the truth.

// CacheLayer uses a Bloom filter to skip database lookups.
type CacheLayer struct {
	filter *BloomFilter[string]
	db     *Database
}

// Get checks the filter before querying the database.
func (cl *CacheLayer) Get(key string) (string, error) {
	// Fast path: filter says absent, so DB definitely doesn't have it.
	if !cl.filter.Contains(key) {
		return "", ErrNotFound
	}

	// Slow path: filter says present, verify in DB.
	val, err := cl.db.Get(key)
	if err != nil {
		return "", err
	}

	// Insert into filter to speed up future checks.
	cl.filter.Insert(key)
	return val, nil
}

The CacheLayer checks the filter first. If the filter says the key is absent, the function returns immediately. This avoids a database query. If the filter says the key is present, the function queries the database. The database is the source of truth. The filter is a hint.

This pattern works well for caches, spam filters, and deduplication. It reduces load on downstream services. The false positive rate determines how often you do unnecessary work. A 1% false positive rate means 1 in 100 checks hits the database unnecessarily. You tune the rate based on your cost model.

Error handling follows Go convention. if err != nil { return "", err } makes the unhappy path visible. The boilerplate is verbose by design. It keeps errors explicit.

The filter protects the database, not the truth.

Pitfalls and conventions

Bloom filters have limitations. Standard Bloom filters don't support deletion. Once a bit is set, you can't unset it without potentially breaking other items. If you need deletion, use a Counting Bloom filter or a different structure.

The false positive rate assumes you insert exactly n items. If you insert more, the error rate grows. The filter fills up. Bits collide more often. You get more false positives. Plan your capacity. If the dataset grows, rebuild the filter or use a scalable variant.

hash/maphash is fast but not cryptographic. It's designed for map lookups. It's safe for Bloom filters. Don't use it for security-sensitive hashing. Use crypto/sha256 or crypto/blake2b for that.

Go conventions apply. Run gofmt on your code. It formats consistently. Most editors run it on save. Don't argue about indentation. Let the tool decide. Receiver names are short. (bf *BloomFilter) is standard. (this *BloomFilter) or (self *BloomFilter) are not. Public names start with a capital letter. Private names start lowercase. No keywords like public or private.

You cannot delete from a standard Bloom filter. Design your lifecycle around that constraint.

When to use a Bloom filter

Use a Bloom filter when you need to check membership for millions of items with limited memory and can tolerate occasional false positives. Use a map when you need exact membership, deletion support, or the dataset fits comfortably in memory. Use a bitset when you have a dense range of small integers and need O(1) exact lookups. Use a Counting Bloom filter when you need to delete items from the set. Use a Cuckoo filter when you need deletion support and better space efficiency than a Counting Bloom filter.

Probabilistic structures save RAM. Exact structures save sanity. Pick based on your constraints.

Where to go next