[Bro-Dev] Bloom filter merging

Matthias Vallentin vallentin at icir.org
Mon Jun 10 21:23:03 PDT 2013


I noticed a slight complication while implementing the Bloom filter
merge functionality: in order for Bloom filters to be mergeable, they
must have been filled with the same hash functions. Consequently,
Bloom filters need to share their internal hash functions for
meaningful results. This means that we need to preallocate them
because their seed depends on calls to bro_random(), that is, if we
instantiate a hash function later in the game, we may not get the same
one across all cluster nodes. It's also unclear how many hash
functions we need, I've seen corner cases of over 40 instances
depending on poor parametrization, though the common case is around
4-8. The hash function H3 takes on space o(2304B), so maintaining a
global pool of, say, 100 hash functions to be on the safe side would
cost us o(2MB). For example, if a Bloom filter needs 5 hash functions,
it will always grab the first 5 from the pool to ensure a
deterministic setup. The pool needs to be instantiated directly after
the seed has been initialized.

Instead of allocating a fixed number of hash functions, we could also
set aside a fixed number of seeds and use them to instantiate hash
functions as we need them. Does that sound like a good strategy to
pursue?

    Matthias


More information about the bro-dev mailing list