[Bro-Dev] Scaling out bro cluster communication

Wed Mar 8 16:26:39 PST 2017

Hi,

I was able to put together a prototype of the functionality I had in mind.

I learned a bit more about Broker along the way, including how the Broker::send_event function differs from the auto_event method.

send_event doesn't let you pick the node you want to send data to, but it does let you pick the queue.  By creating multiple nodes and subscribing each node to its own queue, I was able to achieve the end result I wanted.  It boiled down to this:

function send_event_hashed(key: any, args: Broker::EventArgs)
{
    # FIXME: node_count is static for now; work out how to determine it dynamically.
    local destination_count = node_count;
    # Hash the key onto one of the queues, numbered 1..destination_count.
    local dest = 1 + (md5_hash_count(key) % destination_count);
    local queue = fmt("bro/data/%s", dest);
    print fmt("Send hash(%s)=%s: %s", key, queue, args);
    Broker::send_event(queue, args);
}

The full example is here: https://github.com/JustinAzoff/broker_distributed_events

It implements a fake known hosts and scan detection policy.

The main things to figure out are:

* How to work out the proper node_count at runtime.  I think on a real bro cluster the Cluster namespace has the data I need for this, including which nodes are reachable.

* How to handle one node becoming unreachable or a new node showing up.  Ideally bro would use a form of consistent ring hashing.
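To make the ring hashing idea concrete, here is a rough sketch in Python (not Bro script, and not part of the prototype).  The HashRing class, its method names, and the replicas parameter are all made up for illustration; the only thing taken from the prototype is the "bro/data/N" queue naming.  The point it tries to show is that when a queue disappears, only the keys that queue owned get reassigned, whereas with plain "hash % node_count" most keys move whenever the count changes.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to an integer position on the ring using MD5."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical consistent hash ring with virtual points per queue."""

    def __init__(self, nodes=(), replicas=64):
        self.replicas = replicas  # virtual points per node (assumed value)
        self._points = []         # sorted ring positions
        self._owner = {}          # ring position -> queue name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        """Place the node's virtual points on the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # position already taken; skip it
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        """Take the node's virtual points off the ring."""
        for i in range(self.replicas):
            point = _hash(f"{node}#{i}")
            if self._owner.get(point) == node:
                del self._owner[point]
                self._points.remove(point)

    def lookup(self, key: str) -> str:
        """Return the queue owning the first point at or after the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


if __name__ == "__main__":
    ring = HashRing(f"bro/data/{i}" for i in range(1, 5))
    print(ring.lookup("192.168.1.10"))
    ring.remove("bro/data/3")   # simulate a data node going away
    print(ring.lookup("192.168.1.10"))
```

In this sketch a sender would call lookup(key) in place of the modulo step and then send to the returned queue.  Every sender would still need the same view of which nodes are on the ring, which is the node_count question above in a different form.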

If this were worked out, and implemented for logging as well, you could run a bro cluster with 2 'manager' nodes and have a fully functioning cluster even if one of them died.

As is, I can probably use this on our test cluster to run 4 data nodes and distribute scan detection to 4 cpu cores.

The example doesn't show it, but for things like the known hosts tracking it would be useful if the data could be replicated to the other data nodes.  Because the sender-side hash-based distribution also acts to de-duplicate the data, the replication would not be latency sensitive.  It would not have the problem that the current known hosts policy has, where 2 nodes can each detect and log a new host before the data synchronizes.  As long as the data had been replicated before a node outage occurred, you would get consistent logs.
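The de-duplication effect can be shown with a small simulation, again in Python rather than Bro script.  Everything here (owner_for, DataNode, simulate, the worker names) is hypothetical and only models the idea: every sender hashes the host to the same data node, so that one node is the only place that decides whether the host is new.

```python
import hashlib


def owner_for(key: str, node_count: int) -> int:
    """Pick a data node number in 1..node_count from the MD5 of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return 1 + int(digest, 16) % node_count


class DataNode:
    """Hypothetical data node holding its share of the known hosts."""

    def __init__(self, name: str):
        self.name = name
        self.known_hosts = set()
        self.log = []  # hosts this node has logged as new

    def host_seen(self, host: str) -> bool:
        """Log the host if it is new to this node; return True if logged."""
        if host in self.known_hosts:
            return False
        self.known_hosts.add(host)
        self.log.append(host)
        return True


def simulate(observations, node_count: int = 4):
    """Route (worker, host) observations to the owning data node."""
    nodes = {i: DataNode(f"bro/data/{i}") for i in range(1, node_count + 1)}
    for _worker, host in observations:
        nodes[owner_for(host, node_count)].host_seen(host)
    return nodes


if __name__ == "__main__":
    # Two workers report the same new host at about the same time.
    seen = [("worker-1", "10.0.0.1"), ("worker-2", "10.0.0.1")]
    for node in simulate(seen).values():
        print(node.name, node.log)
```

Since only the owning node ever logs a given host, copying its known_hosts set to the other nodes can happen in the background.  That copy would only matter once the owner is gone and another node has to take over its keys.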

-- 
- Justin Azoff