[Bro-Dev] Scaling out bro cluster communication

Thu Feb 9 12:21:09 PST 2017

I've been thinking about ideas for how to better scale out bro cluster communication and how that would look in scripts.

Scaling out sumstats and known hosts/services/certs detection will require script language or bif changes.

What I want to make possible is client-side load balancing and failover for worker -> manager/datanode communication.

I have 2 ideas for how things could work.  

## The explicit form, new bifs like:

      send_event(dest: string, event: any);
      send_event_hashed(dest: string, hash_key: any, event: any);

      send_event("datanode", Scan::scan_attempt(scanner, attempt));
      send_event_hashed("datanode", scanner, Scan::scan_attempt(scanner, attempt));

## A super magic awesome implicit form

    global scan_attempt: event(scanner: addr, attempt: Attempt)
        &partition_via=func(scanner: addr, attempt: Attempt) { return scanner; } ;
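
To make the implicit form a bit more concrete, here is a rough sketch of the dispatch it would imply, written in Python rather than Bro script since `&partition_via` does not exist. Everything here (`PARTITION_VIA`, the `partition_via` decorator, `hash_key`, `route`) is a hypothetical stand-in, and the plain modulo is only there to keep the sketch short:

```python
import zlib

# Hypothetical registry: event name -> function that extracts the partition key.
PARTITION_VIA = {}


def partition_via(key_func):
    """Decorator standing in for the proposed &partition_via attribute."""
    def register(handler):
        PARTITION_VIA[handler.__name__] = key_func
        return handler
    return register


@partition_via(lambda scanner, attempt: scanner)
def scan_attempt(scanner, attempt):
    """Handler body; it would run on whichever node the event is routed to."""
    return (scanner, attempt)


def hash_key(key):
    """Deterministic hash so every worker picks the same node for a key."""
    return zlib.crc32(str(key).encode("utf-8"))


def route(event_name, args, nodes):
    """Pick the destination node for an event from its partition key."""
    key = PARTITION_VIA[event_name](*args)
    return nodes[hash_key(key) % len(nodes)]
```

With this, raising `scan_attempt` for the same scanner always lands on the same node regardless of the attempt, and the script that raises the event never names a destination.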

The implicit form fits better with how bro currently works, but I think the explicit form would ultimately make cluster aware scripts simpler.

The choice hinges on the difference between implicit and explicit communication.

Currently all bro cluster communication is implicit:

* You send logs to the logger/manager node by calling Log::write
* You send notices to the manager by calling NOTICE
* You can share data between nodes by marking a container as &synchronized.
* You can send data to the manager by redef'ing Cluster::worker2manager_events

The last two are what we need to replace/extend.

As an example, in my scan.bro I want to send scan attempts up to the manager for correlation, so this means:

    # define event
    global scan_attempt: event(scanner: addr, attempt: Attempt);

    # route it to the manager
    redef Cluster::worker2manager_events += /Scan::scan_attempt/;

    # only handle it on the manager
    @if ( Cluster::local_node_type() == Cluster::MANAGER )
    event Scan::scan_attempt(scanner: addr, attempt: Attempt)
        {
        add_scan_attempt(scanner, attempt);
        }
    @endif

and then later in the worker code, finally

    # raise the event to send it up to the manager.
    event Scan::scan_attempt(scanner, attempt);

If bro communication were more explicit, the script would just be

    # define event and handle on all nodes
    global scan_attempt: event(scanner: addr, attempt: Attempt);
    event Scan::scan_attempt(scanner: addr, attempt: Attempt)
        {
        add_scan_attempt(scanner, attempt);
        }

    # send the event directly to the manager node
    send_event("manager", Scan::scan_attempt(scanner, attempt));

Things like scan detection and known hosts/services tracking are easily partitioned, so if you had two datanodes for analysis:

    if (hash(scanner) % 2 == 0)
      send_event("datanode-0", Scan::scan_attempt(scanner, attempt));
    else
      send_event("datanode-1", Scan::scan_attempt(scanner, attempt));

This would be wrapped in a function:

      send_event_hashed("datanode", scanner, Scan::scan_attempt(scanner, attempt));

that would handle knowing how many active nodes there are and doing proper consistent hashing/failover, something like this:

    function send_event_hashed(dest: string, hash_key: any, event: any) {
        local data_nodes = |Cluster::active_nodes[dest]|; # or whatever
        local node = hash(hash_key) % data_nodes;
        local node_name = Cluster::active_nodes[dest][node]$name;
        send_event(node_name, event);
    }
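
One caveat with a plain modulo is that most keys move to a different node whenever the number of active nodes changes. Below is a minimal sketch of one alternative, rendezvous (highest random weight) hashing, in Python rather than Bro script since none of these bifs exist yet. The names `pick_node`, `_score`, and the `send` callback are made up for illustration, and SHA-256 is just a convenient deterministic hash, not a suggestion for what bro should use:

```python
import hashlib
from typing import Callable, Sequence


def _score(node: str, key: str) -> int:
    """Deterministic per-(node, key) weight derived from SHA-256."""
    digest = hashlib.sha256(f"{node}|{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(active_nodes: Sequence[str], key: str) -> str:
    """Rendezvous hashing: choose the active node with the highest score."""
    if not active_nodes:
        raise ValueError("no active nodes to send to")
    return max(active_nodes, key=lambda node: _score(node, key))


def send_event_hashed(active_nodes: Sequence[str], key: str, event: object,
                      send: Callable[[str, object], None]) -> str:
    """Hypothetical stand-in for the proposed bif; `send` does the delivery."""
    node = pick_node(active_nodes, key)
    send(node, event)
    return node
```

If a datanode drops out of `active_nodes`, only the keys that were assigned to it get a new home; every other key keeps its highest-scoring node, so state that has already accumulated on the surviving nodes stays valid.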

-- 
- Justin Azoff