[Bro-Dev] [Bro-Commits] [git/bro] topic/actor-system: First-pass broker-enabled Cluster scripting API + misc. (07ad06b)

Jan Grashöfer jan.grashoefer at gmail.com
Mon Nov 6 05:18:06 PST 2017


On 03/11/17 21:05, Azoff, Justin S wrote:
> I've been thinking the same thing, but I hope it doesn't come to that.  Ideally people will be able
> to scale their clusters by just increasing the number of data nodes without having to get into
> the details about what node is doing what.
> 
> Partitioning the data analysis by task has been suggested.. i.e., one data node for scan detection,
> one data node for spam detection, one data node for sumstats.. I think this would be very easy to
> implement, but it doesn't do anything to help scale out those individual tasks once one process can
> no longer handle the load.  You would just end up with something like the scan detection and spam
> data nodes at 20% cpu and the sumstats node CPU at 100%

I would keep the individual data services scalable but allow the user to 
specify how they are distributed across the data nodes. As Jon already 
wrote, it could look like this (I added Spam and Scan pools):

[data-1]
type = data
pools = Intel::pool

[data-2]
type = data
pools = Intel::pool, Scan::pool

[data-3]
type = data
pools = Scan::pool, Spam::pool

[data-4]
type = data
pools = Spam::pool

However, this approach likely results in confusing config files and, as 
Jon wrote, it's hard to define a default configuration. In the end this 
is an optimization problem: How to assign data services (pools) to data 
nodes to get the best performance (in terms of speed, memory usage and 
reliability)?
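
To make the node-to-pool mapping above concrete, here is a minimal 
Python sketch that reads such a config and inverts it into a 
pool-to-nodes view. The "pools" option is only the syntax proposed in 
this thread, not an existing option, and pools_by_node / nodes_by_pool 
are hypothetical helper names used for illustration.

```python
import configparser
from collections import defaultdict

# Sample config in the proposed syntax (the "pools" key is an assumption).
NODE_CFG = """
[data-1]
type = data
pools = Intel::pool

[data-2]
type = data
pools = Intel::pool, Scan::pool

[data-3]
type = data
pools = Scan::pool, Spam::pool

[data-4]
type = data
pools = Spam::pool
"""


def pools_by_node(cfg_text):
    """Return {node name: [pool names]} for every section of type 'data'."""
    # Only '=' is a delimiter, so the '::' in pool names is left untouched.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(cfg_text)
    result = {}
    for node in parser.sections():
        if parser.get(node, "type", fallback="") != "data":
            continue
        raw = parser.get(node, "pools", fallback="")
        result[node] = [p.strip() for p in raw.split(",") if p.strip()]
    return result


def nodes_by_pool(node_pools):
    """Invert {node: [pools]} into {pool: [nodes]}."""
    inverted = defaultdict(list)
    for node, pools in node_pools.items():
        for pool in pools:
            inverted[pool].append(node)
    return dict(inverted)


if __name__ == "__main__":
    for pool, nodes in nodes_by_pool(pools_by_node(NODE_CFG)).items():
        print(pool, "->", ", ".join(nodes))
```

The inverted view shows how many nodes back each service, which is the 
quantity a user would have to balance by hand with this approach.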

I guess there are two possible approaches:
1) Let the user do the optimization, i.e. provide a way to assign data 
services to data nodes as described above.
2) Let the developer specify constraints for the data service 
distribution across data nodes and automate the optimization. The 
minimal example would be that for each data service a minimum and 
maximum or default number of data nodes is specified (e.g. Intel on 1-2 
nodes and Scan detection on all available nodes). More complex 
specifications could require that a data service isn't scheduled on data 
nodes together with (particular) other services.
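
The minimal variant of approach 2 could be sketched roughly as follows. 
This is only an illustration under stated assumptions: PoolSpec and 
assign_pools are hypothetical names, each service simply gets as many 
nodes as its maximum allows, and the least-loaded nodes are picked 
greedily. A real optimizer would also have to weigh memory, CPU and the 
co-location constraints mentioned above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolSpec:
    """Developer-supplied constraints for one data service (hypothetical)."""
    name: str
    min_nodes: int = 1
    max_nodes: Optional[int] = None  # None means "all available nodes"


def assign_pools(nodes, specs):
    """Greedily map each pool to the currently least-loaded data nodes."""
    assignment = {node: [] for node in nodes}
    for spec in specs:
        if spec.max_nodes is not None and spec.max_nodes < spec.min_nodes:
            raise ValueError(f"{spec.name}: max_nodes is below min_nodes")
        if len(nodes) < spec.min_nodes:
            raise ValueError(
                f"{spec.name} needs at least {spec.min_nodes} data nodes, "
                f"only {len(nodes)} available"
            )
        limit = len(nodes) if spec.max_nodes is None else spec.max_nodes
        target = min(limit, len(nodes))
        # Fewest assigned pools first; sorted() is stable, so ties keep
        # the order in which the nodes were listed.
        chosen = sorted(nodes, key=lambda n: len(assignment[n]))[:target]
        for node in chosen:
            assignment[node].append(spec.name)
    return assignment


if __name__ == "__main__":
    nodes = ["data-1", "data-2", "data-3", "data-4"]
    specs = [
        PoolSpec("Intel::pool", min_nodes=1, max_nodes=2),
        PoolSpec("Spam::pool", min_nodes=1, max_nodes=2),
        PoolSpec("Scan::pool"),  # scan detection on all available nodes
    ]
    for node, pools in assign_pools(nodes, specs).items():
        print(node, "->", ", ".join(pools))
```

With defaults like these, a user would only list the data nodes and the 
per-service constraints would produce the pool assignment.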

Another thing that might need to be considered is deep clusters. If I 
remember correctly, there has been some work on that in the context of 
Broker. For a deep cluster there might even be hierarchies of data nodes 
(e.g. root-intel-nodes managing the whole database and 
2nd-level-data-nodes serving as caches for worker-nodes at a per-site 
level).

Jan

