[Bro-Dev] [Bro-Commits] [git/bro] topic/actor-system: First-pass broker-enabled Cluster scripting API + misc. (07ad06b)
Jan Grashöfer
jan.grashoefer at gmail.com
Mon Nov 6 05:18:06 PST 2017
On 03/11/17 21:05, Azoff, Justin S wrote:
> I've been thinking the same thing, but I hope it doesn't come to that. Ideally people will be able
> to scale their clusters by just increasing the number of data nodes without having to get into
> the details about what node is doing what.
>
> Partitioning the data analysis by task has been suggested.. i.e., one data node for scan detection,
> one data node for spam detection, one data node for sumstats.. I think this would be very easy to
> implement, but it doesn't do anything to help scale out those individual tasks once one process can
> no longer handle the load. You would just end up with something like the scan detection and spam
> data nodes at 20% cpu and the sumstats node CPU at 100%
I would keep the particular data-services scalable but allow the user to
specify their distribution across the data nodes. As Jon already wrote,
it could look like this (I added Spam and Scan pools):
[data-1]
type = data
pools = Intel::pool
[data-2]
type = data
pools = Intel::pool, Scan::pool
[data-3]
type = data
pools = Scan::pool, Spam::pool
[data-4]
type = data
pools = Spam::pool
However, this approach likely results in confusing config files and, as
Jon wrote, it's hard to define a default configuration. In the end this
is an optimization problem: How to assign data-services (pools) to data
nodes to get the best performance (in terms of speed, memory usage and
reliability)?
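To see what such a config actually implies per service, here is a minimal Python sketch that inverts the node-to-pools mapping into pool-to-nodes. It assumes the file is plain INI as in the example above; the function name pools_to_nodes and the use of Python's configparser are purely illustrative and not part of Bro or broctl.

```python
import configparser

# Hypothetical config text mirroring the example in this mail.
CONFIG = """
[data-1]
type = data
pools = Intel::pool

[data-2]
type = data
pools = Intel::pool, Scan::pool

[data-3]
type = data
pools = Scan::pool, Spam::pool

[data-4]
type = data
pools = Spam::pool
"""


def pools_to_nodes(text):
    """Return {pool name: [data nodes serving it]} for an INI-style config."""
    # Only '=' is a delimiter, so the '::' in pool names stays in the value.
    parser = configparser.ConfigParser(delimiters=("=",))
    parser.read_string(text)
    mapping = {}
    for node in parser.sections():
        if parser.get(node, "type", fallback="") != "data":
            continue  # ignore non-data nodes
        for pool in parser.get(node, "pools", fallback="").split(","):
            pool = pool.strip()
            if pool:
                mapping.setdefault(pool, []).append(node)
    return mapping


if __name__ == "__main__":
    for pool, nodes in pools_to_nodes(CONFIG).items():
        print(pool, "->", ", ".join(nodes))
```

Even for four nodes the per-service view has to be derived from the file, which is part of why I expect such configs to get confusing.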
I guess there are two possible approaches:
1) Let the user do the optimization, i.e. provide a possibility to
assign data services to data nodes as described above.
2) Let the developer specify constraints for the data service
distribution across data nodes and automate the optimization. The
minimal example would be that for each data service a minimum and
maximum or default number of data nodes is specified (e.g. Intel on 1-2
nodes and Scan detection on all available nodes). More complex
specifications could require that a data service isn't scheduled on data
nodes together with (particular) other services.
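A rough sketch of how approach 2 could work, written in Python only for illustration. Everything here is an assumption on my side: the Service record, the assign function, the greedy least-loaded strategy and the "Spam must not share a node with Intel" constraint are made up to show the idea and are not an existing Bro or Broker API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Service:
    """Hypothetical constraint record a developer would ship with a data service."""
    name: str
    min_nodes: int = 1
    max_nodes: Optional[int] = None      # None means "all available nodes"
    conflicts: frozenset = frozenset()   # services it must not share a node with


def assign(services, nodes):
    """Greedily place each service on the least-loaded eligible data nodes."""
    by_name = {s.name: s for s in services}
    placement = {n: [] for n in nodes}
    for svc in services:
        wanted = len(nodes) if svc.max_nodes is None else min(svc.max_nodes, len(nodes))
        # A node is eligible if no service already on it conflicts (either direction).
        eligible = [
            n for n in nodes
            if not any(o in svc.conflicts or svc.name in by_name[o].conflicts
                       for o in placement[n])
        ]
        eligible.sort(key=lambda n: len(placement[n]))  # stable: ties keep node order
        chosen = eligible[:wanted]
        if len(chosen) < svc.min_nodes:
            raise ValueError("cannot satisfy constraints for " + svc.name)
        for n in chosen:
            placement[n].append(svc.name)
    return placement


if __name__ == "__main__":
    nodes = ["data-1", "data-2", "data-3", "data-4"]
    services = [
        Service("Intel", min_nodes=1, max_nodes=2),
        Service("Scan"),  # all available nodes
        Service("Spam", max_nodes=2, conflicts=frozenset({"Intel"})),
    ]
    for node, svcs in assign(services, nodes).items():
        print(node, svcs)
```

A greedy pass like this is order-dependent and by no means optimal, but it shows that a usable default distribution could be derived from a few per-service constraints without the user touching the node config.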
Another thing that might need to be considered is deep clusters. If I
remember correctly, there has been some work on that in the context of
Broker. For a deep cluster there might even be hierarchies of data nodes
(e.g. root-intel-nodes managing the whole database and
2nd-level-data-nodes serving as caches for worker-nodes on a per-site level).
Jan