[Bro-Dev] [Bro-Commits] [git/bro] topic/actor-system: First-pass broker-enabled Cluster scripting API + misc. (07ad06b)
Aashish Sharma
asharma at lbl.gov
Thu Nov 2 11:37:46 PDT 2017
My view:
I have repeatedly run into three types of cases while doing script/package work:
1) manager2worker: the input framework reads external data and all workers need to see it.
Example: the intel framework.
2) worker2manager: workers see something and report it to the manager; the manager keeps
aggregated counts to make decisions.
Example: scan detection.
3) worker2manager2all-workers: workers see something and send it to the manager; the manager
distributes it to all workers.
Example: tracking clicked URLs extracted from email.
Basically, Bro has two kinds of heuristic needs:
a) Cooked data analysis and correlations. Cooked data is the data which ends up
in logs, basically the entire 'protocol record', e.g. c$http or c$smtp.
These are the majority.
For simplicity, cooked data processing can also be thought of as:
tail -f blah.log | ./python-script
but inside bro.
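To make the analogy concrete, here is a minimal Python sketch of that "tail -f a log" style of cooked-data processing, done outside of Bro. It only assumes Bro's standard tab-separated log format with a '#fields' header line; the sample data and field subset are made up for illustration:

```python
import io

def parse_bro_log(stream):
    """Yield each cooked record of a tab-separated Bro log as a dict,
    using the '#fields' header line to name the columns."""
    fields = None
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]  # drop the '#fields' token itself
        elif line.startswith("#") or not line:
            continue  # skip other metadata lines and blanks
        elif fields:
            yield dict(zip(fields, line.split("\t")))

# Example: two cooked HTTP records, as 'tail -f http.log' would deliver them.
sample = (
    "#fields\tts\tid.orig_h\thost\n"
    "1509606970.5\t10.0.0.1\texample.com\n"
    "1509606980.5\t10.0.0.2\texample.org\n"
)
for rec in parse_bro_log(io.StringIO(sample)):
    print(rec["ts"], rec["host"])
```

The point of the sketch: a consumer of cooked data never touches packets or state synchronization; it just follows a stream of finished records.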
b) Raw or derived data, which you need to extract from traffic with a defined
policy of your own (for example, URLs extracted from email by tapping into the
mime_data_all event, or MAC addresses extracted from router
advertisement/solicitation events), or something which is not yet in an ::Info
record, or a new 'thing'. This should be rare, with few use cases over time.
So in short, give me reliable events which are simply tail -f log functionality
on a data/processing node. That will reduce the number of synchronization needs by
an order of magnitude or more.
For (b), raw or derived data, we can keep the complexities of broker stores,
syncs, etc. But I have hopes that refined raw data could easily become its own log
and then be processed as cooked data.
So a lot of the cluster's data-centrality issues can go away with a data
node which can handle much of the cooked-data work for (1), (2) and in
some cases (3).
Now, while Justin's multiple-data-nodes idea has spectacular merits, I am not much of a fan of it, the reason being that multiple data nodes result in the same set of problems: synchronization, latencies, a mess of data2worker and worker2data events, etc.
I'd love to keep things rather simple. Cooked data goes to one (or more) data nodes (data stores). Just replicate for reliability rather than pick and choose what goes where.
Just picking up some things:
> > In the case of broadcasting from a worker to all other workers, the reason why you relay via another node is only because workers are not connected to each other? Do we know that a fully-connected cluster is a bad idea? i.e. why not have a worker able to broadcast directly to all other workers if that’s what is needed?
>
> Mostly so that workers don't end up spending all their time sending out messages when they should be analyzing packets.
Yes. Also, I have seen this cause broadcast storms. That's why I have always
used the manager as a central judge of what goes out. Often the same data is seen by
all workers, so if the manager is smart, it can just send the first instance to the workers
and all the other workers can stop announcing it further.
Let me explain:
- I block a scanner on 3 connections.
- 3 workers see one connection each; they each report to the manager.
- The manager says "yep, scanner" and sends a note to all workers saying traffic from this
IP is now uninteresting, stop reporting.
- Let's say there are 50 workers.
- Total communication events = 3 + 50 = 53.
If all workers send data to all workers, a scanner hitting 65,000 hosts will be a
mess inside the cluster, especially when scanners are hitting in milliseconds, not seconds.
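The message-count arithmetic above can be checked with a small Python sketch. The numbers come straight from the example; the all-to-all figure assumes each worker broadcasts every sighting directly to every other worker, with no manager to cut it off:

```python
def manager_mediated_events(reports_before_block, num_workers):
    """Each sighting goes worker -> manager; then the manager sends one
    'stop reporting' note to every worker."""
    return reports_before_block + num_workers

def all_to_all_events(sightings, num_workers):
    """Every sighting is broadcast from the observing worker to all
    other workers, with nothing to suppress duplicates."""
    return sightings * (num_workers - 1)

# The scenario from the text: block after 3 connections, 50 workers.
print(manager_mediated_events(3, 50))   # 53 events total

# A scanner hitting 65,000 hosts, one sighting per host, no suppression:
print(all_to_all_events(65000, 50))     # 3,185,000 worker-to-worker messages
```

The gap between 53 and ~3.2 million messages is the argument for the manager (or a data node) acting as the central judge.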
Similar to this is another case. Let's say:
- I read 1 million blacklisted IPs from a file on the manager.
- The manager sends 1 million x 50 events (to 50 workers).
- Each worker needs to report if a blacklisted IP has touched the network.
- Now imagine we want to keep a count of how many unique local IPs each
of these blacklisted IPs has touched,
- and at what rate, and when the first and last contacts were.
(BTW, I have a working script for this, so whatever the new broker does, it needs
to be able to give me this functionality.)
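A per-IP aggregation structure for those counts might look roughly like this in Python. This is a sketch, not the actual working script; the field names mirror the sample log below (first_seen, last_seen, hosts, total_conns), and the sightings fed in are invented:

```python
from dataclasses import dataclass, field

@dataclass
class BlacklistStats:
    """Aggregated sightings of one blacklisted IP, kept centrally."""
    first_seen: float = 0.0
    last_seen: float = 0.0
    local_hosts: set = field(default_factory=set)  # unique local IPs touched
    total_conns: int = 0

    def record(self, ts, local_ip):
        if not self.first_seen or ts < self.first_seen:
            self.first_seen = ts
        self.last_seen = max(self.last_seen, ts)
        self.local_hosts.add(local_ip)
        self.total_conns += 1

stats = {}  # blacklisted IP -> BlacklistStats
for ts, bad_ip, local_ip in [
    (1508782518.6, "185.87.185.45", "10.1.1.1"),
    (1509462618.4, "185.87.185.45", "10.1.1.2"),
    (1509462620.0, "185.87.185.45", "10.1.1.1"),  # repeat host, new conn
]:
    stats.setdefault(bad_ip, BlacklistStats()).record(ts, local_ip)

s = stats["185.87.185.45"]
print(len(s.local_hosts), s.total_conns)  # 2 3
```

With 1 million blacklisted IPs, the interesting question is where this dictionary lives and how worker sightings reach it, which is exactly the (2)/(3) pattern above.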
Here is a sample log:
#fields ts ipaddr ls days_seen first_seen last_seen active_for last_active hosts total_conns source
1509606970.541130 185.87.185.45 Blacklist::ONGOING 3 1508782518.636892 1509462618.466469 07-20:55:00 01-16:05:52 20 24 TOR
1509606980.542115 46.166.162.53 Blacklist::ONGOING 3 1508472908.494320 1509165782.304233 08-00:27:54 05-02:33:18 7 9 TOR
1509607040.546524 77.161.34.157 Blacklist::ONGOING 3 1508750181.852639 1509481945.439893 08-11:16:04 01-10:44:55 7 9 TOR
1509607050.546742 45.79.167.181 Blacklist::ONGOING 4 1508440578.524377 1508902636.365934 05-08:20:58 08-03:40:14 66 818 TOR
1509607070.547143 192.36.27.7 Blacklist::ONGOING 6 1508545003.176139 1509498930.174750 11-00:58:47 01-06:02:20 30 33 TOR
1509607070.547143 79.137.80.94 Blacklist::ONGOING 6 1508606207.881810 1509423624.519253 09-11:03:37 02-02:57:26 15 16 TOR
Aashish
On Thu, Nov 02, 2017 at 05:58:31PM +0000, Azoff, Justin S wrote:
>
> > On Nov 2, 2017, at 1:22 PM, Siwek, Jon <jsiwek at illinois.edu> wrote:
> >
> >
> >> On Nov 1, 2017, at 6:11 PM, Azoff, Justin S <jazoff at illinois.edu> wrote:
> >>
> >> - a bif/function for efficiently broadcasting an event to all other workers (or data nodes)
> >> - If the current node is a data node, just send it to all workers
> >> - otherwise, round robin the event to a data node and have it send it to all workers minus the current node.
> >
> > In the case of broadcasting from a worker to all other workers, the reason why you relay via another node is only because workers are not connected to each other? Do we know that a fully-connected cluster is a bad idea? i.e. why not have a worker able to broadcast directly to all other workers if that’s what is needed?
>
> Mostly so that workers don't end up spending all their time sending out messages when they should be analyzing packets.
>
> >> If &synchronized is going away script writers should be able to broadcast an event to all workers by doing something like
> >>
> >> Cluster::Broadcast(Cluster::WORKERS, event Foo(42));
> >>
> >> This would replace a ton of code that currently uses things like worker2manager_events+manager2worker_events+ at if ( Cluster::local_node_type() == Cluster::MANAGER )
> >
> > The successor to &synchronized was primarily intended to be the new data store stuff, so is there a way to map what you need onto that functionality? Or can you elaborate on an example where you think this new broadcast pattern is a better way to replace &synchronized than using a data store?
> >
> > - Jon
>
> I think a shared data store would work for most of the use cases where people are messing with worker2manager_events.
>
> If all the cases of people using worker2manager_events+manager2worker_events to mimic broadcast functionality are really just
> doing so to update data then it does make sense to just replace all of that with a new data store.
>
> How would something like policy/protocols/ssl/validate-certs.bro look with intermediate_cache as a data store?
>
>
> —
> Justin Azoff