[Bro-Dev] scheduling events vs using &expire_func ?

Aashish Sharma asharma at lbl.gov
Tue Apr 17 16:04:51 PDT 2018


For now, I am sticking with the &expire_func route. By adding some more
heuristics in the workers' expire functions to aggregate stats further, I am able
to shed load from the manager, which then doesn't need to track ALL potential scanners.

Let's see; I am running the new code for a few days to see if it works without exhausting memory.

Yes, certainly, the following changes did address the manager network_time()
stall issues:

redef table_expire_interval = 0.1 secs;
redef table_incremental_step = 25;

Useful observation: if you want to expire a lot of entries from a table/set,
expire few but expire often.
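A minimal sketch of how those two tunables combine with a table carrying an
&expire_func (the table and function names here are hypothetical, not from my
actual policy):

```bro
# Expire few entries at a time, but check often: each timer tick walks
# at most table_incremental_step entries looking for expired ones.
redef table_expire_interval = 0.1 secs;
redef table_incremental_step = 25;

# Hypothetical per-IP connection counter; entries expire five minutes
# after creation, invoking the expire function below.
function scan_expire(t: table[addr] of count, idx: addr): interval
	{
	# Do as little work here as possible; heavy expire functions
	# are what clog the manager at millions of entries.
	return 0 secs;  # expire the entry now
	}

global conn_counts: table[addr] of count
	&create_expire=5 mins &expire_func=scan_expire;
```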

I still need to determine the limits of both table_incremental_step and
table_expire_interval, and whether this works for millions of entries.

> on expiration scheme (e.g. if it's expiring on something other than create
> times, you're going to need a way to invalidate previously scheduled
> events).

Actually, in this case I was thinking more in terms of letting the scheduled event
kick in and, within that event, deciding whether to schedule the next one or delete
the entry from the table.
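A sketch of that self-rescheduling pattern (event name, table, and threshold are
hypothetical): each firing decides whether to re-arm itself or drop the entry, so
stale entries don't accumulate timers that later need invalidating.

```bro
global conn_counts: table[addr] of count;

# Hypothetical per-IP check event: on each firing, either reschedule
# itself or remove the entry.
event check_scanner(ip: addr)
	{
	if ( ip !in conn_counts )
		return;

	if ( conn_counts[ip] < 10 )
		# Still looks quiet: drop it rather than keep rescheduling.
		delete conn_counts[ip];
	else
		# Still interesting: come back later and decide again.
		schedule 5 mins { check_scanner(ip) };
	}
```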

> actual work required to call and evaluate the &expire_func code becomes too
> great at some point, so maybe first try decreasing `table_incremental_step`

Yes, it seems like that. I still don't know at exactly what point. In previous runs
it appeared after the table had 1.7-2.3 million entries. But then I don't think it's
a function of entry counts so much as how much RAM I've got on the system; somewhere
in that range is when the manager ran out of memory. However (as stated above), I was
able to come up with a small heuristic which still allows me to keep track of really
slow scanners while not burdening the manager, instead letting the load fall on the
workers. The simple observation that really slow scanners aren't going to have a lot
of connections allows keeping those in (a few) worker tables. This could potentially
be a problem if there really are a LOT of very slow scanners, but even then those all
get divided across the number of workers we run.
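A rough sketch of that worker-side heuristic (threshold, event, and table names are
all hypothetical): only IPs that cross a count threshold get forwarded to the manager
on expiry; really slow scanners just age out in the worker's local table.

```bro
# Hypothetical cutoff: below this count an IP never leaves the worker.
const report_threshold = 5 &redef;

function worker_expire(t: table[addr] of count, idx: addr): interval
	{
	if ( t[idx] >= report_threshold )
		# Worth aggregating cluster-wide: hand off to the manager
		# (assumes a worker2manager event is set up elsewhere).
		event Scan::worker_report(idx, t[idx]);

	# Otherwise the entry just ages out locally on this worker.
	return 0 secs;
	}

global w_counts: table[addr] of count
	&create_expire=30 mins &expire_func=worker_expire;
```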

> I'd suggest a different way to think about
> structuring the problem: you could Rendezvous Hash the IP addresses across
> proxies, with each one managing expiration in just their own table.  In that
> way, the storage/computation can be uniformly distributed and you should be
> able to simply adjust number of proxies to fit the required scale.

I think the above might work reasonably well.

So previously I was making the manager keep counts of potential scanners, but now I
am moving that work to the workers instead. The new model would let us move all of
this to the proxy(ies), and the proxies can decide whether to delete an entry or
send it to the manager for aggregation.
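If I understand the new broker-enabled API correctly, the distribution step could
look something like the sketch below: Rendezvous Hash each originator address onto
one proxy via Cluster::publish_hrw, so the same IP always lands on the same proxy,
which then owns counting and expiration for it (the Scan::count_attempt event is
hypothetical).

```bro
# On workers: route each potential-scanner IP to a single proxy.
event connection_attempt(c: connection)
	{
	# Same orig_h always hashes to the same proxy, so per-IP state
	# lives in exactly one place and scales with the proxy count.
	Cluster::publish_hrw(Cluster::proxy_pool, c$id$orig_h,
	                     Scan::count_attempt, c$id$orig_h);
	}
```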

I suppose, given that proxies don't process packets, it will be cheaper to do all
this work there.

The only thing that bothers me is that scan detection is a complicated problem only
because of the distribution of data in the cluster. It's a much simpler problem if
we could just do:

tail -f conn.log | ./python-script

So yes, we can shed load from manager -> workers -> proxies. I'll try this
approach. But I think I am also going to try (with the new broker-enabled cluster)
the approach of sending all connections to one proxy/data-store and just doing the
aggregation there, and see if that works out (the tail -f conn.log |
./python-script approach). Admittedly, this needs more thinking to get the right
architecture in the new cluster era!

Thanks,
Aashish 




On Mon, Apr 16, 2018 at 10:32:45AM -0500, Jon Siwek wrote:
> 
> 
> On 4/13/18 6:14 PM, Aashish Sharma wrote:
> > I have an aggregation policy where I am trying to keep counts of the number of
> > connections an IP made in a cluster setup.
> > 
> > For now, I am using table on workers and manager and using expire_func to
> > trigger worker2manager and manager2worker events.
> > 
> > All works great until tables grow to > 1 million after which expire_functions
> > start clogging on manager and slowing down.
> > 
> > Example of Timer from prof.log on manager:
> > 
> > 1523636760.591416 Timers: current=57509 max=68053 mem=4942K lag=0.44s
> > 1523636943.983521 Timers: current=54653 max=68053 mem=4696K lag=168.39s
> > 1523638289.808519 Timers: current=49623 max=68053 mem=4264K lag=1330.82s
> > 1523638364.873338 Timers: current=48441 max=68053 mem=4162K lag=60.06s
> > 1523638380.344700 Timers: current=50841 max=68053 mem=4369K lag=0.47s
> > 
> > So instead of using &expire_func, I can probably try schedule {}; but I am not
> > sure how scheduling events is any different internally than scheduling
> > expire_funcs?
> 
> There's a single timer per table that continuously triggers incremental
> iteration over fixed-size chunks of the table, looking for entries to
> expire.  The relevant options that you can tune here:
> 
> * `table_expire_interval`
> * `table_incremental_step`
> * `table_expire_delay`
> 
> > I'd like to think/guess that scheduling events is probably less taxing. but
> > wanted to check with the greater group on thoughts - esp insights into their
> > internal processing queues.
> 
> I'm not clear on exactly how your code would be restructured around
> scheduled events, though guessing if you just did one event per entry that
> needs to be expired, it's not going to be better.  You would then have one
> timer per table entry (up from a single timer), or possibly more depending
> on expiration scheme (e.g. if it's expiring on something other than create
> times, you're going to need a way to invalidate previously scheduled
> events).
> 
> Ultimately, you'd likely still have the same amount of equivalent function
> calls (whatever work you're doing in &expire_func, would still need to
> happen).  With the way table expiration is implemented, my guess is that the
> actual work required to call and evaluate the &expire_func code becomes too
> great at some point, so maybe first try decreasing `table_incremental_step`
> or reducing the work that you need to do in the &expire_func.
> 
> With new features in the upcoming broker-enabled cluster framework (soon to
> be merged into git/master), I'd suggest a different way to think about
> structuring the problem: you could Rendezvous Hash the IP addresses across
> proxies, with each one managing expiration in just their own table.  In that
> way, the storage/computation can be uniformly distributed and you should be
> able to simply adjust number of proxies to fit the required scale.
> 
> - Jon

