[Bro-Dev] scheduling events vs using &expire_func ?
asharma at lbl.gov
Thu Apr 19 07:03:52 PDT 2018
On Wed, Apr 18, 2018 at 01:46:08PM +0000, Azoff, Justin S wrote:
> How are you tracking slow scanners on the workers? If you have 50 workers and you
> are not distributing the data between them, there's only a 1 in 50 chance that you'll
> see the same scanner twice on the same worker, and a one in 2500 that you'd see
> 3 packets in a row on the same worker... and 1:125,000 for 4 in a row.
Yes, that was the observation and the idea: if they are genuinely slow scanners
(and there obviously won't be too many of them too soon), let each worker track
the start time of the scanners it happens to see. The odds of a slow scanner
repeatedly hitting the same worker are low, so the burden of tracking start
times for thousands of slow scanners is spread fairly evenly across the 10-50
workers *instead* of the manager having to store all of it.
So 600K slow scanners means |manager_table| += 600K, versus |worker_table| =
600K/50 = 12K entries per worker, so the memory burden is much more evenly
distributed.
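To make the arithmetic above explicit, here is a quick sanity check in Python using the numbers from the thread (a back-of-the-envelope sketch, not a measurement; it assumes flows are spread uniformly across workers):

```python
# Probability that a slow scanner's first k packets all land on the
# same worker, assuming flows are spread uniformly across n workers.
def same_worker_odds(n_workers, k_packets):
    return (1 / n_workers) ** (k_packets - 1)

# Per-worker table size if the scanner entries are spread evenly
# across the workers instead of held on the manager.
def per_worker_entries(total_scanners, n_workers):
    return total_scanners // n_workers

# With 50 workers: 1/50 for 2 packets in a row, 1/2500 for 3,
# 1/125000 for 4 -- matching the odds quoted above.
# And 600K tracked scanners over 50 workers is 12K entries each.
```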
I checked some numbers on my end: since midnight we flagged 172K scanners while
tracking 630K potential scanners that will eventually be flagged.
The issue isn't flagging these. The issue is being accurate about the very
first time we saw an IP connect to us, and keeping that in memory - it isn't
strictly needed, but it's a good statistic to have.
> >> I'd suggest a different way to think about
> >> structuring the problem: you could Rendezvous Hash the IP addresses across
> >> proxies, with each one managing expiration in just their own table. In that
> >> way, the storage/computation can be uniformly distributed and you should be
> >> able to simply adjust number of proxies to fit the required scale.
> That doesn't simplify anything, that just moves the problem. You can only tail the single
> conn.log because the logger aggregated the records from all the workers. If the manager
> is running out of memory tracking all of the scanners, then the single instance of the python
> script is going to run into the same issue at some point.
I agree, but we are digressing from the actual issue here; see below.
> > So yes we can shed load from manger -> workers -> proxies. I'll try this
> > approach. But I think I am also going to try (with new broker-enabled cluster)
> > approach of sending all connections to one proxy/data-store and just do
> > aggregation there and see if that works out (the tail -f conn.log |
> > 'python-script' approach). Admittedly, this needs more thinking to get the right
> > architecture in the new cluster era!
> No.. this is just moving the problem again. If your manager is running out of memory and you
> move everything to one proxy, that's just going to have the same problem.
I think we've been talking about this for roughly two years now, so I am
probably misunderstanding something, or maybe being unclear. The issue is the
complexity of aggregation that clusterization introduces into scan detection.
You can use many proxies, many data nodes, etc., but as long as the data stays
distributed, real-time aggregation is a problem. The data needs to be
concentrated in one place, and a tail -f of conn.log *is* the data
concentrated in one place.
Now it's a separate issue that a conn.log entry is 5+ seconds late, which can
already miss a significant part of a scan.
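As a concrete sketch of the "tail -f conn.log | python-script" aggregation idea, something like the following could sit behind the pipe. The field positions and the threshold here are illustrative assumptions, not taken from an actual conn.log schema:

```python
from collections import defaultdict

def aggregate(lines, threshold=3):
    """Flag a source as a scanner once it has contacted `threshold`
    distinct destination ports.

    Assumes tab-separated conn.log lines with the originator IP in
    column 3 and the responder port in column 6 (hypothetical
    positions -- a real script would parse the #fields header)."""
    ports_by_src = defaultdict(set)
    scanners = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip Bro log header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        src, dport = fields[2], fields[5]
        ports_by_src[src].add(dport)
        if len(ports_by_src[src]) >= threshold:
            scanners.add(src)
    return scanners
```

Because everything flows through this one process, the aggregation state is naturally concentrated in one place - which is exactly the property being discussed, and also exactly the scaling limit Justin points out.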
> The fix is to use the distributing message routing features that I've been talking about for a while
> (and that Jon implemented in the actor-system branch!)
> The entire change to switch simple-scan from aggregating all scanners on a single manager to
> aggregating scanners across all proxies (which can be on multiple machines) is swapping
Aggregating across all proxies is still distributing the data around, so the
way I see it, you are just moving the problem around :) But as I said, I don't
know enough about how this works, since I haven't tried the new broker stuff
just yet.
> event Scan::scan_attempt(scanner, attempt);
> Cluster::publish_hrw(Cluster::proxy_pool, scanner, Scan::scan_attempt, scanner, attempt);
> (with some @ifdefs to make it work on both versions of bro)
I am *really* looking forward to trying this out in the new broker model.
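For reference, here is a minimal Python sketch of how Rendezvous (HRW) hashing picks an owner for each key - the idea behind Cluster::publish_hrw as I understand it, though the actual Bro/Broker implementation surely differs in its hash function and details:

```python
import hashlib

def hrw_owner(key, nodes):
    """Return the node with the highest hash(key, node) score.

    Every sender computes the same owner for a given key, so all
    events for one scanner IP land on the same proxy -- no
    coordination or shared state needed, and adding a node only
    reassigns roughly 1/n of the keys."""
    def score(node):
        digest = hashlib.sha1(f"{key}|{node}".encode()).hexdigest()
        return int(digest, 16)
    return max(nodes, key=score)
```

With this, the table for scanner X lives only on hrw_owner(X, proxies), which is why the per-scanner state ends up spread across the proxy pool rather than concentrated on one node.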