[Bro-Dev] early performance comparisons of CAF-based run loop

Wed Apr 12 20:19:47 PDT 2017

> On Apr 12, 2017, at 9:05 PM, Siwek, Jon <jsiwek at illinois.edu> wrote:
> 
> 
>> On Apr 12, 2017, at 1:35 PM, Slagell, Adam J <slagell at illinois.edu> wrote:
>> 
>> Justin asked an interesting question today, how does this affect performance on the manager? That is where we are feeling a lot of pain with select().
> 
> If you mean the select() that’s in the process fork’d by the old RemoteSerializer code, you’d still see the same problems with the CAF-based runloop.  But that code is irrelevant once Broker takes its place. i.e. to answer that question, you need to design a communication stress test using Broker-based Bros as that’s more relevant than just changing the main loop.

Yep, that select stuff.  My question was mostly about the different workloads in a bro cluster.

Something that may be optimized for a worker dealing with 1 pktsrc and 2 peers may not be as optimal for a logger/manager that has no pktsrc but 100+ worker connections.  I've often wondered if the event loop should have a hint somewhere about which kind of process is running so it can optimize for throughput vs multiplexing many peers.

> Eventually, I can also imagine the Broker-based communication being more tightly integrated into the CAF-based runloop helping improve performance over the current Broker integration method.  Either way, what needs to be measured is how CAF’s multiplexer performs in relation to Bro’s communication patterns, but maybe still want to wait for the Broker improvements to wrap up before looking into doing those tests.
> 
> In the near-term, I can make a totally separate code branch that simply replaces select() with epoll.  Then, if Justin were to test it and find it alleviates performance pains on the manager, it could potentially get merged into bro/master ahead of any of the pending broker/caf/runloop projects since it should be a trivial and safe change to do.  Let me know.

Ah.. I had actually started trying to do that a long time ago, but gave up because broker was going to replace all of that code anyway.

https://github.com/bro/bro/commits/topic/jazoff/select-to-poll

From what I recall, the first commit seemed to work, but the second broke something.

The thing that always stood out to me was that the manager would run select across all the worker sockets, and then loop over each worker and run CanRead, which just ran select again on each individual FD.
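
To make the alternative concrete, here is a rough Python sketch (not Bro's actual code; ready_peers and the socketpair setup are hypothetical stand-ins I made up for illustration). It assumes a platform where poll() is available, e.g. Linux. One poll() call already says which peers are readable, so there is no need for a second readiness check per peer:

```python
import select
import socket


def ready_peers(peers, timeout_ms=0):
    """Return the peer sockets that are readable, using a single poll() call."""
    poller = select.poll()
    by_fd = {}
    for sock in peers:
        poller.register(sock.fileno(), select.POLLIN)
        by_fd[sock.fileno()] = sock
    # poll() returns (fd, eventmask) pairs only for descriptors with pending
    # events, so the caller never has to re-check each peer individually.
    return [by_fd[fd] for fd, events in poller.poll(timeout_ms)
            if events & select.POLLIN]


# Hypothetical stand-ins for worker connections: three local socket pairs.
pairs = [socket.socketpair() for _ in range(3)]
manager_side = [a for a, _ in pairs]
pairs[1][1].sendall(b"event")  # only the second "worker" has sent anything

readable = ready_peers(manager_side)
```

The loop that handles input then only visits the sockets in readable, rather than asking the kernel again for every worker.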

One issue a few people have run into on the manager is that select returns EINVAL and deadlocks bro if you give it an FD at or above 1024 (FD_SETSIZE), which you currently hit at around a 200 node cluster (socket + flares use 4 or 5 FDs per worker).
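
As a back-of-the-envelope check on that number, a small Python sketch (max_workers and reserved_fds are names I made up; 1024 is the usual FD_SETSIZE on Linux, and the count of descriptors the manager holds for everything else is an assumption):

```python
FD_SETSIZE = 1024  # typical compile-time select() limit on Linux (assumed here)


def max_workers(fds_per_worker, reserved_fds=0, fd_limit=FD_SETSIZE):
    """Largest worker count whose descriptors all stay below fd_limit."""
    return (fd_limit - reserved_fds) // fds_per_worker


# 4 or 5 descriptors per worker, as estimated above.
limits = {per_worker: max_workers(per_worker) for per_worker in (4, 5)}
print(limits)  # {4: 256, 5: 204}

# Assuming the manager keeps roughly 24 descriptors for logs and so on.
print(max_workers(5, reserved_fds=24))  # 200
```

So at 5 FDs per worker the ceiling lands right around 200 workers, which matches what people have reported.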

-- 
- Justin Azoff