[Bro-Dev] Performance Enhancements

Jim Mellander jmellander at lbl.gov
Fri Oct 6 14:59:51 PDT 2017


I particularly like the idea of an allocation pool in which per-packet
information can be stored and then reused by the next packet.
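
To make the pool idea concrete, here is a minimal sketch of a free-list pool, assuming a made-up PacketInfo record with a Reset() method. None of these names come from the Bro source; it only illustrates recycling an object instead of calling new/delete for every packet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical per-packet record; not an actual Bro type.
struct PacketInfo {
    uint32_t len;
    uint16_t src_port;
    uint16_t dst_port;
    void Reset() { len = 0; src_port = 0; dst_port = 0; }
};

// Minimal single-threaded pool: released objects are kept on a free list
// and handed back out before anything new is allocated.
template <typename T>
class ObjectPool {
public:
    std::unique_ptr<T> Acquire()
    {
        if (free_.empty()) {
            ++allocations_;                    // count real heap allocations
            return std::unique_ptr<T>(new T());
        }
        std::unique_ptr<T> obj = std::move(free_.back());
        free_.pop_back();
        return obj;
    }

    // Give an object back; it is reset so the next user sees a clean record.
    void Release(std::unique_ptr<T> obj)
    {
        assert(obj != nullptr);
        obj->Reset();
        free_.push_back(std::move(obj));
    }

    size_t Allocations() const { return allocations_; }

private:
    std::vector<std::unique_ptr<T>> free_;
    size_t allocations_ = 0;
};
```

In a packet loop this would be Acquire() at the start of processing and Release() at the end, so the steady state needs no allocator calls at all. Whether that beats a good malloc would have to be measured.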

There are also probably some optimizations of frequent operations, now that
we're in a 64-bit world, that could prove useful.  The one's complement
checksum calculation in net_util.cc is one that comes to mind, especially
since it works effectively a byte at a time (and works with even byte
counts only).  Seeing as this is done per-packet on all TCP payload,
optimizing this seems reasonable.  Here's a discussion of doing the checksum
calculation in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
This website also has an x64 allocator that is claimed to be faster than
tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml  (note: I
haven't tried anything from this source, but I find it interesting).
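
As a rough illustration of the wider-arithmetic idea, here is a sketch that adds 32-bit big-endian words into a 64-bit accumulator and folds the carries once at the end. It is not the code from net_util.cc and not the exact technique from the article above (which uses 64-bit loads with carry handling). The function names are made up, and a 16-bit reference version is included only for comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reference version: sum 16-bit big-endian words; an odd trailing byte is
// padded with a zero byte, as in the usual Internet checksum definition.
uint16_t checksum_ref(const uint8_t* data, size_t len)
{
    assert(data != nullptr || len == 0);
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 2 <= len; i += 2)
        sum += (uint32_t(data[i]) << 8) | data[i + 1];
    if (i < len)
        sum += uint32_t(data[i]) << 8;
    while (sum >> 16)                       // fold carries back in
        sum = (sum & 0xffff) + (sum >> 16);
    return uint16_t(~sum);
}

// Wider version: consume 4 bytes per iteration into a 64-bit sum.  The
// accumulator cannot overflow for any realistic packet size, so carries
// only need to be folded once, after the loop.
uint16_t checksum_wide(const uint8_t* data, size_t len)
{
    assert(data != nullptr || len == 0);
    uint64_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= len; i += 4)
        sum += (uint32_t(data[i]) << 24) | (uint32_t(data[i + 1]) << 16) |
               (uint32_t(data[i + 2]) << 8) | uint32_t(data[i + 3]);
    for (; i + 2 <= len; i += 2)            // leftover 16-bit word, if any
        sum += (uint32_t(data[i]) << 8) | data[i + 1];
    if (i < len)                            // leftover odd byte, if any
        sum += uint32_t(data[i]) << 8;
    while (sum >> 16)                       // 2^16 == 1 (mod 0xffff), so this
        sum = (sum & 0xffff) + (sum >> 16); // folds 64 bits down to 16
    return uint16_t(~sum);
}
```

Both functions should agree for any input, including odd lengths.  Any speedup depends on the compiler turning the shifts into a single load plus byte swap, so it would need benchmarking before drawing conclusions.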

I'm guessing there are a number of such "small" optimizations that could
provide significant performance gains.

Take care,

Jim




On Fri, Oct 6, 2017 at 7:26 AM, Azoff, Justin S <jazoff at illinois.edu> wrote:

>
> > On Oct 6, 2017, at 12:10 AM, Clark, Gilbert <gc355804 at ohio.edu> wrote:
> >
> > I'll note that one of the challenges with profiling is that there are
> > the bro scripts, and then there is the bro engine.  The scripting layer has
> > a completely different set of optimizations that might make sense than the
> > engine does: turning off / turning on / tweaking different scripts can have
> > a huge impact on Bro's relative performance depending on the frequency with
> > which those script fragments are executed.  Thus, one way to look at
> > speeding things up might be to take a look at the scripts that are run most
> > often and see about ways to accelerate core pieces of them ... possibly
> > by moving pieces of those scripts to builtins (as C methods).
> >
>
> Re: scripts, I have some code I put together to do arbitrary benchmarks of
> templated bro scripts.  I need to clean it up and publish it, but I found
> some interesting things.  Function calls are relatively slow, so things
> like
>
>     ip in Site::local_nets
>
> are faster than calling
>
>     Site::is_local_addr(ip);
>
> Inlining short functions could speed things up a bit.
>
> I also found that things like
>
>     port == 22/tcp || port == 3389/tcp
>
> are faster than checking if port in {22/tcp,3389/tcp}, up to about 10
> ports.  Having the hash class fall back to a linear search when the hash
> contains only a few items could speed things up there.  Things like
> 'likely_server_ports' have 1 or 2 ports in most cases.
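
The linear-search fallback described above could look roughly like the following sketch.  SmallKeySet is a hypothetical class, not Bro's hash class, and the default threshold of 10 is just the rough figure quoted above; the real crossover point would need to be measured.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical set of integer keys that stays a flat array while small and
// only switches to a hash table once it grows past linear_limit entries.
class SmallKeySet {
public:
    explicit SmallKeySet(size_t linear_limit = 10) : limit_(linear_limit) {}

    void Insert(uint32_t key)
    {
        if (Contains(key))
            return;
        if (hashed_.empty() && small_.size() < limit_) {
            small_.push_back(key);          // still small: keep the flat array
            return;
        }
        if (hashed_.empty()) {              // first overflow: move everything
            hashed_.insert(small_.begin(), small_.end());
            small_.clear();
        }
        hashed_.insert(key);
        assert(small_.empty());             // only one representation in use
    }

    bool Contains(uint32_t key) const
    {
        if (!hashed_.empty())
            return hashed_.count(key) != 0;
        // Few items: a linear scan avoids hashing the key entirely.
        return std::find(small_.begin(), small_.end(), key) != small_.end();
    }

    bool UsesHash() const { return !hashed_.empty(); }

private:
    size_t limit_;
    std::vector<uint32_t> small_;
    std::unordered_set<uint32_t> hashed_;
};
```

For a set like 'likely_server_ports' with one or two entries, every lookup would then be a couple of integer comparisons.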
>
>
> > If I had to guess at one engine-related thing that would've sped things
> > up when I was profiling this stuff back in the day, it'd probably be
> > rebuilding the memory allocation strategy / management.  From what I
> > remember, Bro does do some malloc / free in the data path, which hurts
> > quite a bit when one is trying to make things go fast.  It also means that
> > the selection of a memory allocator and NUMA / per-node memory management
> > is going to be important.  That's probably not going to qualify as
> > something *small*, though ...
>
> Ah!  This reminds me of something I was thinking about a few weeks ago.
> I'm not sure to what extent bro uses memory allocation pools/interning for
> common immutable data structures, like port objects or small strings.
> There's no reason bro should be mallocing/freeing memory to create port
> objects when there are only 65536 times 2 (or 3?) port objects... but bro
> does things like
>
>         tcp_hdr->Assign(0, new PortVal(ntohs(tp->th_sport), TRANSPORT_TCP));
>         tcp_hdr->Assign(1, new PortVal(ntohs(tp->th_dport), TRANSPORT_TCP));
>
> for every packet, as well as allocating a ton of TYPE_COUNT vals for
> things like packet sizes and header lengths, which will almost always be
> between 0 and 64k.
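
A sketch of what interning ports could look like, assuming three transport protocols and a made-up Port struct in place of Bro's PortVal (which is reference counted and would need more care than this): every possible port object is built once up front, and lookups return a reference to the shared instance instead of allocating.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins; Bro's own PortVal / TransportProto are not used.
enum class Proto : uint8_t { TCP = 0, UDP = 1, ICMP = 2 };

struct Port {
    uint16_t num;
    Proto proto;
};

// Preallocates one immutable Port per (protocol, number) pair: 3 * 65536
// small structs, built once at startup.
class PortTable {
public:
    PortTable() : ports_(kProtos * kPorts)
    {
        for (size_t p = 0; p < kProtos; ++p)
            for (size_t n = 0; n < kPorts; ++n)
                ports_[p * kPorts + n] =
                    Port{static_cast<uint16_t>(n), static_cast<Proto>(p)};
    }

    // Per-packet lookup: an index computation, no malloc/free.
    const Port& Get(uint16_t num, Proto proto) const
    {
        assert(static_cast<size_t>(proto) < kProtos);
        return ports_[static_cast<size_t>(proto) * kPorts + num];
    }

private:
    static constexpr size_t kProtos = 3;
    static constexpr size_t kPorts = 65536;
    std::vector<Port> ports_;
};
```

The same table approach might work for small TYPE_COUNT values in the 0 to 64k range, though that is a guess rather than something tested.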
>
> For things that can't be interned, like IPv6 addresses, having an allocation
> pool could speed things up.  Instead of freeing things like IPAddr objects,
> they could just be returned to a pool, and then when a new IPAddr object is
> needed, an already initialized object could be grabbed from the pool and
> 'refreshed' with the new value.
>
> https://golang.org/pkg/sync/#Pool
>
> talks about that sort of thing.
>
> > On a related note, a fun experiment is always to try running bro with a
> > different allocator and seeing what happens ...
>
> I recently noticed our boxes were using jemalloc instead of tcmalloc.
> Switching that caused malloc to drop a few places down in 'perf top' output.
>
>
> Justin Azoff
>
>
>