<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">I particularly like the idea of an allocation pool in which per-packet information can be stored and then reused by the next packet.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">There are probably also some optimizations of frequent operations that could prove useful now that we&#39;re in a 64-bit world. The one&#39;s complement checksum calculation in net_util.cc is one that comes to mind, especially since it effectively works a byte at a time (and works with even byte counts only).  Since this is done per packet on all TCP payload, optimizing it seems reasonable.  Here&#39;s a discussion of doing the checksum calculation in 64-bit arithmetic: <a href="https://locklessinc.com/articles/tcp_checksum/">https://locklessinc.com/articles/tcp_checksum/</a>. The same website also has an x64 allocator that is claimed to be faster than tcmalloc; see: <a href="https://locklessinc.com/benchmarks_allocator.shtml">https://locklessinc.com/benchmarks_allocator.shtml</a>  (note: I haven&#39;t tried anything from this source, but I find it interesting).</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">I&#39;m guessing there are a number of such &quot;small&quot; optimizations that could provide significant performance gains.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Take care,</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Jim</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" 
style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 6, 2017 at 7:26 AM, Azoff, Justin S <span dir="ltr">&lt;<a href="mailto:jazoff@illinois.edu" target="_blank">jazoff@illinois.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

&gt; On Oct 6, 2017, at 12:10 AM, Clark, Gilbert &lt;<a href="mailto:gc355804@ohio.edu">gc355804@ohio.edu</a>&gt; wrote:<br>

&gt;<br>

&gt; I&#39;ll note that one of the challenges with profiling is that there are the bro scripts, and then there is the bro engine.  The scripting layer calls for a completely different set of optimizations than the engine does: turning off / turning on / tweaking different scripts can have a huge impact on Bro&#39;s relative performance, depending on how frequently those script fragments are executed.  Thus, one way to look at speeding things up might be to take a look at the scripts that are run most often and see about ways to accelerate core pieces of them ... possibly by moving pieces of those scripts to builtins (as C methods).<br>

&gt;<br>

<br>

</span>Re: scripts, I have some code I put together to do arbitrary benchmarks of templated bro scripts.  I need to clean it up and publish it, but I found some interesting things.  Function calls are relatively slow, so things like<br>

<br>

    ip in Site::local_nets<br>

<br>

is faster than calling<br>

<br>

    Site::is_local_addr(ip);<br>

<br>

Inlining short functions could speed things up a bit.<br>

<br>

I also found that things like<br>

<br>

    port == 22/tcp || port == 3389/tcp<br>

<br>

is faster than checking if port in {22/tcp,3389/tcp}, up to about 10 ports. Having the hash class fall back to a linear search when the hash contains only a few items could speed things up there.  Things like &#39;likely_server_ports&#39; have 1 or 2 ports in most cases.<br>
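
<br>

A minimal C++ sketch of that linear-search fallback, under stated assumptions: SmallSet is a hypothetical container and not an existing Bro class, keys are plain 32-bit integers rather than port values, and the cutoff of 10 is only borrowed from the rough "about 10 ports" figure above and would need to be measured.<br>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical container (not part of Bro): scans a small vector linearly
// and only switches to a hash set once it grows past kLinearLimit entries.
class SmallSet {
public:
    // Assumed cutoff; tune by benchmarking.
    static constexpr std::size_t kLinearLimit = 10;

    void Insert(std::uint32_t key) {
        if (Contains(key))
            return;
        if (!hashed_.empty()) {
            hashed_.insert(key);
            return;
        }
        small_.push_back(key);
        if (small_.size() > kLinearLimit) {
            // Too large for a cheap scan: migrate everything to the hash set.
            hashed_.insert(small_.begin(), small_.end());
            small_.clear();
        }
    }

    bool Contains(std::uint32_t key) const {
        // Invariant: at most one of the two representations is in use.
        assert(small_.empty() || hashed_.empty());
        if (!hashed_.empty())
            return hashed_.count(key) != 0;
        return std::find(small_.begin(), small_.end(), key) != small_.end();
    }

    bool UsesHash() const { return !hashed_.empty(); }

private:
    std::vector<std::uint32_t> small_;
    std::unordered_set<std::uint32_t> hashed_;
};
```

With one or two entries, as in the likely_server_ports case, a lookup here is just a couple of integer comparisons with no hashing at all.<br>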

<span class=""><br>

<br>

&gt; If I had to guess at one engine-related thing that would&#39;ve sped things up when I was profiling this stuff back in the day, it&#39;d probably be rebuilding the memory allocation strategy / management.  From what I remember, Bro does do some malloc / free in the data path, which hurts quite a bit when one is trying to make things go fast.  It also means that the selection of a memory allocator and NUMA / per-node memory management is going to be important.  That&#39;s probably not going to qualify as something *small*, though ...<br>

<br>

</span>Ah!  This reminds me of something I was thinking about a few weeks ago.  I&#39;m not sure to what extent bro uses memory allocation pools/interning for common immutable data structures, such as port objects or small strings.  There&#39;s no reason bro should be mallocing/freeing memory to create port objects when there are only 65536 times 2 (or 3?) possible port objects, but bro does things like<br>

<br>

        tcp_hdr-&gt;Assign(0, new PortVal(ntohs(tp-&gt;th_sport), TRANSPORT_TCP));<br>

        tcp_hdr-&gt;Assign(1, new PortVal(ntohs(tp-&gt;th_dport), TRANSPORT_TCP));<br>

<br>

for every packet, as well as allocating a ton of TYPE_COUNT vals for things like packet sizes and header lengths, which will almost always be between 0 and 64k.<br>
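
<br>

A rough C++ sketch of the interning idea for port objects. Port, Proto and PortInterner are hypothetical stand-ins and not PortVal or any existing Bro API; the table is filled lazily so that only ports actually seen cost an allocation. How shared objects would interact with reference counting in the engine is not addressed here.<br>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-ins, not the real Bro types.
enum class Proto : std::uint8_t { TCP = 0, UDP = 1, ICMP = 2 };

struct Port {
    std::uint16_t number;
    Proto proto;
};

// Hands out one shared, immutable Port per (number, proto) pair instead
// of allocating a fresh object for every packet.
class PortInterner {
public:
    PortInterner() : table_(kProtos * kPorts) {}

    const Port* Get(std::uint16_t number, Proto proto) {
        const std::size_t idx =
            static_cast<std::size_t>(proto) * kPorts + number;
        assert(idx < table_.size());
        if (!table_[idx])
            table_[idx].reset(new Port{number, proto});  // first use only
        return table_[idx].get();
    }

private:
    static constexpr std::size_t kProtos = 3;      // the "2 (or 3?)" above
    static constexpr std::size_t kPorts = 65536;
    std::vector<std::unique_ptr<Port>> table_;
};
```

Two calls to Get with the same port and protocol return the same pointer, so the per-packet cost becomes an index computation and a null check rather than a malloc/free pair.<br>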

<br>

For things that can&#39;t be interned, like IPv6 addresses, having an allocation pool could speed things up. Instead of freeing things like IPAddr objects, they could just be returned to a pool; then, when a new IPAddr object is needed, an already initialized object could be grabbed from the pool and &#39;refreshed&#39; with the new value.<br>

<br>

<a href="https://golang.org/pkg/sync/#Pool" rel="noreferrer" target="_blank">https://golang.org/pkg/sync/#<wbr>Pool</a><br>

<br>

Talks about that sort of thing.<br>
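
<br>

A small C++ sketch of such a pool, again under assumptions: Addr and AddrPool are hypothetical and are not the IPAddr class from Bro. Unlike sync.Pool in Go, this version is single-threaded and never discards idle objects on its own.<br>

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an address object; not the Bro IPAddr class.
struct Addr {
    std::array<std::uint8_t, 16> bytes{};
    // "Refresh" an existing object with a new value.
    void Reset(const std::array<std::uint8_t, 16>& b) { bytes = b; }
};

// Single-threaded free list: Release() parks objects instead of deleting
// them, and Acquire() reuses a parked object when one is available.
class AddrPool {
public:
    std::unique_ptr<Addr> Acquire(const std::array<std::uint8_t, 16>& b) {
        std::unique_ptr<Addr> obj;
        if (free_.empty()) {
            obj.reset(new Addr());        // pool is empty: allocate once
        } else {
            obj = std::move(free_.back());
            free_.pop_back();
        }
        obj->Reset(b);
        return obj;
    }

    void Release(std::unique_ptr<Addr> obj) {
        assert(obj != nullptr);
        free_.push_back(std::move(obj));
    }

    std::size_t Idle() const { return free_.size(); }

private:
    std::vector<std::unique_ptr<Addr>> free_;
};
```

After a warm-up period the pool would hold roughly as many objects as are live at peak, and steady-state traffic would then recycle them without touching the allocator.<br>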

<span class=""><br>

&gt; On a related note, a fun experiment is always to try running bro with a different allocator and seeing what happens ...<br>

<br>

</span>I recently noticed our boxes were using jemalloc instead of tcmalloc. Switching that caused malloc to drop a few places down in &#39;perf top&#39; output.<br>

<br>

<br>

—<br>

<span class="HOEnZb"><font color="#888888">Justin Azoff<br>

<br>

<br>

</font></span></blockquote></div><br></div>