<div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Interesting info.  The more than an order of magnitude difference in time between BaseList::remove and BaseList::remove_nth suggests that the for loop in BaseList::remove is falling off the end in many cases (i.e. attempting to remove an item that doesn&#39;t exist).  Maybe that&#39;s what&#39;s broken.</div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Oct 6, 2017 at 3:49 PM, Azoff, Justin S <span dir="ltr">&lt;<a href="mailto:jazoff@illinois.edu" target="_blank">jazoff@illinois.edu</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class=""><br>

&gt; On Oct 6, 2017, at 5:59 PM, Jim Mellander &lt;<a href="mailto:jmellander@lbl.gov">jmellander@lbl.gov</a>&gt; wrote:<br>

&gt;<br>

&gt; I particularly like the idea of an allocation pool in which per-packet information can be stored and then reused by the next packet.<br>

&gt;<br>

&gt; There are also probably some optimizations of frequent operations that could prove useful now that we&#39;re in a 64-bit world. The one&#39;s complement checksum calculation in net_util.cc is one that comes to mind, especially since it works effectively a byte at a time (and works with even byte counts only).  Seeing as this is done per-packet on all tcp payload, optimizing this seems reasonable.  Here&#39;s a discussion of doing the checksum calc in 64-bit arithmetic: <a href="https://locklessinc.com/articles/tcp_checksum/" rel="noreferrer" target="_blank">https://locklessinc.com/<wbr>articles/tcp_checksum/</a> - this website also has an x64 allocator that is claimed to be faster than tcmalloc, see: <a href="https://locklessinc.com/benchmarks_allocator.shtml" rel="noreferrer" target="_blank">https://locklessinc.com/<wbr>benchmarks_allocator.shtml</a>  (note: I haven&#39;t tried anything from this source, but find it interesting).<br>
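As a rough illustration of the 64-bit idea, here is a minimal C++ sketch. It is not the code in net_util.cc, and the function names are hypothetical. It adds 32-bit words into a 64-bit accumulator and folds the carries back at the end, with a 16-bit-at-a-time version alongside for comparison. Both return the folded sum in native byte order without complementing it, and both pad an odd trailing byte with zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: fold a wide accumulator down to 16 bits,
// adding the carries back in (end-around carry).
static uint16_t fold_to_16(uint64_t sum) {
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return static_cast<uint16_t>(sum);
}

// Baseline for comparison: one 16-bit word per iteration.
uint16_t ones_complement_sum16(const void* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint64_t sum = 0;
    for (; len >= 2; p += 2, len -= 2) {
        uint16_t w;
        std::memcpy(&w, p, 2);  // memcpy avoids unaligned access
        sum += w;
    }
    if (len == 1) {             // odd trailing byte, zero padded
        uint16_t w = 0;
        std::memcpy(&w, p, 1);
        sum += w;
    }
    return fold_to_16(sum);
}

// Wider variant: one 32-bit word per iteration into a 64-bit accumulator.
// Valid because 2^16 is congruent to 1 modulo 2^16 - 1, so the high and
// low halves of each 32-bit word contribute equally after folding.
uint16_t ones_complement_sum64(const void* data, std::size_t len) {
    assert(data != nullptr || len == 0);
    const unsigned char* p = static_cast<const unsigned char*>(data);
    uint64_t sum = 0;
    for (; len >= 4; p += 4, len -= 4) {
        uint32_t w;
        std::memcpy(&w, p, 4);
        sum += w;
    }
    if (len >= 2) {
        uint16_t w;
        std::memcpy(&w, p, 2);
        sum += w;
        p += 2;
        len -= 2;
    }
    if (len == 1) {             // odd trailing byte, zero padded
        uint16_t w = 0;
        std::memcpy(&w, p, 1);
        sum += w;
    }
    return fold_to_16(sum);
}
```

Whether this is actually faster than the existing routine would need to be measured on real traffic. The sketch only shows that the wider accumulation produces the same sum.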

&gt;<br>

&gt; I&#39;m guessing there are a number of such &quot;small&quot; optimizations that could provide significant performance gains.<br>

&gt;<br>

&gt; Take care,<br>

&gt;<br>

&gt; Jim<br>

<br>

</span>I&#39;ve been messing around with &#39;perf top&#39;, and the one&#39;s complement function often shows up fairly high, along with PriorityQueue::BubbleDown and BaseList::remove.<br>

<br>

Something (on our configuration?) is doing a lot of PQ_TimerMgr::~PQ_TimerMgr. I don&#39;t think I&#39;ve come across that class before in Bro. I think a script may be triggering something that is hurting performance, but I can&#39;t think of what it would be.<br>

<br>

Running perf top on a random worker right now with -F 19999 shows:<br>

<br>

Samples: 485K of event &#39;cycles&#39;, Event count (approx.): 26046568975<br>

Overhead  Shared Object                 Symbol<br>

  34.64%  bro                           [.] BaseList::remove<br>

   3.32%  libtcmalloc.so.4.2.6          [.] operator delete<br>

   3.25%  bro                           [.] PriorityQueue::BubbleDown<br>

   2.31%  bro                           [.] BaseList::remove_nth<br>

   2.05%  libtcmalloc.so.4.2.6          [.] operator new<br>

   1.90%  bro                           [.] Attributes::FindAttr<br>

   1.41%  bro                           [.] Dictionary::NextEntry<br>

   1.27%  <a href="http://libc-2.17.so" rel="noreferrer" target="_blank">libc-2.17.so</a>                  [.] __memcpy_ssse3_back<br>

   0.97%  bro                           [.] StmtList::Exec<br>

   0.87%  bro                           [.] Dictionary::Lookup<br>

   0.85%  bro                           [.] NameExpr::Eval<br>

   0.84%  bro                           [.] BroFunc::Call<br>

   0.80%  libtcmalloc.so.4.2.6          [.] tc_free<br>

   0.77%  libtcmalloc.so.4.2.6          [.] operator delete[]<br>

   0.70%  bro                           [.] ones_complement_checksum<br>

   0.60%  libtcmalloc.so.4.2.6          [.] tcmalloc::ThreadCache::<wbr>ReleaseToCentralCache<br>

   0.60%  bro                           [.] RecordVal::RecordVal<br>

   0.53%  bro                           [.] UnaryExpr::Eval<br>

   0.51%  bro                           [.] ExprStmt::Exec<br>

   0.51%  bro                           [.] iosource::Manager::FindSoonest<br>

   0.50%  libtcmalloc.so.4.2.6          [.] operator new[]<br>

<br>

<br>

Which sums up to 59.2%<br>

<br>

BaseList::remove/BaseList::<wbr>remove_nth seems particularly easy to optimize. Can&#39;t that loop be replaced by a memmove?<br>

I think something may be broken if it&#39;s being called that much though.<br>
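To make the memmove idea concrete, here is a minimal C++ sketch. It uses a hypothetical PtrList struct as a stand-in, not Bro&#39;s actual BaseList, and the function names are made up for illustration. Removing by index closes the gap with a single memmove. Removing by value still has to scan, and scans the entire list when the item is absent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a list of opaque pointers; not Bro's BaseList.
struct PtrList {
    void** entries;    // caller-owned storage
    int num_entries;   // number of valid entries
};

// Remove and return the n-th entry, shifting the tail down with one memmove
// instead of an element-by-element copy loop.
void* list_remove_nth(PtrList& l, int n) {
    assert(n >= 0 && n < l.num_entries);
    void* old = l.entries[n];
    --l.num_entries;
    std::size_t tail = static_cast<std::size_t>(l.num_entries - n);
    std::memmove(&l.entries[n], &l.entries[n + 1], tail * sizeof(void*));
    return old;
}

// Remove the first occurrence of item. Returns nullptr if it is not present,
// in which case the whole list has been scanned for nothing.
void* list_remove(PtrList& l, void* item) {
    for (int i = 0; i < l.num_entries; ++i)
        if (l.entries[i] == item)
            return list_remove_nth(l, i);
    return nullptr;
}
```

The memmove only speeds up the shift after a match. If most calls are for items that are not in the list, the linear scan dominates and the real fix is in the callers.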

<br>

<br>

<br>

—<br>

<span class="HOEnZb"><font color="#888888">Justin Azoff<br>

<br>

</font></span></blockquote></div><br></div>