[Bro-Dev] Performance Enhancements
Azoff, Justin S
jazoff at illinois.edu
Fri Oct 6 15:49:30 PDT 2017
> On Oct 6, 2017, at 5:59 PM, Jim Mellander <jmellander at lbl.gov> wrote:
> I particularly like the idea of an allocation pool in which per-packet information can be stored and reused by the next packet.
> There are also probably some optimizations of frequent operations, now that we're in a 64-bit world, that could prove useful - the one's complement checksum calculation in net_util.cc is one that comes to mind, especially since it works effectively a byte at a time (and works with even byte counts only). Seeing as this is done per-packet on all TCP payload, optimizing this seems reasonable. Here's a discussion of doing the checksum calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/ - this website also has an x64 allocator that is claimed to be faster than tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml (note: I haven't tried anything from this source, but find it interesting).
> I'm guessing there are a number of such "small" optimizations that could provide significant performance gains.
> Take care,
I've been messing around with 'perf top', and the one's complement function often shows up fairly high, along with PriorityQueue::BubbleDown and BaseList::remove.
Something (on our configuration?) is doing a lot of PQ_TimerMgr::~PQ_TimerMgr. I don't think I've come across that class before in Bro. I think a script may be triggering something that is hurting performance, but I can't think of what it would be.
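To make the checksum idea concrete, here is a rough sketch of summing with wider loads into a 64-bit accumulator. It is not the code from net_util.cc or from the linked article: the function names are made up for illustration, and it uses 32-bit loads (rather than 64-bit loads with carry handling) to keep the carry logic trivial. Both functions return the folded, uncomplemented sum in native byte order, and the odd trailing byte is padded with a zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fold a wide one's complement accumulator down to 16 bits.
static inline uint16_t fold16(uint64_t sum) {
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return static_cast<uint16_t>(sum);
}

// Reference version (hypothetical name): one 16-bit word per iteration.
uint16_t checksum_sum16(const uint8_t* p, size_t len) {
    uint64_t sum = 0;
    for (; len >= 2; p += 2, len -= 2) {
        uint16_t w;
        std::memcpy(&w, p, 2);  // memcpy avoids unaligned-access problems
        sum += w;
    }
    if (len) {                  // odd trailing byte, padded with zero
        uint16_t w = 0;
        std::memcpy(&w, p, 1);
        sum += w;
    }
    return fold16(sum);
}

// Wide version (hypothetical name): 32-bit loads, 64-bit accumulator.
// The carries out of each 16-bit half pile up in the upper bits and are
// folded back in once at the end.
uint16_t checksum_sum_wide(const uint8_t* p, size_t len) {
    uint64_t sum = 0;
    for (; len >= 4; p += 4, len -= 4) {
        uint32_t w;
        std::memcpy(&w, p, 4);
        sum += w;
    }
    if (len >= 2) {             // leftover 16-bit word
        uint16_t w;
        std::memcpy(&w, p, 2);
        sum += w;
        p += 2;
        len -= 2;
    }
    if (len) {                  // odd trailing byte, padded with zero
        uint16_t w = 0;
        std::memcpy(&w, p, 1);
        sum += w;
    }
    return fold16(sum);
}
```

The two agree because 2^16 and 2^32 are both congruent to 1 modulo 2^16 - 1, so it doesn't matter how wide the chunks are before folding. Whether this is actually faster than what Bro does now would need measuring.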
Running perf top on a random worker right now with -F 19999 shows:
Samples: 485K of event 'cycles', Event count (approx.): 26046568975
Overhead  Shared Object         Symbol
34.64%    bro                   [.] BaseList::remove
3.32%     libtcmalloc.so.4.2.6  [.] operator delete
3.25%     bro                   [.] PriorityQueue::BubbleDown
2.31%     bro                   [.] BaseList::remove_nth
2.05%     libtcmalloc.so.4.2.6  [.] operator new
1.90%     bro                   [.] Attributes::FindAttr
1.41%     bro                   [.] Dictionary::NextEntry
1.27%     libc-2.17.so          [.] __memcpy_ssse3_back
0.97%     bro                   [.] StmtList::Exec
0.87%     bro                   [.] Dictionary::Lookup
0.85%     bro                   [.] NameExpr::Eval
0.84%     bro                   [.] BroFunc::Call
0.80%     libtcmalloc.so.4.2.6  [.] tc_free
0.77%     libtcmalloc.so.4.2.6  [.] operator delete
0.70%     bro                   [.] ones_complement_checksum
0.60%     libtcmalloc.so.4.2.6  [.] tcmalloc::ThreadCache::ReleaseToCentralCache
0.60%     bro                   [.] RecordVal::RecordVal
0.53%     bro                   [.] UnaryExpr::Eval
0.51%     bro                   [.] ExprStmt::Exec
0.51%     bro                   [.] iosource::Manager::FindSoonest
0.50%     libtcmalloc.so.4.2.6  [.] operator new
That adds up to 59.2%.
BaseList::remove and BaseList::remove_nth seem particularly easy to optimize. Can't that loop be replaced by a memmove?
I think something may be broken if it's being called that much though.
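On the memmove idea, this is roughly what I have in mind. It's only a sketch: PtrList is a made-up stand-in for a list of pointers, not Bro's actual BaseList, and the caller is assumed to pass a valid index (0 <= n < num_entries).

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a pointer list; NOT Bro's BaseList.
struct PtrList {
    void** entries;
    int num_entries;
};

// Element-by-element shift, the kind of loop being discussed.
void* remove_nth_loop(PtrList& l, int n) {
    void* old = l.entries[n];
    --l.num_entries;
    for (int i = n; i < l.num_entries; ++i)
        l.entries[i] = l.entries[i + 1];
    return old;
}

// Same effect with a single memmove over the tail of the array.
// memmove (not memcpy) because source and destination overlap.
void* remove_nth_memmove(PtrList& l, int n) {
    void* old = l.entries[n];
    --l.num_entries;
    std::memmove(&l.entries[n], &l.entries[n + 1],
                 static_cast<std::size_t>(l.num_entries - n) * sizeof(void*));
    return old;
}
```

This only speeds up the shifting. Each removal is still linear in the list length, so if something is calling remove far more often than it should, memmove won't fix that.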