[Bro-Dev] Performance Enhancements

Jim Mellander jmellander at lbl.gov
Mon Oct 9 13:57:54 PDT 2017


Well, I found pathological behavior in BaseList::remove.

Instrumenting it with a printf of num_entries and i after the for loop,
running against a test pcap, and then summarizing with awk gives:

Count, num_entries, i

1 3583 3536
1 3584 3537
1 3623 3542
1 3624 3543
1 3628 3620
1 3629 3621
1 3636 3562
1 3636 3625
1 3637 3563
1 3637 3626
1 3644 3576
1 3644 3641
1 3645 3577
1 3645 3642
1 3647 3641
1 3648 3642
1 3650 3629
1 3651 3630
1 3658 3647
1 3659 3648
1 3663 3655
1 3664 3656
1 3673 3629
1 3674 3630
1 3697 3686
1 3698 3687
1 3981 3595
1 3982 3596
1 4372 3978
1 4373 3979
1 4374 3656
1 4374 4371
1 4375 3657
1 4375 4372
1 4554 4371
1 4555 4372
1 4571 4367
1 4571 4551
1 4572 4368
1 4572 4552
1 4968 4566
1 4969 4567
1 5058 4566
1 5059 4567
1 5160 4963
1 5161 4964
1 5258 5157
1 5259 5158
1 5342 4566
1 5343 4567
1 5353 5253
1 5354 5254
1 5356 3638
1 5356 5337
1 5356 5350
1 5356 5351
1 5356 5353
1 5357 3639
1 5357 5338
1 5357 5351
1 5357 5352
1 5357 5354
1 5367 5351
1 5368 5352
1 5369 4556
1 5370 4557
1 5374 3675
1 5374 5366
1 5375 3676
1 5375 5367
1 5379 3664
1 5379 5045
1 5380 3665
1 5380 5046
1 5384 3601
1 5385 3602
1 5386 5354
1 5387 5355
1 5392 5370
1 5393 5371
1 5404 5363
1 5404 5381
1 5405 5364
1 5405 5382
1 5408 5341
1 5408 5368
1 5408 5399
1 5409 5342
1 5409 5369
1 5409 5400
1 5413 5401
1 5413 5403
1 5414 5402
1 5414 5404
1 5416 5408
1 5417 5409
1 5429 5395
1 5430 5396
1 5439 5381
1 5439 5406
1 5440 5382
1 5440 5407
1 5460 5436
1 5461 5437
1 5463 5407
1 5464 5408
1 5465 5397
1 5465 5460
1 5466 5398
1 5466 5461
1 5474 5359
1 5474 5451
1 5474 5456
1 5474 5471
1 5475 5360
1 5475 5452
1 5475 5457
1 5475 5472
1 5479 5456
1 5479 5476
1 5480 5457
1 5480 5477
1 5481 5416
1 5482 5417
1 5493 5426
1 5493 5474
1 5493 5488
1 5494 5427
1 5494 5475
1 5494 5489
1 5497 5357
1 5497 5367
1 5497 5461
1 5497 5462
1 5497 5480
1 5497 5488
1 5498 5358
1 5498 5368
1 5498 5462
1 5498 5463
1 5498 5481
1 5498 5489
1 5499 3682
1 5499 5460
1 5499 5476
1 5499 5478
1 5499 5480
1 5500 3683
1 5500 5461
1 5500 5477
1 5500 5479
1 5500 5481
2 3612 3609
2 3613 3610
2 3689 3686
2 3690 3687
2 3697 3694
2 3698 3695
2 5374 5371
2 5375 5372
2 5384 5381
2 5385 5382
2 5463 5460
2 5464 5461
2 5493 5465
2 5494 5466
2 5497 5484
2 5498 5485
2 5499 5482
2 5499 5488
2 5499 5490
2 5499 5492
2 5499 5494
2 5500 5483
2 5500 5489
2 5500 5491
2 5500 5493
2 5500 5495
3 4571 4568
3 4572 4569
3 5493 5490
3 5494 5491
3 5497 5490
3 5497 5492
3 5498 5491
3 5498 5493
3 5499 5486
3 5500 5487
4 3647 3644
4 3648 3645
5 5379 5376
5 5380 5377
7 5499 5496
7 5500 5497
10 5497 5494
10 5498 5495
26 3 2
5861 3 1
13714 4 4
14130 4 1
34914 4 0
74299 3 3
1518194 2 1
2648755 2 2
8166358 3 0
13019625 2 0
62953139 0 0
71512938 1 1
104294506 1 0
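(For reference, the summary above can be reproduced with a pipeline along
these lines. This is a reconstruction, not the exact commands used; the
printf stands in for the instrumented bro's "num_entries i" output.)

```shell
# Reconstruction of the summary step: the instrumented BaseList::remove
# prints one "num_entries i" line per call; awk tallies identical pairs
# and the count-first output is then sorted numerically.
printf 'num_entries i\n3 1\n3 1\n2 2\n' |      # stand-in for bro's output
awk 'NR > 1 { count[$1 " " $2]++ }
     END { for (k in count) print count[k], k }' |
sort -n
# -> 1 2 2
#    2 3 1
```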

There are 286 instances where the list has over 3000 entries and the
desired entry is near the end...  That linear search has got to be killing
performance, even though it's uncommon  :-(

The case of num_entries being 0 can be optimized a bit, but is relatively
minor.
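For anyone following along, the pattern in question is essentially this (a
paraphrased sketch, not the actual List.cc source):

```cpp
#include <cassert>
#include <cstring>

// Paraphrased sketch of the pattern in question (not the actual List.cc
// source): remove() walks the array from the front, so deleting an entry
// near the end of an n-entry list costs O(n) comparisons plus an O(n)
// shift of the tail.  The "i" column above is how far that walk got.
struct BaseList {
    static const int CAP = 16;
    void* entry[CAP];
    int num_entries = 0;

    void append(void* e) { entry[num_entries++] = e; }

    void* remove(void* e) {
        int i;
        for ( i = 0; i < num_entries && e != entry[i]; ++i )
            ;  // linear search

        if ( i == num_entries )
            return 0;  // not found

        // shift everything after slot i down by one
        memmove(&entry[i], &entry[i + 1],
                (num_entries - i - 1) * sizeof(void*));
        --num_entries;
        return e;
    }
};
```

With num_entries in the thousands and the match near the end, both the
search and the tail shift touch most of the array on every call.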

Now I'll see if I can identify the offending List.

Jim




On Mon, Oct 9, 2017 at 1:03 PM, Clark, Gilbert <gc355804 at ohio.edu> wrote:

> If you look in one of the subdirectories or another, in ages past there
> was a little shell script to incrementally execute bro against a specific
> trace, loading one script at a time to see the effect each of them had on
> the overall runtime.  I can't remember what it was called offhand, but it
> was useful for quick and dirty testing.
>
>
> And yes, function calls in bro script land are pretty horrific in terms of
> overhead.  Line for line, bro script in general isn't terribly efficient,
> because it doesn't do any of the things a modern interpreter might
> (e.g. just-in-time compilation).  That's not a criticism, it's only a note
> - most folks rely on horizontal scaling to deal with faster ingress, which
> I think makes a whole lot of sense.
>
>
> Just in the event it's useful, I've attached some profiling I did on the
> script function overhead with the crap I wrote: these are some graphs of
> which script functions call which other script functions, how many times
> that happens, and how much time is spent in each function.
>
>
> avggraph is the average time spent per-call, and resgraph is the aggregate
> time spent in each function across the entire run.  The numbers' formatting
> needed some fixing, but I never made it that far ...
>
>
> I know Robin et al. were working on different approaches for
> next-generation scripting kind of stuff, but haven't kept up well enough to
> really know where those are.
>
>
> One thing I played around with was using plugin hooks to integrate other
> scripting languages into the bro fast path (luajit was my weapon of choice)
> and seeing if conversion from bro script to one of those other languages
> might improve the run time.  Other languages would still be less efficient
> than C, and anything garbage collected would need to be *really* carefully
> used, but ... it struck me as an idea that might be worth a look :)
>
>
> And yeah, generally speaking, most of the toolkits I've played with for
> software-based packet processing absolutely do use memory pools for the
> fast path.  They also use burst fetch tricks (to amortize the cost of
> fetching packets over X packets, rather than fetching one packet at a
> time), and I've also seen quite a bit of prefetch / SIMD to try to keep
> things moving quickly once the packets make it to the CPU.
>
>
> Things start to get pretty crazy as packet rates increase, though: once
> you hit about 10 - 15 Gbps, even a *cache miss* on a modern system is
> enough to force a drop ...
>
>
> For what it's worth ...
>
>
> -Gilbert
>
>
> ------------------------------
> *From:* Azoff, Justin S <jazoff at illinois.edu>
> *Sent:* Monday, October 9, 2017 10:19:26 AM
> *To:* Jim Mellander
> *Cc:* Clark, Gilbert; bro-dev at bro.org
> *Subject:* Re: [Bro-Dev] Performance Enhancements
>
>
> > On Oct 6, 2017, at 5:59 PM, Jim Mellander <jmellander at lbl.gov> wrote:
> >
> > I particularly like the idea of an allocation pool that per-packet
> information can be stored, and reused by the next packet.
>
> Turns out bro does this most of the time... unless you use the next_packet
> event.  Normal connections use the sessions cache which holds connection
> objects, but new_packet has its own code path that creates the ip header
> from scratch for each packet.  I tried to pre-allocate PortVal objects, but
> I think I was screwing something up with 'Ref' and bro would just segfault
> on the 2nd connection.
>
>
> > There also are probably some optimizations of frequent operations now
> that we're in a 64-bit world that could prove useful - the one's complement
> checksum calculation in net_util.cc is one that comes to mind, especially
> since it works effectively a byte at a time (and works with even byte
> counts only).  Seeing as this is done per-packet on all tcp payload,
> optimizing this seems reasonable.  Here's a discussion of doing the checksum
> calc in 64-bit arithmetic: https://locklessinc.com/articles/tcp_checksum/
> - this website also has an x64 allocator that is claimed to be faster than
> tcmalloc, see: https://locklessinc.com/benchmarks_allocator.shtml  (note:
> I haven't tried anything from this source, but find it interesting).
>
> I couldn't get this code to return the right checksums inside bro (some
> casting issue?), but if it is faster it should increase performance by a
> small percentage.  Comparing 'bro -b' runs on a pcap with 'bro -b -C' runs
> (which should show what kind of performance increase we would get if that
> function took 0s to run) shows a decent chunk of time taken computing
> checksums.
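For illustration, the wide-accumulator idea amounts to something like the
following. This is my own sketch of the technique, not the locklessinc.com
code or the net_util.cc function; the name is made up, and it's checked only
against the RFC 1071 example data.

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of the wide-accumulator idea: add 32 bits of payload per step
// into a 64-bit accumulator, letting carries pile up in the high bits,
// then fold 64 -> 32 -> 16 and complement.  Like the current byte-at-a-
// time routine, this assumes an even byte count.
static uint16_t ones_complement_checksum64(const uint8_t* p, size_t len)
{
    uint64_t sum = 0;

    // Two 16-bit words per iteration instead of one byte at a time.
    while ( len >= 4 ) {
        sum += ((uint64_t)p[0] << 24) | ((uint64_t)p[1] << 16)
             | ((uint64_t)p[2] << 8)  |  (uint64_t)p[3];
        p += 4;
        len -= 4;
    }

    if ( len >= 2 )  // trailing 16-bit word
        sum += ((uint64_t)p[0] << 8) | p[1];

    // Fold the accumulated carries back down to 16 bits.
    sum = (sum >> 32) + (sum & 0xffffffffULL);
    sum = (sum >> 32) + (sum & 0xffffffffULL);
    sum = (sum >> 16) + (sum & 0xffff);
    sum = (sum >> 16) + (sum & 0xffff);

    return (uint16_t)~sum;
}
```

One's complement addition is what makes this legal: the sum can be computed
in any word size as long as the carries are folded back in at the end
(RFC 1071).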
>
> > I'm guessing there are a number of such "small" optimizations that could
> provide significant performance gains.
>
> I've been trying to figure out the best way to profile bro.  So far,
> attempting to use linux perf or google perftools hasn't shed much light
> on anything.  I think the approach I was using to benchmark certain
> operations in the bro language is the better one.
>
> Instead of running bro and trying to profile it to figure out what is
> causing the most load, simply compare the execution of two bro runs with
> slightly different scripts/settings.  I think this will end up being the
> better approach because it answers real questions like "If I load this
> script or change this setting, what is the performance impact on the bro
> process?"  When I last did this, I compared the performance from one bro
> commit to the next, but I never tried comparing bro with one set of
> scripts loaded against bro with a different set loaded.
>
> For example, the simplest and most dramatic test I came up with so far:
>
> $ time bro -r 2009-M57-day11-18.trace -b
> real    0m2.434s
> user    0m2.236s
> sys     0m0.200s
>
> $ cat np.bro
> event new_packet(c: connection, p: pkt_hdr)
> {
>
> }
>
> $ time bro -r 2009-M57-day11-18.trace -b np.bro
> real    0m10.588s
> user    0m10.392s
> sys     0m0.204s
>
> We've been saying for a while that adding that event is expensive, but I
> don't know if it's ever been quantified.
>
> The main thing I still need to figure out is how to do this type of test
> in a cluster environment while replaying a long pcap.
>
>
>
> Somewhat related, came across this presentation yesterday:
>
> https://www.youtube.com/watch?v=NH1Tta7purM
> CppCon 2017: Carl Cook, “When a Microsecond Is an Eternity: High
> Performance Trading Systems in C++”
>
> Among other things, he mentions using a memory pool for objects instead of
> creating/deleting them.
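A minimal version of that idea looks something like this (an illustrative
free-list sketch, not Bro code; PktInfo is a made-up placeholder type):

```cpp
#include <vector>

// Hypothetical per-packet record that would otherwise be new'd and
// delete'd once per packet.
struct PktInfo { int len = 0; };

// Minimal free-list pool: release() parks objects instead of deleting
// them, and acquire() hands them back out, so the steady state does no
// heap allocation at all.
template <typename T>
class Pool {
public:
    T* acquire() {
        if ( free_.empty() )
            return new T();          // grow only when the pool runs dry

        T* obj = free_.back();
        free_.pop_back();
        return obj;
    }

    void release(T* obj) { free_.push_back(obj); }  // recycle, don't delete

    ~Pool() {
        for ( T* obj : free_ )
            delete obj;
    }

private:
    std::vector<T*> free_;
};
```

The caveat from earlier in the thread applies: the recycled object has to be
reset to a clean state before reuse, which is exactly the sort of place a
stale Ref or leftover field bites you.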
>
>
>
> -- Justin Azoff
>
>

