[Bro] A more parallel Bro

dalbrech at illinois.edu
Mon Mar 2 08:25:42 PST 2009


Robin Sommer wrote:
> On Wed, Feb 25, 2009 at 14:26 -0600, dalbrech at illinois.edu wrote:
>
>> I've been doing some performance profiling on Bro. 
>
> Interesting! Do you have any results you could share?
I saw Vern's talk a while back at UIUC about how Bro works.  As I recall, there 
were three major components:

1. A packet capture engine, based on libpcap, which uses BPF to selectively 
discard packets as early as practical (to reduce the overall volume of traffic; 
see the sketch after this list),
2. Protocol parsers, which turn blocks of data from the network into 
semantically-meaningful PDU-like data structures, and
3. An event engine, which applies constraints to the lexed protocol data, and 
possibly raises alarms if the traffic is suspicious or otherwise "interesting".
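
To make the first component concrete, here is a minimal standalone sketch of 
the early-discard idea, written directly against the libpcap API (my own 
illustration, not Bro's actual capture code; the interface name and the 
"udp port 53" filter expression are just placeholders):

    #include <pcap.h>
    #include <cstdio>

    int main()
    {
        char errbuf[PCAP_ERRBUF_SIZE];

        // Open the capture device; "eth0" is a placeholder.
        pcap_t* ph = pcap_open_live("eth0", 65535, 1, 100, errbuf);
        if ( ! ph ) {
            fprintf(stderr, "%s\n", errbuf);
            return 1;
        }

        // Compile a BPF filter and push it into the kernel so that
        // non-matching packets are dropped before they reach userspace.
        struct bpf_program prog;
        if ( pcap_compile(ph, &prog, "udp port 53", 1, 0) < 0 ||
             pcap_setfilter(ph, &prog) < 0 ) {
            fprintf(stderr, "%s\n", pcap_geterr(ph));
            return 1;
        }

        // pcap_loop()/pcap_next_ex() would now deliver only matching traffic.
        pcap_freecode(&prog);
        pcap_close(ph);
        return 0;
    }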

My group initially decided to pursue protocol parsing, because our micro-
benchmarks (many of which appeared in our RAID '08 paper) showed that 
protocol parser-generators (e.g. binpac and Nikita et al.'s GAPA) leave 
significant performance on the table versus a hand-coded parser.  In particular, 
binpac's use of C++ objects to represent PDUs really suffers on protocols 
which feature small, deeply-nested messages, like DNS.
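
As a rough caricature of the problem (my own illustration, not binpac's actual 
generated code), imagine the parse result of a single DNS message materialized 
as a tree of heap-allocated C++ objects:

    #include <cstdint>
    #include <string>
    #include <vector>

    // One object per name, plus one string per label.
    struct DNSName {
        std::vector<std::string> labels;
    };

    // One object per resource record, each with its own rdata buffer.
    struct DNSRecord {
        DNSName name;
        uint16_t rr_type, rr_class;
        std::vector<unsigned char> rdata;
    };

    // A message with 50+ records becomes hundreds of small allocations and
    // pointer chases.  A hand-coded parser can instead walk the packet buffer
    // in place and keep only offsets and lengths.
    struct DNSMessage {
        std::vector<DNSRecord> questions, answers, authorities, additionals;
    };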

However, the ultimate question is how much the performance gap between a 
hand-coded parser and a machine-generated one matters in the context of a 
real system.  So I built a simple tool I call "dns-packet-gen", which generates 
pathological DNS packets constructed to be as difficult to parse as possible.  
The tool emits pcap traces containing DNS traffic with many records per 
message (50+), all of which feature deeply recursive DNS names.  For the 
record, Wireshark routinely took an hour or more to load a trace of 1M of 
these packets.
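
To give a sense of why such names are expensive, here is a sketch of the kind 
of layout I have in mind, based on RFC 1035 name compression (my reading of 
the format, not dns-packet-gen's exact output):

    #include <cstdint>
    #include <vector>

    // Append a one-label name ("a") that ends in a 2-byte compression pointer
    // back to the name starting at 'target', and return the new name's offset.
    uint16_t append_chained_name(std::vector<uint8_t>& msg, uint16_t target)
    {
        uint16_t off = (uint16_t)msg.size();
        msg.push_back(1);                                // label length
        msg.push_back('a');                              // label data
        msg.push_back((uint8_t)(0xC0 | (target >> 8)));  // pointer, high bits
        msg.push_back((uint8_t)(target & 0xFF));         // pointer, low bits
        return off;
    }

    // Calling this 50+ times, each new name pointing at the previous one,
    // yields names whose full expansion requires chasing a long chain of
    // indirections on every record.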

Based on my understanding of the Bro code, it looks like parsing "normal" DNS 
traffic (real DNS scraped from my own web browsing) only accounts for 4-5% 
of all CPU time (I used Shark on OS X with a debug build, and checked the 
sampled CPU time of all descendants of UDP_Analyzer::DeliverPacket()).  I'm 
using a version of the "detect backdoors" script that includes DPD, which 
might have skewed the results a little.  (If anyone has comments on this, I'd 
appreciate hearing them.)  In any case, my bad DNS traffic only pushed the 
binpac-based analyzers up to 13-14% of CPU time, which is still tiny 
compared to the EventMgr's CPU time.

What I take away from this is that, even if binpac is slow, it's not the piece that 
most demands attention.  So I started thinking about the systems-level issues 
(the Shunt paper took this same direction), and realized the main event loop is 
single-threaded.  And that got me to where I am today :-)
>
>> I'm wondering what the ICSI folks' position is on threads vs. clustering.
>
> In short: we consider them orthogonal and are pursuing both. :-)
>
> To elaborate a bit: The cluster is the approach that can provide
> significantly increased performance right now; ignoring a few
> implementation glitches we still need to sort out, it's working and
> ready to use[1]. The cluster is also an approach that will work
> long-term: there will always be limits to what a single box can do
> (multi-core or not) and the cluster offers the capability to go
> beyond that. 
You're right about scale, modulo communication overhead.  However, I think it's 
important for a large-scale system to be designed so that it gains more 
performance as the architecture guys throw faster and faster chips over the 
wall.  The way I read the tea leaves, core counts in commodity PCs are only 
going to keep climbing, so software that isn't built around a multi-threaded 
event loop is going to hit a performance wall even as the hardware keeps 
getting faster. 
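
To be a bit more concrete about what I mean by a multi-threaded event loop, 
here is a very rough sketch of one way to split the work (entirely my own 
illustration, not the Sarnoff design; the names are made up): hash each 
connection onto a fixed worker thread, so every packet of a flow is analyzed 
on the same core and per-flow state needs no locking.

    #include <cstdint>

    struct Packet {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port, proto;
        // payload, timestamps, ...
    };

    const unsigned kWorkers = 4;

    // Symmetric hash: both directions of a connection map to the same worker.
    unsigned worker_for(const Packet& p)
    {
        uint64_t h = (uint64_t)(p.src_ip ^ p.dst_ip) * 2654435761u
                   + (p.src_port ^ p.dst_port) + p.proto;
        return (unsigned)(h % kWorkers);
    }

    // A single capture thread pushes each packet onto the queue for
    // worker_for(p); each worker thread then runs the protocol parsers and
    // event engine for the flows assigned to it.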
>
> That said, we are in parallel (no pun intended :) working on a
> multi-threaded Bro implementation. While we have already made quite
> a bit of progress on that, that will however still take a bit to get
> into any kind of usable state; we need to restructure Bro internally
> quite a bit to exploit its concurrency potential. Once finished
> though, we expect Bro's performance to scale pretty well with the
> number of available cores (not considering other aspects of the
> hardware which might turn into bottlenecks at some point). The
> Sarnoff paper has some of the basic ideas for the multi-threaded
> Bro, and there'll be an extended journal version of that coming out
> soon with more details on the parallel execution model (let me know
> if you'd like to get a draft). 
>
> Does that answer your question? If you have any particular thoughts
> on either clustering or the multi-threaded Bro, I'd be happy to hear
> them. 
I haven't explored how threadable Bro would be, but it seems you're already on 
it.  Truthfully, I spent a few weeks picking through the code to convince myself 
I knew roughly how to profile it; I'm no expert on Bro internals.  I'm going to 
have a chat soon with Nikita (my academic adviser) to see what he thinks about 
all this, and I'll be in touch.  Please send the draft of the Sarnoff paper to my 
UIUC email address; I'd love to have a look.

I'm wondering how you plan to address cache behavior in your multi-threaded 
implementation of Bro.  Obviously, this is a really hard problem, and I'm not 
sure how well-studied it is in the literature, especially in the context of IDS.  
Fortunately, I have a lot of friends here who do high-performance computing 
work (their long-range research agenda includes a 3000-core CPU tapeout), so 
I'll talk to them about it.

David


