[Bro] Stand-alone cluster problems

Fri Jun 12 10:35:41 PDT 2009

Robin Sommer wrote:
> if I understand you correctly, there are actually two problems:
> 
> - Bro is dropping many packets even when running at rather low CPU

Hi Robin,
Yes, that is the way it seemed when I didn't have restrict filters 
turned on.  When the cluster started, the CPU for the Bro process would 
be high, but would drop down to 20-40% even though many packets were 
being dropped after filtering.

> - after a few days, Bro hangs with 99% CPU and stalls.

Partially correct.  Bro appears to be hanging, but the CPU is at 0%, and 
the DroppedPackets/received ratio was banging against 99% just before it 
started to hang.  I haven't restarted the cluster yet, so here is the 
backtrace.

Lines from top:
   PID USERNAME    THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
51061 XXXXXX        1 -20    0  1207M   843M swread 1 606:53  0.00% bro-1.4-rob
51082 XXXXXX        1  44    5 31556K   228K select 0  19:29  0.00% bro-1.4-rob

I tried attaching to the process with the large TIME value.  Is that the 
primary one?

$gdb `which bro-1.4-robin` 51061
(gdb) bt
#0  0x081d0e96 in free (mem=0xd724e28) at malloc.c:4229
#1  0x285cfc01 in operator delete () from /usr/lib/libstdc++.so.6
#2  0x080a8f0a in ~Dictionary (this=0x99cd4a0) at Dict.cc:101
#3  0x081c7348 in ~TableEntryValPDict (this=0x99cd4a0) at Val.h:49
#4  0x081c42ac in ~TableVal (this=0x99cd408) at Val.cc:1697
#5  0x081c0e28 in TableVal::DoExpire (this=0x8669d60, t=1244434191.756459)
     at Obj.h:213
#6  0x081a9be2 in PQ_TimerMgr::DoAdvance (this=0x82f2a18,
     new_t=1244434191.756459, max_expire=300) at Timer.cc:164
#7  0x0813ff09 in expire_timers (src_ps=0x90495a0) at Net.cc:392
#8  0x0813ffbd in net_packet_dispatch (t=1244434191.756459, hdr=0x90495d8,
     pkt=0x9049a6a "", hdr_size=14, src_ps=0x90495a0, pkt_elem=0x0)
     at Net.cc:412
#9  0x08140549 in net_packet_arrival (t=1244434191.756459, hdr=0x90495d8,
     pkt=0x9049a6a "", hdr_size=14, src_ps=0x90495a0) at Net.cc:496
#10 0x0814ef1f in PktSrc::Process (this=0x90495a0) at PktSrc.cc:199
#11 0x081402b5 in net_run () at Net.cc:526
#12 0x080501be in main (argc=454545480, argv=0xbfbfeb28) at main.cc:1056

Here is the bt from the other process just in case it helps.
$gdb `which bro-1.4-robin` 51082
(gdb) bt
#0  0x286f8da3 in select () from /lib/libc.so.7
#1  0x081617fa in SocketComm::Run (this=0xbfbfe770)
     at RemoteSerializer.cc:2743
#2  0x0816629a in RemoteSerializer::Fork (this=0x82fa580)
     at RemoteSerializer.cc:600
#3  0x081664aa in RemoteSerializer::Init (this=0x82fa580)
     at RemoteSerializer.cc:525
#4  0x0804fbab in main (argc=-2147483647, argv=0xbfbfeb28) at main.cc:956

> Is that correct? 
> 
> Regarding the former, generally at 20-30% CPU Bro shouldn't drop any
> significant amount of packets, there's no throttling mechanism or
> such. One guess here would be the operating system. What kind of
> system are you running on? Have you tried the tuning described on
> http://www.net.t-labs.tu-berlin.de/research/bpcs/?

I'm running FreeBSD 7.1 for i386.  I had tried tuning based on the Bro 
Wiki, but the following page showed sysctl debug.bpf_bufsize and sysctl 
debug.bpf_maxbufsize.  Those commands didn't work in FreeBSD 7.1.
http://www.bro-ids.org/wiki/index.php/User_Manual:_Performance_Tuning

The above tu-berlin.de link shows the following:

sysctl -w net.bpf.bufsize=10485760 (10M)
sysctl -w net.bpf.maxbufsize=10485760 (10M)

The Bro-Workshop-July07-tierney.ppt slides say the following should be 
added to /etc/sysctl.conf:

net.bpf.bufsize=4194304 (4M)
net.bpf.maxbufsize=8388608 (8M)

Based on these two examples, I am guessing the bufsize is where the 
buffer starts, and the max is how large it can grow.

Here are my default values:
$sysctl -a |grep net.bpf
net.bpf.maxbufsize: 524288
net.bpf.bufsize: 4096

According to the FreeBSD 7.1 manpage for sysctl, "The -w option has been 
deprecated and is silently ignored".  I'll try setting both to 10M, like 
in the link you sent.

$sysctl net.bpf.bufsize=10485760
$sysctl net.bpf.maxbufsize=10485760

I also added those values to /etc/sysctl.conf so they are set on reboot.
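
To confirm the new values actually took effect, the `sysctl -a | grep net.bpf` 
output can be checked mechanically.  Here is a minimal Python sketch, assuming 
the "name: value" output format shown above.  The helper names parse_sysctl 
and check_bpf are made up for illustration, and the rule that bufsize should 
not exceed maxbufsize is my assumption, not something I have confirmed in the 
FreeBSD documentation.

```python
def parse_sysctl(text):
    """Parse `name: value` lines (as printed by `sysctl -a`) into a dict of ints."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines without a "name: value" shape
        value = value.strip()
        if value.isdigit():
            values[name.strip()] = int(value)
    return values


def check_bpf(values, wanted):
    """Return a list of human-readable problems; an empty list means no problems."""
    problems = []
    for key in ("net.bpf.bufsize", "net.bpf.maxbufsize"):
        if values.get(key) != wanted:
            problems.append("%s is %s, wanted %d" % (key, values.get(key), wanted))
    # Assumption: the default buffer should not be larger than the maximum.
    buf = values.get("net.bpf.bufsize")
    cap = values.get("net.bpf.maxbufsize")
    if buf is not None and cap is not None and buf > cap:
        problems.append("bufsize exceeds maxbufsize")
    return problems


if __name__ == "__main__":
    # The default values reported above, checked against the 10M target.
    defaults = "net.bpf.maxbufsize: 524288\nnet.bpf.bufsize: 4096\n"
    for problem in check_bpf(parse_sysctl(defaults), 10485760):
        print(problem)
```

In practice the text would come from piping the sysctl output into the 
script rather than from a hard-coded string.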

I just restarted the cluster, and the bro-1.4-robin process is sitting 
at 11-13% CPU.  The DroppedPackets/received ratio is fluctuating between 3 
and 25%.  Shouldn't the CPU be maxing out before packets get dropped?

> Another question:
> is there any regularity in the timestamps of when the drops occur?
> Like in regular intervals? (But longer intervals than 10s as that's
> just the reporting interval).

In the previous email, it looks like the intervals were 10s, but there 
was a gap of over a minute at epoch 1244425261.942659, which is right 
before the cluster froze.  I'll try to keep an eye out for that if it 
happens again.
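
Rather than eyeballing the log, gaps longer than the 10s reporting interval 
could be flagged with a short script.  This is only a sketch: it assumes each 
log line begins with an epoch timestamp, the function names are hypothetical, 
the 2-second tolerance is an arbitrary choice, and the sample timestamps at 
the bottom are made up rather than taken from my logs.

```python
def read_timestamps(lines):
    """Extract leading epoch timestamps.
    Assumption: the first whitespace-separated field of each line is the time."""
    stamps = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        try:
            stamps.append(float(fields[0]))
        except ValueError:
            continue  # first field is not a number, so skip the line
    return stamps


def find_gaps(timestamps, interval=10.0, tolerance=2.0):
    """Return (previous, current, delta) for each pair of consecutive
    timestamps that are more than interval + tolerance seconds apart."""
    gaps = []
    ordered = sorted(timestamps)
    for prev, cur in zip(ordered, ordered[1:]):
        delta = cur - prev
        if delta > interval + tolerance:
            gaps.append((prev, cur, delta))
    return gaps


if __name__ == "__main__":
    # Made-up sample: regular 10s reports with one long pause.
    sample = ["0.0 drops", "10.0 drops", "20.1 drops", "95.0 drops", "105.0 drops"]
    for prev, cur, delta in find_gaps(read_timestamps(sample)):
        print("gap of %.1fs between %.6f and %.6f" % (delta, prev, cur))
```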

> I wouldn't be totally surprised if the state checkpointing is the
> culprit. To test that, can you remove the line "@load checkpoint"
> from cluster.bro? 

I haven't tried this yet.  I'll first see if the bpf buffer increase helps. 
If not, I'll try removing the "@load checkpoint" line from cluster.bro.

Tyler