[Bro] Stand-alone cluster problems
Tyler T. Schoenke
Tyler.Schoenke at colorado.edu
Fri Jun 12 10:35:41 PDT 2009
Robin Sommer wrote:
> if I understand you correctly, there are actually two problems:
> - Bro is dropping many packets even when running at rather low CPU
Yes, that is the way it seemed when I didn't have restrict filters
turned on. When the cluster started, the CPU for the Bro process would
be high, but would drop down to 20-40% even though many packets were
being dropped after filtering.
> - after a few days, Bro hangs with 99% CPU and stalls.
Partially correct. Bro appears to be hanging, but the CPU is at 0%, and
the DroppedPackets/received ratio was banging against 99% just before it
started to hang. I haven't restarted the cluster yet, so here are the
relevant lines from top:
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
51061 XXXXXX 1 -20 0 1207M 843M swread 1 606:53 0.00%
51082 XXXXXX 1 44 5 31556K 228K select 0 19:29 0.00%
I tried attaching to the process with the large TIME value (its
"swread" state suggests it is blocked waiting on pages coming back in
from swap, which would explain the 0% CPU). Is that the right one to
look at?
$gdb `which bro-1.4-robin` 51061
#0 0x081d0e96 in free (mem=0xd724e28) at malloc.c:4229
#1 0x285cfc01 in operator delete () from /usr/lib/libstdc++.so.6
#2 0x080a8f0a in ~Dictionary (this=0x99cd4a0) at Dict.cc:101
#3 0x081c7348 in ~TableEntryValPDict (this=0x99cd4a0) at Val.h:49
#4 0x081c42ac in ~TableVal (this=0x99cd408) at Val.cc:1697
#5 0x081c0e28 in TableVal::DoExpire (this=0x8669d60, t=1244434191.756459)
#6 0x081a9be2 in PQ_TimerMgr::DoAdvance (this=0x82f2a18,
new_t=1244434191.756459, max_expire=300) at Timer.cc:164
#7 0x0813ff09 in expire_timers (src_ps=0x90495a0) at Net.cc:392
#8 0x0813ffbd in net_packet_dispatch (t=1244434191.756459, hdr=0x90495d8,
pkt=0x9049a6a "", hdr_size=14, src_ps=0x90495a0, pkt_elem=0x0)
#9 0x08140549 in net_packet_arrival (t=1244434191.756459, hdr=0x90495d8,
pkt=0x9049a6a "", hdr_size=14, src_ps=0x90495a0) at Net.cc:496
#10 0x0814ef1f in PktSrc::Process (this=0x90495a0) at PktSrc.cc:199
#11 0x081402b5 in net_run () at Net.cc:526
#12 0x080501be in main (argc=454545480, argv=0xbfbfeb28) at main.cc:1056
Here is the bt from the other process just in case it helps.
$gdb `which bro-1.4-robin` 51082
#0 0x286f8da3 in select () from /lib/libc.so.7
#1 0x081617fa in SocketComm::Run (this=0xbfbfe770) at
#2 0x0816629a in RemoteSerializer::Fork (this=0x82fa580)
#3 0x081664aa in RemoteSerializer::Init (this=0x82fa580)
#4 0x0804fbab in main (argc=-2147483647, argv=0xbfbfeb28) at main.cc:956
> Is that correct?
> Regarding the former, generally at 20-30% CPU Bro shouldn't drop any
> significant amount of packets, there's no throttling mechanism or
> such. One guess here would be the operating system. What kind of
> system are you running on? Have you tried the tuning described on
I'm running FreeBSD 7.1 on i386. I had tried tuning based on the Bro
Wiki, but the page there showed sysctl debug.bpf_bufsize and sysctl
debug.bpf_maxbufsize, and those sysctls don't exist in FreeBSD 7.1.
The above tu-berlin.de link shows the following:
sysctl -w net.bpf.bufsize=10485760 (10M)
sysctl -w net.bpf.maxbufsize=10485760 (10M)
The Bro-Workshop-July07-tierney.ppt showed that equivalent entries
should be added to /etc/sysctl.conf.
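Presumably those entries take sysctl.conf's name=value form, i.e.
something like:

net.bpf.bufsize=10485760
net.bpf.maxbufsize=10485760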
Based on these two examples, I am guessing that bufsize is the initial
buffer size and maxbufsize is the ceiling it is allowed to grow to.
Here are my default values:
$sysctl -a |grep net.bpf
According to the FreeBSD 7.1 manpage for sysctl, "The -w option has been
deprecated and is silently ignored". I'll try setting both to 10M, like
in the link you sent.
I also added those values to the /etc/sysctl.conf so they get set on reboot.
I just restarted the cluster, and the bro-1.4-robin process is sitting
at 11-13% CPU. The DroppedPackets/received ratio is fluctuating between
3% and 25%. Shouldn't the CPU be maxing out before packets get dropped?
> Another question:
> is there any regularity in the timestamps of when the drops occur?
> Like in regular intervals? (But longer intervals than 10s as that's
> just the reporting interval).
In the previous email, it looks like the intervals were 10s, but there
was a gap of over a minute at epoch 1244425261.942659, which is right
before the cluster froze. I'll try to keep an eye out for that if it
happens again.
> I wouldn't be totally surprised if the state checkpointing is the
> culprit. To test that, can you remove the line "@load checkpoint"
> from cluster.bro?
I haven't tried this yet. I'll see if the bpf buffer increase helps.
If not, I'll try removing the "@load checkpoint" line from cluster.bro.
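Commenting the line out should be enough, e.g. (the cluster.bro path
here assumes a default install prefix, so adjust as needed):

# comment out the checkpoint load; path assumes the stock policy dir
sed -i .bak 's/^@load checkpoint/#@load checkpoint/' /usr/local/bro/policy/cluster.bro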