[Bro] internal error: unknown msg type 101 in Poll()

Sean McCreary mccreary at ucar.edu
Sat Feb 20 07:17:35 PST 2010


I have been seeing several crashes per day due to 'internal error:
unknown msg type 101 in Poll()' in the manager process of a Bro cluster
handling ~2.5 Gb/s of traffic.  Here is a typical stack trace:

> Program terminated with signal 6, Aborted.
> #0  0x000000080158ef6c in kill () from /lib/libc.so.6
> #1  0x000000080158ddfd in abort () from /lib/libc.so.6
> #2  0x000000000040b329 in internal_error () at SSLInterpreter.cc:31
> #3  0x000000000050efde in RemoteSerializer::InternalCommError (this=0x8fd3,
> msg=0x8fd3 <Address 0x8fd3 out of bounds>) at RemoteSerializer.cc:2714
> #4  0x000000000051668b in RemoteSerializer::Poll (this=0x7cb7e0,
> may_block=false) at RemoteSerializer.cc:1477
> #5  0x0000000000516c83 in RemoteSerializer::NextTimestamp (this=0x7cb7e0,
> local_network_time=0x7fffffffe330) at RemoteSerializer.cc:1294
> #6  0x00000000004d6575 in IOSourceRegistry::FindSoonest (this=0x79a310,
> ts=0x7fffffffe518) at stl_list.h:131
> #7  0x00000000004f2df3 in net_run () at Net.cc:509
> #8  0x0000000000408938 in main (argc=36152552, argv=0x0) at main.cc:999

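If I'm reading the backtrace right, Poll() dispatches on a message-type
value read from the peer connection and treats anything it doesn't
recognize as fatal, roughly like this simplified sketch (the type names
and values below are invented for illustration, not Bro's actual ones):

#include <cstdio>
#include <cstdlib>

// Hypothetical message types, for illustration only.
enum MsgType { MSG_NONE, MSG_SERIAL, MSG_PING, MSG_PONG };

// Simplified stand-in for Bro's internal_error(): report and abort.
static void internal_error(const char* fmt, int arg)
{
    std::fprintf(stderr, "internal error: ");
    std::fprintf(stderr, fmt, arg);
    std::fprintf(stderr, "\n");
    std::abort();  // raises SIGABRT, matching the abort() in the trace above
}

// Dispatch one message header read from the peer channel.
static void poll_one(int type)
{
    switch ( type ) {
    case MSG_SERIAL:
        // process serialized state ...
        break;

    case MSG_PING:
    case MSG_PONG:
        // process keep-alives ...
        break;

    default:
        // A desynchronized or truncated stream can leave an arbitrary
        // value (here 101) in the type field and end up here.
        internal_error("unknown msg type %d in Poll()", type);
    }
}

int main()
{
    poll_one(MSG_PING);  // fine
    poll_one(101);       // aborts with the message seen in the crash
    return 0;
}
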
This seems to be the same problem as ticket #203.  Robin's comment (see
<http://tracker.icir.org/bro/ticket/203#comment:1>) suggests it may be
caused by high system load, but that doesn't seem to be the case here.

To check this, I have set up two clusters fed by the same input traffic.
The first is a cluster of seven machines with a single Bro instance
running on each: four workers, two proxies, and the manager node.  In
broctl, 'top' rarely reports CPU utilization over 10% for any node, and
memory consumption is typically < 250 MB per process.  The manager
process in this cluster crashes several times per day.

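In node.cfg terms the first cluster looks roughly like this (host and
interface names are placeholders):

[manager]
type=manager
host=manager.example.net

[proxy-1]
type=proxy
host=proxy1.example.net

[proxy-2]
type=proxy
host=proxy2.example.net

[worker-1]
type=worker
host=worker1.example.net
interface=eth1

# worker-2 through worker-4 are identical apart from the host
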
The second cluster is just one machine: a dual quad-core Xeon system
with 16 GB of RAM.  It runs six instances of Bro: four workers, each
listening on a different network interface, one proxy, and one manager.
CPU utilization is often ~50% on the workers, and as high as 20% on the
manager.  Although 'netstats' reports more packet loss for this cluster,
the manager does not crash.

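For comparison, the second box's node.cfg boils down to this (again with
placeholder interface names):

[manager]
type=manager
host=localhost

[proxy-1]
type=proxy
host=localhost

[worker-1]
type=worker
host=localhost
interface=eth2

[worker-2]
type=worker
host=localhost
interface=eth3

# worker-3 and worker-4 are the same apart from the interface
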
Is there some other line of investigation I should pursue?  A
single-machine Bro cluster won't handle much more traffic, so this isn't
a viable long-term workaround.


