[Bro] bro master crashing

Matt Clemons matt.clemons at gmail.com
Mon Mar 20 11:34:20 PDT 2017


Just wanted to give an update to show how crazy this has been.

The segfaults made me think "memory issue", so I ran memtest on the
system.  It has a lot of memory, so this took many hours to complete, and it
finished with 0 errors.  I then pulled power on the system, and upon boot
everything came up fine with a limited set of workers.  I added all 200+
worker processes back in, and now it's running like a champ again.

The only other thing it could have been was a power outage on one of
the 10 gig worker boxes.  It kept blipping and coming back up.  Bro cron
was starting processes, and then that worker system was crashing due to
loss of power.  This could have caused the manager to fail, but I can't
really tell what the root cause was.
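
For anyone who finds this thread later: the kernel segfault line quoted
further down carries a bit more information than it looks like.  Below is
a rough Python sketch that pulls the fields apart.  The function name
decode_segfault and the returned keys are my own invention, and the bit
meanings assume the usual x86 page-fault error code (bit 0 = protection
fault vs. page not present, bit 1 = write, bit 2 = user mode), so treat it
as a starting point and not a definitive tool.

```python
import re

# Matches the x86 Linux kernel "segfault at ..." message. All numeric
# fields in that message are printed in hex without a 0x prefix.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if no match."""
    # Mail clients and syslog viewers may wrap the line, so collapse whitespace.
    m = SEGFAULT_RE.search(" ".join(line.split()))
    if m is None:
        return None

    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    addr = int(m.group("addr"), 16)

    info = {
        "program": m.group("prog"),
        "pid": int(m.group("pid")),
        "fault_addr": addr,
        "ip": ip,
        # Assumed x86 page-fault error code bits (see note above).
        "page_present": bool(err & 0x1),
        "write": bool(err & 0x2),
        "user_mode": bool(err & 0x4),
        # A fault in the first page is almost always a NULL pointer access.
        "likely_null_deref": addr < 4096,
        "object": m.group("obj"),
        "offset_in_object": None,
    }
    if m.group("base") is not None:
        # Offset of the faulting instruction inside the mapped object.
        info["offset_in_object"] = ip - int(m.group("base"), 16)
    return info


if __name__ == "__main__":
    sample = (
        "2017-03-09T18:34:06.409225+00:00 HOSTNAME kernel: bro[60506]: "
        "segfault at 0 ip 00000000005fcf8d sp 00007fffaf9d2f40 error 6 "
        "in bro[400000+624000]"
    )
    for key, value in decode_segfault(sample).items():
        print(key, value)
```

For the line in this thread, that reads as a user-mode write to address 0
(a page that is not mapped), which looks more like a NULL pointer
dereference inside bro than bad RAM.  If the binary still has symbols, the
instruction address could be fed to something like addr2line or gdb to find
the function; whether to use the raw ip or the offset depends on how the
binary was built.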

Thanks for the responses.

On Thu, Mar 9, 2017 at 5:29 PM, Matt Clemons <matt.clemons at gmail.com> wrote:

> Lots of these.
>
> 0.000000        Reporter::ERROR no such index (Cluster::nodes[Intel::p$
> descr])  /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
> 0.000000        Reporter::ERROR no such index (Cluster::nodes[Intel::p$
> descr])  /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
> 0.000000        Reporter::ERROR no such index (Cluster::nodes[Intel::p$
> descr])  /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
> 0.000000        Reporter::ERROR no such index (Cluster::nodes[Intel::p$
> descr])  /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
>
> So I commented out that section just for grins, and it still crashes.
>
> [mclemons at bromaster-kcc:~/logs/current ] $ tail -f reporter.log
> 1489101446.599386       Reporter::INFO  processing continued    (empty)
> 1489101446.582511       Reporter::INFO  processing continued    (empty)
> 1489101446.565019       Reporter::INFO  processing suspended    (empty)
> 1489101446.565019       Reporter::INFO  processing continued    (empty)
> 1489101446.637924       Reporter::INFO  processing suspended    (empty)
> 1489101446.637924       Reporter::INFO  processing continued    (empty)
> 1489101446.728349       Reporter::INFO  processing continued    (empty)
> 1489101446.681030       Reporter::INFO  processing continued    (empty)
> 1489101446.751914       Reporter::INFO  processing continued    (empty)
> 1489101446.755815       Reporter::INFO  processing continued    (empty)
> 0.000000        Reporter::INFO  received termination signal     (empty)
> #close  2017-03-09-23-19-16
>
> "Child died" shows up in the communication.log.
>
> And a segfault:
> 2017-03-09T18:34:06.409225+00:00 HOSTNAME kernel: bro[60506]: segfault at
> 0 ip 00000000005fcf8d sp 00007fffaf9d2f40 error 6 in bro[400000+624000]
>
> On Thu, Mar 9, 2017 at 5:06 PM, Azoff, Justin S <jazoff at illinois.edu>
> wrote:
>
>>
>> > On Mar 9, 2017, at 5:11 PM, Matt Clemons <matt.clemons at gmail.com>
>> wrote:
>> >
>> > I've disabled cron.
>> >
>> > Still getting "received termination signal." and "child died" in the
>> communication.log.
>>
>> Ah! "child died" makes things interesting.  That's literally the only
>> thing that can cause bro to say 'received termination signal' for an
>> internal reason.  I completely forgot about this case :-(
>>
>> When the child process that handles communication dies, the parent can't
>> continue without it so it kills itself so the whole thing can be restarted
>> in a known working state.
>>
>> Is there anything that shows up in your reporter.log or communication.log
>> right before this happens?
>>
>> Is the kernel logging any segfaults to syslog?
>>
>> --
>> - Justin Azoff
>>
>>
>
>
> --
> Regards,
>
> Matt Clemons
> (816) 200-0789
>



-- 
Regards,

Matt Clemons
(816) 200-0789