[Bro] bro master crashing
Matt Clemons
matt.clemons at gmail.com
Mon Mar 20 11:34:20 PDT 2017
Just wanted to give an update to show how crazy this has been.
The segfaults made me think "memory issue", so I ran memtest on the
system. It has a lot of memory, so this took many hours to complete, and it
finished with 0 errors. I pulled power on the system and, upon boot,
everything came up fine with a limited set of workers. I added all 200+
worker processes back in, and now it's running like a champ again.
The only other thing it could have been was an intermittent power outage on
one of the 10 gig worker boxes. It kept blipping and coming back up. Bro cron
was starting processes, and then that worker system was crashing due to
lack of power. This could have caused the manager to fail, but I can't
really tell what the root cause was.
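
For reference, the kernel segfault line quoted further down can be decoded
with a few lines of Python. This is only a sketch: the helper name
decode_segfault is made up for illustration, the regex assumes the usual
Linux x86 "segfault at ... ip ... sp ... error ..." message layout, and the
bit meanings assume the x86 page-fault error code (bit 0 = protection
violation vs. page not present, bit 1 = write vs. read, bit 2 = user vs.
kernel mode). The log line is rejoined onto one line here because the mail
client wrapped it.

```python
import re

# Assumed layout of the Linux x86 user-space segfault message.
# All numeric fields, including the error code, are printed in hex.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\s\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def decode_segfault(line):
    """Hypothetical helper: return a dict describing the fault, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    info = {
        "process": m.group("comm"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "ip": int(m.group("ip"), 16),
        # x86 page-fault error code bits (assumption stated above).
        "cause": "protection violation" if err & 0x1 else "page not present",
        "access": "write" if err & 0x2 else "read",
        "mode": "user" if err & 0x4 else "kernel",
    }
    if m.group("base"):
        # Offset of the faulting instruction inside the mapped object.
        info["object"] = m.group("obj")
        info["offset"] = info["ip"] - int(m.group("base"), 16)
    return info


if __name__ == "__main__":
    line = ("kernel: bro[60506]: segfault at 0 ip 00000000005fcf8d "
            "sp 00007fffaf9d2f40 error 6 in bro[400000+624000]")
    for key, value in decode_segfault(line).items():
        print(key, hex(value) if key in ("fault_addr", "ip", "offset") else value)
```

Read this way, "segfault at 0 ... error 6" is a user-mode write to a page
that is not present at address 0, which looks like a write through a NULL
pointer inside bro itself rather than what bad RAM would typically produce.
If the binary is not position-independent (the 400000 base suggests that,
but it is a guess), the ip value could be fed to a symbolizer against a bro
build with debug symbols to find the faulting function.
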
Thanks for the responses.
On Thu, Mar 9, 2017 at 5:29 PM, Matt Clemons <matt.clemons at gmail.com> wrote:
> Lots of these.
>
> 0.000000 Reporter::ERROR no such index (Cluster::nodes[Intel::p$
> descr]) /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
> 0.000000 Reporter::ERROR no such index (Cluster::nodes[Intel::p$
> descr]) /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
> 0.000000 Reporter::ERROR no such index (Cluster::nodes[Intel::p$
> descr]) /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
> 0.000000 Reporter::ERROR no such index (Cluster::nodes[Intel::p$
> descr]) /opt/bro/share/bro/base/frameworks/intel/./cluster.bro, line 35
>
> So I commented out that section just for grins, and it still crashes.
>
> [mclemons at bromaster-kcc:~/logs/current ] $ tail -f reporter.log
> 1489101446.599386 Reporter::INFO processing continued (empty)
> 1489101446.582511 Reporter::INFO processing continued (empty)
> 1489101446.565019 Reporter::INFO processing suspended (empty)
> 1489101446.565019 Reporter::INFO processing continued (empty)
> 1489101446.637924 Reporter::INFO processing suspended (empty)
> 1489101446.637924 Reporter::INFO processing continued (empty)
> 1489101446.728349 Reporter::INFO processing continued (empty)
> 1489101446.681030 Reporter::INFO processing continued (empty)
> 1489101446.751914 Reporter::INFO processing continued (empty)
> 1489101446.755815 Reporter::INFO processing continued (empty)
> 0.000000 Reporter::INFO received termination signal (empty)
> #close 2017-03-09-23-19-16
>
> Child died in the communication.log.
>
> And a segfault:
> 2017-03-09T18:34:06.409225+00:00 HOSTNAME kernel: bro[60506]: segfault at
> 0 ip 00000000005fcf8d sp 00007fffaf9d2f40 error 6 in bro[400000+624000]
>
> On Thu, Mar 9, 2017 at 5:06 PM, Azoff, Justin S <jazoff at illinois.edu>
> wrote:
>
>>
>> > On Mar 9, 2017, at 5:11 PM, Matt Clemons <matt.clemons at gmail.com>
>> wrote:
>> >
>> > I've disabled cron.
>> >
>> > Still getting "received termination signal." and "child died" in the
>> communications.log.
>>
>> Ah! "child died" makes things interesting. That's literally the only
>> thing that can cause bro to say 'received termination signal' for an
>> internal reason. I completely forgot about this case :-(
>>
>> When the child process that handles communication dies, the parent can't
>> continue without it so it kills itself so the whole thing can be restarted
>> in a known working state.
>>
>> Is there anything that shows up in your reporter.log or communication.log
>> right before this happens?
>>
>> Is the kernel logging any segfaults to syslog?
>>
>> --
>> - Justin Azoff
>>
>>
>
>
> --
> Regards,
>
> Matt Clemons
> (816) 200-0789
>
--
Regards,
Matt Clemons
(816) 200-0789