[Bro] Logging and memory leak
hovsep.sanjay.levi at gmail.com
Mon Feb 6 11:14:28 PST 2017
On Sat, Feb 4, 2017 at 4:00 AM, Azoff, Justin S <jazoff at illinois.edu> wrote:
> What timeframe was that pcap for? If the pcap was from an hour or so, it's
> probably nothing.. but if that was from a few seconds you could have a
> problem there.
The pcap is 9 seconds during the problem timeframe at the peak of system
> The 2000+ things are definitely related to SSL, as well as the other
> strings in there.. if you look at the raw tcpdump output those would make
> more sense in the normal order. The numbers are a little inflated because
> when the manager sends out something like the Notice::begin_suppression
> event, it has to send it once to each worker (which is also something that
> needs to be addressed to scale Bro better). ~2064 events would have
> been sent out for only ~14 notice events if you had 150 workers.
Good to know. 132 workers.
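The fan-out arithmetic above is easy to sanity-check: if the manager re-sends each notice-related event once per worker, total messages scale as notices times workers. A minimal sketch of that model (the function name and the one-event-per-notice assumption are mine, not from the thread):

```python
# Simplified model of manager -> worker event fan-out in a Bro cluster:
# broadcast events such as Notice::begin_suppression are sent once to
# each worker, so message volume grows linearly with the worker count.

def fanout(num_notices: int, num_workers: int) -> int:
    """Total manager->worker messages, assuming one event per notice."""
    return num_notices * num_workers

# ~14 notices broadcast to 150 workers => ~2100 messages, in line with
# the ~2064 events mentioned above.
print(fanout(14, 150))  # 2100
print(fanout(14, 132))  # 1848 with this cluster's 132 workers
```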
> What does your notice.log contain related to SSL. Do you have a TON of
> notices for Invalid_Server_Cert or something like it? Is your
> known_certs.log file growing rapidly?
Not sure at the moment.
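One quick way to answer that would be to count notice types straight out of notice.log. On the cluster this would normally be done with bro-cut (which ships with Bro and extracts named columns from the TSV logs); the sketch below substitutes a tiny hand-written column of notice names so the pipeline runs standalone, and the names and counts are made up for illustration:

```shell
# On a real cluster:
#   cat notice.log | bro-cut note | sort | uniq -c | sort -rn
# Self-contained stand-in for the extracted 'note' column:
printf '%s\n' SSL::Invalid_Server_Cert SSL::Invalid_Server_Cert Scan::Address_Scan |
  sort | uniq -c | sort -rn
# -> 2 SSL::Invalid_Server_Cert, 1 Scan::Address_Scan
```

A similarly quick check for known_certs.log growth is `wc -l known_certs.log` sampled a few minutes apart.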
> The 1800 Conn::IN_ORIG are from workers -> manager from
> using a larger cutoff for head may have shown SSL::Invalid_Server_Cert.
It did not, but I think that type of message would have been sent to the
Logger on a different port.
> One thing you could try is commenting out anything in your config related
> to ssl or intel, and see if that's stable. That would help narrow down
> what the problem is.
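For that bisection, the usual approach is commenting out the relevant @load lines in local.bro. A sketch, assuming the stock Bro 2.x script paths (this cluster's actual config may differ):

```bro
# local.bro: temporarily disable SSL/intel scripts while narrowing down
# the leak. Re-enable one at a time after the cluster is stable.
# @load protocols/ssl/validate-certs   # source of SSL::Invalid_Server_Cert notices
# @load protocols/ssl/known-certs      # populates known_certs.log
# @load frameworks/intel/seen
# @load frameworks/intel/do_notice
```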
> In general, the manager just isn't doing much anymore, so for it to be
> using that much ram that fast, it would have to be doing something
> extremely frequently. That's why knowing the timeframe is really important.
> If your cluster is doing something like generating Invalid_Server_Cert
> notices at an extremely high rate, then it's possible that the manager
> parent is trying to tell all the workers about it and the manager child is
> not able to keep up. That kind of fits with this output:
I'll try that. Seems like I'll be able to narrow down the issue this
week. There's a weekly pattern to the failure, starting late Thursday and
continuing through most of the day Friday, so I'm guessing either a
researcher or an automated scan is the cause.