[Bro] Logging and memory leak

Hovsep Levi hovsep.sanjay.levi at gmail.com
Mon Feb 6 11:14:28 PST 2017


On Sat, Feb 4, 2017 at 4:00 AM, Azoff, Justin S <jazoff at illinois.edu> wrote:

> What timeframe was that pcap for?  If the pcap was from an hour or so, it's
> probably nothing.. but if that was from a few seconds you could have a
> problem there.
>

The pcap covers 9 seconds during the problem timeframe, at the peak of system
usage.



> The 2000+ things are definitely related to SSL, as well as the other
> strings in there.. if you look at the raw tcpdump output those would make
> more sense in the normal order..  The numbers are a little inflated because
> when the manager sends out something like the Notice::begin_suppression
> event, it has to send it once to each worker (which is also something that
> needs to be addressed for better Bro scaling).  ~2064 events would have
> been sent out for only ~14 notice events if you had 150 workers.
>
>
Good to know.  132 workers.
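(With 132 workers, 2064 Notice::begin_suppression events would work out to
roughly 2064 / 132 ≈ 15-16 notices being suppressed during the capture.)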


> What does your notice.log contain related to SSL?  Do you have a TON of
> notices for Invalid_Server_Cert or something like it?  Is your
> known_certs.log file growing rapidly?
>


Not sure at the moment.



> using a larger cutoff for head may have shown SSL::Invalid_Server_Cert.
>
>
> The 1800 Conn::IN_ORIG are from workers -> manager from
> policy/frameworks/intel/seen/conn-established.bro
>
>
It did not, but I think that type of message would have been sent to the
Logger on a different port.

2064 Notice::begin_suppression
1800 Conn::IN_ORIG
 396 Notice::cluster_notice
  26 SumStats::cluster_key_intermediate_response
   1 Intel::match_no_items
   1 Conn::Info
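
(For context on the Conn::IN_ORIG entries: as far as I can tell,
policy/frameworks/intel/seen/conn-established.bro does roughly the following
for every fully established connection, which is where that worker -> manager
traffic originates.  Paraphrased from memory, so the exact script may differ:

    event connection_established(c: connection)
        {
        # Once both sides of the connection are established, feed both
        # endpoint addresses into the Intel framework for matching.
        if ( c$orig$state == TCP_ESTABLISHED && c$resp$state == TCP_ESTABLISHED )
            {
            Intel::seen([$host=c$id$orig_h, $conn=c, $where=Conn::IN_ORIG]);
            Intel::seen([$host=c$id$resp_h, $conn=c, $where=Conn::IN_RESP]);
            }
        }

If each of those 1800 Conn::IN_ORIG entries corresponds to one established
connection, that's roughly 200 connections per second over the 9 second
capture.)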



> One thing you could try is commenting out anything in your config related
> to ssl or intel, and see if that's stable.  That would help narrow down
> what the problem is.
>
> In general, the manager just isn't doing much anymore, so for it to be
> using that much ram that fast, it would have to be doing something
> extremely frequently.  That's why knowing the timeframe is really important
> :-)
>
> If your cluster is doing something like generating Invalid_Server_Cert
> notices at an extremely high rate, then it's possible that the manager
> parent is trying to tell all the workers about it and the manager child is
> not able to keep up.  That kind of fits with this output:
>
>

I'll try that (a sketch of what I'd comment out is below).  It seems like
I'll be able to narrow down the issue this week.  There's a weekly pattern to
the failure, starting late Thursday and continuing through most of the day
Friday, so I'm guessing either a researcher or an automated scan is the cause.
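
To disable the SSL and intel pieces for the test, the plan would be to comment
out whatever related @load lines we have in local.bro.  A rough sketch,
assuming something close to the stock local.bro (the exact script names depend
on our config):

    # Intel framework scripts (disable for the test)
    #@load frameworks/intel/seen
    #@load frameworks/intel/do_notice

    # SSL-related policy scripts (disable for the test)
    #@load protocols/ssl/validate-certs
    #@load protocols/ssl/log-hostcerts-only
    #@load protocols/ssl/expiring-certs

followed by a "broctl deploy" to push the change out to the workers.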