[Bro] Logging and memory leak

Azoff, Justin S jazoff at illinois.edu
Fri Feb 3 20:00:39 PST 2017


> On Feb 3, 2017, at 7:48 PM, Hovsep Levi <hovsep.sanjay.levi at gmail.com> wrote:
> 
> No, no custom scripts.  I found the cluster overwhelmed again today with massive virtual memory usage and an hour after restarting it the same condition returns.
> 
> I'm using a single logger since the last time.  It seems when using a Kafka only export a single logger works fine as the event timestamps arriving at Kafka are near realtime.  (the "ts" for conn, http, etc.) 
> 
> The logger is using 47761, manager is 47762.  I took a sample of 5000 packets for each during the high memory usage and it looks like the manager is still receiving logs of some sort dealing with x509 certificates.
> 
> 
> tcpdump -A -r Bro__Manager_port_47662.pcap | egrep -io '[A-Za-z_:-]{10,}' | sort | uniq -c | sort -rn | head -10
> reading from file Bro__Manager_port_47662.pcap, link-type EN10MB (Ethernet)
> 2852 certificate
> 2064 Notice::begin_suppressionA
> 1800 Conn::IN_ORIG
>  739 Authentication
>  685 Identifier
>  681 Encipherment
>  660 validation
>  646 Corporation
> 

What timeframe was that pcap for?  If those 5000 packets were captured over an hour or so, it's probably nothing... but if they were captured in just a few seconds, you could have a problem there.
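
If you still have the capture, the quickest way to pin down the timeframe is to compare the first and last packet timestamps.  Something like this should do it (the filename is just the one from your tcpdump command; capinfos ships with Wireshark, so skip that part if you don't have it installed):

  # timestamp of the first and last packet in the capture
  tcpdump -tttt -r Bro__Manager_port_47662.pcap 2>/dev/null | head -1
  tcpdump -tttt -r Bro__Manager_port_47662.pcap 2>/dev/null | tail -1

  # or let capinfos report the capture duration and packet rate directly
  capinfos Bro__Manager_port_47662.pcap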

The 2000+ entries are definitely SSL-related, as are most of the other strings in there; they would make more sense if you looked at the raw tcpdump output in its normal order.  The numbers are also a little inflated because when the manager sends out something like the Notice::begin_suppression event, it has to send it once to each worker (which is something that needs to be addressed to scale Bro better).  With 150 workers, ~2064 events would have been sent out for only ~14 notices.

What does your notice.log contain related to SSL?  Do you have a TON of notices for Invalid_Server_Cert or something like it?  Is your known_certs.log file growing rapidly?
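
A quick way to check is to count the notice types directly; this is just a sketch, assuming bro-cut is on your path and you run it from the current logs directory:

  # count how many notices of each type are in the current notice.log
  bro-cut note < notice.log | sort | uniq -c | sort -rn | head

  # and see how many certs you've accumulated so far
  wc -l known_certs.log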

Using a larger cutoff for head may have shown SSL::Invalid_Server_Cert in that list.
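
i.e. the same pipeline you ran, just with a bigger cutoff:

  tcpdump -A -r Bro__Manager_port_47662.pcap | egrep -io '[A-Za-z_:-]{10,}' | sort | uniq -c | sort -rn | head -30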

The 1800 Conn::IN_ORIG entries are events going from the workers to the manager, generated by policy/frameworks/intel/seen/conn-established.bro.
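
You can confirm that script is actually loaded on your cluster by grepping loaded_scripts.log (assuming that log is enabled and sitting in your current logs directory):

  grep conn-established loaded_scripts.log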


One thing you could try is commenting out anything in your config related to ssl or intel and seeing if the cluster is stable without it.  That would help narrow down where the problem is.
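
The ssl/intel bits are usually pulled in via @load lines in site/local.bro; the path below is just a guess for a default install, adjust it to yours:

  # list the @load lines you'd want to comment out while testing
  grep -nE 'ssl|intel' /usr/local/bro/share/bro/site/local.bro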

In general, the manager just isn't doing much anymore, so for it to be using that much RAM that fast, it would have to be doing something extremely frequently.  That's why knowing the timeframe is really important :-)

If your cluster is doing something like generating Invalid_Server_Cert notices at an extremely high rate, then it's possible that the manager parent is trying to tell all the workers about it and the manager child is not able to keep up.  That kind of fits with this output:

manager      manager 10.1.1.1   37653   child   506M   237M 100%  bro
manager      manager 10.1.1.1   37600   parent  640G    46G  38%  bro
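
To see whether the notice rate is really that high, you could bucket the SSL notices by minute; a rough sketch, again assuming bro-cut and a stock notice.log:

  # Invalid_Server_Cert notices per minute (timestamp truncated to the minute)
  bro-cut -d ts note < notice.log | grep Invalid_Server_Cert | cut -c1-16 | uniq -c | tail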

-- 
- Justin Azoff



