[Bro] OOM-killer & Bro

Gary Faulkner gary at doit.wisc.edu
Tue Feb 4 10:43:14 PST 2014


The output is below. I've been running the host longer than I have Orca 
graphs for. When looking at the graphs, you can identify the restarts 
based on the sudden spike in free memory (in blue). There is a series of 
restarts toward the end of last week and beginning of this week where I 
was experimenting with a script and making changes. That script was only 
tested between last Thursday and this Monday. Traffic and log rates were 
taken between 11AM and 1PM. In some cases I collected multiple samples.

The history of OOM events predates the graphs, so I thought it would be 
useful to list them as well. I have been reducing workload & workers 
over time as part of troubleshooting. I also sometimes have an issue 
where Bro log rotation fails, and I need to rotate logs manually and 
restart. This usually happens when available/cached memory drops below 
about 8G.

Each machine has two E5-2670s (8 cores, 2.6GHz) & 64G RAM, so 16 
cores / 32 hyperthreads per machine.

OOM-Killer (host 1, manager + 2 proxies + x workers):
Nov 21 - 24 workers
Nov 22 - 24 workers
Nov 25 - 24 workers
Dec  4 - 20 workers
Dec  5 - 20 workers
Dec  6 - 20 workers
Dec 27 - 12 workers
Jan 26 - 12 workers
Jan 30 - 12 workers - might be related to script testing
Feb  1 - 12 workers - might be related to script testing

OOM-Killer (host 2, 2 proxies + x workers):
Nov 20 - 24 workers
Nov 21 - 24 workers
Nov 22 - 24 workers
Nov 25 - 24 workers
Dec  4 - 20 workers
Dec  5 - 20 workers
Dec  6 - 20 workers
Jan 26 - 12 workers

broctl top 11:15AM:
==================

Name       Type       Node       Pid      Proc     VSize Rss      Cpu      Cmd
manager    manager    host1 27037    parent    14G      12G 99%      bro
manager    manager    host1 27038    child    183M     106M 32%      bro
proxy-1    proxy      host1 27089    child     94M      17M 12%      bro
proxy-1    proxy      host1 27086    parent     1G       1G 5%       bro
proxy-2    proxy      host1 27088    child     94M      19M 14%      bro
proxy-2    proxy      host1 27087    parent     1G       1G 9%       bro
proxy-3    proxy      host2 8848     child     94M      37M 12%      bro
proxy-3    proxy      host2 8846     parent     1G       1G 10%      bro
proxy-4    proxy      host2 8849     child     94M      37M 14%      bro
proxy-4    proxy      host2 8847     parent     1G       1G 5%       bro
worker-1-1 worker     host1 27320    parent     2G       2G 96%      bro
worker-1-1 worker     host1 27421    child    377M      18M 10%      bro
worker-1-10 worker     host1 27323    parent     2G       2G 97%      bro
worker-1-10 worker     host1 27414    child    373M      13M 8%       bro
worker-1-11 worker     host1 27324    parent     2G       2G 100%     bro
worker-1-11 worker     host1 27399    child    369M      10M 7%       bro
worker-1-12 worker     host1 27325    parent     2G       2G 85%      bro
worker-1-12 worker     host1 27410    child    369M       7M 10%      bro
worker-1-2 worker     host1 27328    parent     2G       2G 99%      bro
worker-1-2 worker     host1 27446    child    369M       8M 7%       bro
worker-1-3 worker     host1 27330    parent     2G       2G 97%      bro
worker-1-3 worker     host1 27427    child    369M       9M 8%       bro
worker-1-4 worker     host1 27331    parent     2G       2G 100%     bro
worker-1-4 worker     host1 27411    child    369M      10M 7%       bro
worker-1-5 worker     host1 27332    parent     2G       2G 97%      bro
worker-1-5 worker     host1 27430    child    369M       7M 10%      bro
worker-1-6 worker     host1 27333    parent     3G       3G 98%      bro
worker-1-6 worker     host1 27413    child    369M       9M 10%      bro
worker-1-7 worker     host1 27335    parent     2G       2G 100%     bro
worker-1-7 worker     host1 27426    child    369M       8M 8%       bro
worker-1-8 worker     host1 27334    parent     2G       2G 100%     bro
worker-1-8 worker     host1 27433    child    369M       9M 7%       bro
worker-1-9 worker     host1 27336    parent     2G       2G 100%     bro
worker-1-9 worker     host1 27425    child    369M      10M 9%       bro
worker-2-1 worker     host2 9495     parent     2G       2G 98%      bro
worker-2-1 worker     host2 9533     child    369M      30M 9%       bro
worker-2-10 worker     host2 9494     parent     2G       2G 100%     bro
worker-2-10 worker     host2 9582     child    369M      30M 7%       bro
worker-2-11 worker     host2 9496     parent     3G       3G 97%      bro
worker-2-11 worker     host2 9586     child    369M      30M 9%       bro
worker-2-12 worker     host2 9492     parent     2G       2G 97%      bro
worker-2-12 worker     host2 9585     child    369M      30M 9%       bro
worker-2-2 worker     host2 9502     parent     2G       2G 98%      bro
worker-2-2 worker     host2 9512     child    369M      29M 10%      bro
worker-2-3 worker     host2 9493     parent     2G       2G 98%      bro
worker-2-3 worker     host2 9511     child    370M      31M 9%       bro
worker-2-4 worker     host2 9498     parent     2G       2G 90%      bro
worker-2-4 worker     host2 9576     child    369M      30M 10%      bro
worker-2-5 worker     host2 9503     parent     2G       2G 98%      bro
worker-2-5 worker     host2 9517     child    369M      30M 9%       bro
worker-2-6 worker     host2 9500     parent     2G       2G 99%      bro
worker-2-6 worker     host2 9506     child    369M      31M 7%       bro
worker-2-7 worker     host2 9499     parent     2G       2G 100%     bro
worker-2-7 worker     host2 9538     child    369M      29M 9%       bro
worker-2-8 worker     host2 9497     parent     2G       2G 99%      bro
worker-2-8 worker     host2 9587     child    369M      29M 5%       bro
worker-2-9 worker     host2 9501     parent     2G       2G 98%      bro
worker-2-9 worker     host2 9519     child    369M      30M 7%       bro

broctl top 12:21PM:
==================

Name       Type       Node       Pid      Proc     VSize Rss      Cpu      Cmd
manager    manager    host1 27037    parent    14G      12G 158%     bro
manager    manager    host1 27038    child    143M      66M 37%      bro
proxy-1    proxy      host1 27089    child     94M      17M 10%      bro
proxy-1    proxy      host1 27086    parent     1G       1G 5%       bro
proxy-2    proxy      host1 27088    child     94M      18M 16%      bro
proxy-2    proxy      host1 27087    parent     1G       1G 8%       bro
proxy-3    proxy      host2 8848     child     94M      37M 14%      bro
proxy-3    proxy      host2 8846     parent     1G       1G 12%      bro
proxy-4    proxy      host2 8849     child     94M      37M 16%      bro
proxy-4    proxy      host2 8847     parent     1G       1G 5%       bro
worker-1-1 worker     host1 27320    parent     3G       3G 97%      bro
worker-1-1 worker     host1 27421    child    377M      18M 9%       bro
worker-1-10 worker     host1 27323    parent     3G       3G 99%      bro
worker-1-10 worker     host1 27414    child    373M      13M 10%      bro
worker-1-11 worker     host1 27324    parent     3G       3G 96%      bro
worker-1-11 worker     host1 27399    child    369M      10M 10%      bro
worker-1-12 worker     host1 27325    parent     3G       3G 100%     bro
worker-1-12 worker     host1 27410    child    369M       7M 12%      bro
worker-1-2 worker     host1 27328    parent     3G       3G 98%      bro
worker-1-2 worker     host1 27446    child    369M       7M 10%      bro
worker-1-3 worker     host1 27330    parent     3G       3G 99%      bro
worker-1-3 worker     host1 27427    child    369M       9M 10%      bro
worker-1-4 worker     host1 27331    parent     3G       3G 100%     bro
worker-1-4 worker     host1 27411    child    369M       9M 7%       bro
worker-1-5 worker     host1 27332    parent     3G       3G 95%      bro
worker-1-5 worker     host1 27430    child    369M       7M 10%      bro
worker-1-6 worker     host1 27333    parent     3G       3G 97%      bro
worker-1-6 worker     host1 27413    child    369M       8M 10%      bro
worker-1-7 worker     host1 27335    parent     3G       3G 97%      bro
worker-1-7 worker     host1 27426    child    369M       8M 8%       bro
worker-1-8 worker     host1 27334    parent     3G       3G 99%      bro
worker-1-8 worker     host1 27433    child    369M       9M 9%       bro
worker-1-9 worker     host1 27336    parent     3G       3G 100%     bro
worker-1-9 worker     host1 27425    child    369M      10M 9%       bro
worker-2-1 worker     host2 9495     parent     3G       3G 95%      bro
worker-2-1 worker     host2 9533     child    369M      30M 5%       bro
worker-2-10 worker     host2 9494     parent     3G       3G 98%      bro
worker-2-10 worker     host2 9582     child    369M      30M 7%       bro
worker-2-11 worker     host2 9496     parent     3G       3G 99%      bro
worker-2-11 worker     host2 9586     child    369M      30M 10%      bro
worker-2-12 worker     host2 9492     parent     3G       3G 99%      bro
worker-2-12 worker     host2 9585     child    369M      30M 10%      bro
worker-2-2 worker     host2 9502     parent     3G       3G 98%      bro
worker-2-2 worker     host2 9512     child    369M      29M 10%      bro
worker-2-3 worker     host2 9493     parent     3G       3G 99%      bro
worker-2-3 worker     host2 9511     child    370M      31M 7%       bro
worker-2-4 worker     host2 9498     parent     3G       3G 97%      bro
worker-2-4 worker     host2 9576     child    369M      30M 10%      bro
worker-2-5 worker     host2 9503     parent     3G       3G 98%      bro
worker-2-5 worker     host2 9517     child    369M      30M 9%       bro
worker-2-6 worker     host2 9500     parent     3G       3G 100%     bro
worker-2-6 worker     host2 9506     child    369M      31M 9%       bro
worker-2-7 worker     host2 9499     parent     3G       3G 99%      bro
worker-2-7 worker     host2 9538     child    369M      29M 9%       bro
worker-2-8 worker     host2 9497     parent     3G       3G 98%      bro
worker-2-8 worker     host2 9587     child    369M      29M 7%       bro
worker-2-9 worker     host2 9501     parent     3G       3G 100%     bro
worker-2-9 worker     host2 9519     child    369M      30M 7%       bro
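
As a rough cross-check of the broctl top output above against free -m, the Rss column can be totalled per node. The following is only a sketch and not part of the original post: the function names (size_to_mb, rss_by_node) are hypothetical, and it assumes each data row has the nine whitespace-separated fields shown above, with sizes ending in K, M, or G.

```python
from collections import defaultdict

# Multipliers to convert a size suffix to megabytes (assumed binary units).
UNITS = {"K": 1 / 1024, "M": 1, "G": 1024}


def size_to_mb(text):
    """Convert a size such as '12G' or '106M' to megabytes."""
    return float(text[:-1]) * UNITS[text[-1].upper()]


def rss_by_node(top_output):
    """Sum the Rss column per node from pasted 'broctl top' rows."""
    totals = defaultdict(float)
    for line in top_output.splitlines():
        fields = line.split()
        # Expect: name type node pid proc vsize rss cpu cmd.
        # Skip headers, separators, and anything else without a numeric pid.
        if len(fields) != 9 or not fields[3].isdigit():
            continue
        node, rss = fields[2], fields[6]
        totals[node] += size_to_mb(rss)
    return dict(totals)


# A few rows copied from the 11:15AM output above.
SAMPLE = """\
manager    manager    host1 27037    parent    14G      12G 99%      bro
manager    manager    host1 27038    child    183M     106M 32%      bro
worker-2-1 worker     host2 9495     parent     2G       2G 98%      bro
worker-2-1 worker     host2 9533     child    369M      30M 9%       bro
"""

print(rss_by_node(SAMPLE))
```

Because broctl top rounds large values to whole gigabytes, any total computed this way is coarse and is best treated as an estimate.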

broctl capstats:
=============
11:15AM:
Interface            kpps       mbps       (10s average)
------------------------------
host1/dnacluster:21 460.3      2993.6
host2/dnacluster:22 497.4      3291.3

Total                957.7      6284.9

11:30AM:
Interface            kpps       mbps       (10s average)
------------------------------
host1/dnacluster:21 509.0      3301.1
host2/dnacluster:22 469.3      2933.3

Total                978.3      6234.4

12:15PM:
Interface            kpps       mbps       (10s average)
------------------------------
host1/dnacluster:21 565.3      3741.6
host2/dnacluster:22 522.6      3358.8

Total                1087.9     7100.4

free -m on host 1 (manager + 2 proxies + 12 workers) 11:15AM:
=======================================================
              total       used       free     shared    buffers     cached
Mem:         64377      63670        707          0         71      19091
-/+ buffers/cache:      44506      19871
Swap:         1023        650        373

free -m on host 1 (manager + 2 proxies + 12 workers) 12:15PM:
=======================================================
              total       used       free     shared    buffers     cached
Mem:         64377      64108        269          0          0       8245
-/+ buffers/cache:      55862       8515
Swap:         1023       1023          0

free -m on host 2 (2 proxies + 12 workers) 11:15AM:
==============================================
              total       used       free     shared    buffers     cached
Mem:         64377      34827      29550          0        104       2184
-/+ buffers/cache:      32538      31839
Swap:         1023         17       1006

free -m on host 2 (2 proxies + 12 workers) 12:15PM:
==============================================
              total       used       free     shared    buffers     cached
Mem:         64377      46118      18259          0        104       2186
-/+ buffers/cache:      43827      20550
Swap:         1023         17       1006
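
To compare the free -m samples above, the "used" figure on the "-/+ buffers/cache" line (memory held by processes, excluding buffers and page cache) can be pulled out and differenced. This is a minimal sketch, not part of the original post; app_used_mb is a hypothetical helper, and it assumes the older free output format shown here, which includes the "-/+ buffers/cache:" line.

```python
def app_used_mb(free_output):
    """Return 'used' from the '-/+ buffers/cache:' line of free -m, in MB."""
    for line in free_output.splitlines():
        if line.startswith("-/+ buffers/cache:"):
            used, _free = line.split(":", 1)[1].split()
            return int(used)
    raise ValueError("no '-/+ buffers/cache:' line found")


# The two host 2 samples from above (11:15AM and 12:15PM).
HOST2_1115 = """\
Mem:         64377      34827      29550          0        104       2184
-/+ buffers/cache:      32538      31839
Swap:         1023         17       1006
"""

HOST2_1215 = """\
Mem:         64377      46118      18259          0        104       2186
-/+ buffers/cache:      43827      20550
Swap:         1023         17       1006
"""

growth = app_used_mb(HOST2_1215) - app_used_mb(HOST2_1115)
print(f"host 2 process memory grew by {growth} MB in one hour")
```

On the posted numbers this gives 43827 - 32538 = 11289 MB for host 2 over the hour. Two samples are not enough to say whether the growth is steady.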

Log rate (in logs/current):
=================
11:15AM
cat * | wc -l ; sleep 1m ; cat * | wc -l
22006062
23762376
diff=1,756,314/min

11:30AM
cat * | wc -l ; sleep 1m ; cat * | wc -l
7618833
9873332
diff=2,254,499/min

Bro failed log rotation at 11:40AM, so I had to manually rotate logs and 
restart.

12:28PM
cat * | wc -l ; sleep 1m ; cat * | wc -l
14526373
16633887
diff=2,107,514/min
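
The log-rate measurement above (count lines, wait a minute, count again, subtract) could also be scripted. The following is a sketch under assumptions and not part of the original post: count_lines, lines_per_minute, and sample_rate are hypothetical names, and the directory passed in is assumed to be the logs/current directory containing plain-text logs.

```python
import time
from pathlib import Path


def count_lines(log_dir):
    """Count newline characters across all regular files in log_dir."""
    total = 0
    for path in Path(log_dir).iterdir():
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            # Read in 1 MiB chunks so large logs are not loaded whole.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                total += chunk.count(b"\n")
    return total


def lines_per_minute(before, after, seconds=60.0):
    """Scale the difference between two line counts to a per-minute rate."""
    return (after - before) * 60.0 / seconds


def sample_rate(log_dir, seconds=60.0):
    """Take two counts 'seconds' apart and return lines per minute."""
    before = count_lines(log_dir)
    time.sleep(seconds)
    after = count_lines(log_dir)
    return lines_per_minute(before, after, seconds)


# Arithmetic check against the 11:15AM sample posted above.
print(lines_per_minute(22006062, 23762376))
```

If a log rotation happens between the two counts, the second count can be lower than the first, so a negative or implausibly small rate probably means the sample should be retaken.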

Regards,

Gary Faulkner
UW Madison
Office of Campus Information Security
608-262-8591

On 2/4/2014 9:01 AM, Justin Azoff wrote:
> On Mon, Feb 03, 2014 at 08:31:45PM -0600, Gary Faulkner wrote:
>> I've been thinking I may need more than 64G of RAM per node (16 core /
>> 3-5G traffic, & 12 workers each). I seem to run with 100% of the RAM
>> allocated, but 20-30% of my RAM cached before something happens to cause
>> a sudden drop in cached memory (as seen on Orca graphs) resulting in
>> OOM-killer dropping one or more Bro processes.
> You should be fine with those specs..  12 workers should be using closer
> to 12G of ram, not anywhere near 64G.
>
> Can you post the output of
>
>      free -m     # on one of the worker nodes
>      broctl top  # on the manager
>
> and to get an idea of your msg log rate:
>
>      cat bro/logs/current/* | wc -l ; sleep 1m ; cat bro/logs/current/* | wc -l
>
> Can you also share the memory graph from this system over time,
> particularly after a fresh restart of bro?
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: host1hourlymemuse4FEB2014.png
Type: image/png
Size: 6604 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20140204/3736c6e3/attachment.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: host1monthlymemuse4FEB2014.png
Type: image/png
Size: 7972 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20140204/3736c6e3/attachment-0001.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: host1weeklymemuse4FEB2014.png
Type: image/png
Size: 9472 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20140204/3736c6e3/attachment-0002.bin 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: host1dailymemuse4FEB2014.png
Type: image/png
Size: 10299 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20140204/3736c6e3/attachment-0003.bin 