[Bro] OOM-killer & Bro
Gary Faulkner
gary at doit.wisc.edu
Tue Feb 4 10:43:14 PST 2014
The output is below. The host has been running longer than I have had
Orca graphs for it. In the graphs, you can identify the restarts by the
sudden spikes in free memory (in blue). There is a series of
restarts toward the end of last week and beginning of this week where I
was experimenting with a script and making changes. That script was only
tested between last Thursday and this Monday. Traffic and log rates were
taken between 11AM and 1PM. In some cases I tried to collect multiple samples.
The history of OOM events predates the graphs, so I thought it would be
useful to list them as well. I have been reducing workload and workers
over time as part of troubleshooting. I also sometimes have an issue
where Bro log rotation fails, and I need to rotate logs manually and
restart. This usually happens when available/cached memory drops below
about 8G.
Each machine has two E5-2670s (8 cores each, 2.6GHz) and 64G of RAM, so
16 cores / 32 hyperthreads per machine.
OOM-Killer (host 1, manager + 2 proxies + x workers):
Nov 21 - 24 workers
Nov 22 - 24 workers
Nov 25 - 24 workers
Dec 4 - 20 workers
Dec 5 - 20 workers
Dec 6 - 20 workers
Dec 27 - 12 workers
Jan 26 - 12 workers
Jan 30 - 12 workers - might be related to script testing
Feb 1 - 12 workers - might be related to script testing
OOM-Killer (host 2, 2 proxies + x workers):
Nov 20 - 24 workers
Nov 21 - 24 workers
Nov 22 - 24 workers
Nov 25 - 24 workers
Dec 4 - 20 workers
Dec 5 - 20 workers
Dec 6 - 20 workers
Jan 26 - 12 workers
broctl top 11:15AM:
==================
Name Type Node Pid Proc VSize Rss Cpu Cmd
manager manager host1 27037 parent 14G 12G 99% bro
manager manager host1 27038 child 183M 106M 32% bro
proxy-1 proxy host1 27089 child 94M 17M 12% bro
proxy-1 proxy host1 27086 parent 1G 1G 5% bro
proxy-2 proxy host1 27088 child 94M 19M 14% bro
proxy-2 proxy host1 27087 parent 1G 1G 9% bro
proxy-3 proxy host2 8848 child 94M 37M 12% bro
proxy-3 proxy host2 8846 parent 1G 1G 10% bro
proxy-4 proxy host2 8849 child 94M 37M 14% bro
proxy-4 proxy host2 8847 parent 1G 1G 5% bro
worker-1-1 worker host1 27320 parent 2G 2G 96% bro
worker-1-1 worker host1 27421 child 377M 18M 10% bro
worker-1-10 worker host1 27323 parent 2G 2G 97% bro
worker-1-10 worker host1 27414 child 373M 13M 8% bro
worker-1-11 worker host1 27324 parent 2G 2G 100% bro
worker-1-11 worker host1 27399 child 369M 10M 7% bro
worker-1-12 worker host1 27325 parent 2G 2G 85% bro
worker-1-12 worker host1 27410 child 369M 7M 10% bro
worker-1-2 worker host1 27328 parent 2G 2G 99% bro
worker-1-2 worker host1 27446 child 369M 8M 7% bro
worker-1-3 worker host1 27330 parent 2G 2G 97% bro
worker-1-3 worker host1 27427 child 369M 9M 8% bro
worker-1-4 worker host1 27331 parent 2G 2G 100% bro
worker-1-4 worker host1 27411 child 369M 10M 7% bro
worker-1-5 worker host1 27332 parent 2G 2G 97% bro
worker-1-5 worker host1 27430 child 369M 7M 10% bro
worker-1-6 worker host1 27333 parent 3G 3G 98% bro
worker-1-6 worker host1 27413 child 369M 9M 10% bro
worker-1-7 worker host1 27335 parent 2G 2G 100% bro
worker-1-7 worker host1 27426 child 369M 8M 8% bro
worker-1-8 worker host1 27334 parent 2G 2G 100% bro
worker-1-8 worker host1 27433 child 369M 9M 7% bro
worker-1-9 worker host1 27336 parent 2G 2G 100% bro
worker-1-9 worker host1 27425 child 369M 10M 9% bro
worker-2-1 worker host2 9495 parent 2G 2G 98% bro
worker-2-1 worker host2 9533 child 369M 30M 9% bro
worker-2-10 worker host2 9494 parent 2G 2G 100% bro
worker-2-10 worker host2 9582 child 369M 30M 7% bro
worker-2-11 worker host2 9496 parent 3G 3G 97% bro
worker-2-11 worker host2 9586 child 369M 30M 9% bro
worker-2-12 worker host2 9492 parent 2G 2G 97% bro
worker-2-12 worker host2 9585 child 369M 30M 9% bro
worker-2-2 worker host2 9502 parent 2G 2G 98% bro
worker-2-2 worker host2 9512 child 369M 29M 10% bro
worker-2-3 worker host2 9493 parent 2G 2G 98% bro
worker-2-3 worker host2 9511 child 370M 31M 9% bro
worker-2-4 worker host2 9498 parent 2G 2G 90% bro
worker-2-4 worker host2 9576 child 369M 30M 10% bro
worker-2-5 worker host2 9503 parent 2G 2G 98% bro
worker-2-5 worker host2 9517 child 369M 30M 9% bro
worker-2-6 worker host2 9500 parent 2G 2G 99% bro
worker-2-6 worker host2 9506 child 369M 31M 7% bro
worker-2-7 worker host2 9499 parent 2G 2G 100% bro
worker-2-7 worker host2 9538 child 369M 29M 9% bro
worker-2-8 worker host2 9497 parent 2G 2G 99% bro
worker-2-8 worker host2 9587 child 369M 29M 5% bro
worker-2-9 worker host2 9501 parent 2G 2G 98% bro
worker-2-9 worker host2 9519 child 369M 30M 7% bro
broctl top 12:21PM:
==================
Name Type Node Pid Proc VSize Rss Cpu Cmd
manager manager host1 27037 parent 14G 12G 158% bro
manager manager host1 27038 child 143M 66M 37% bro
proxy-1 proxy host1 27089 child 94M 17M 10% bro
proxy-1 proxy host1 27086 parent 1G 1G 5% bro
proxy-2 proxy host1 27088 child 94M 18M 16% bro
proxy-2 proxy host1 27087 parent 1G 1G 8% bro
proxy-3 proxy host2 8848 child 94M 37M 14% bro
proxy-3 proxy host2 8846 parent 1G 1G 12% bro
proxy-4 proxy host2 8849 child 94M 37M 16% bro
proxy-4 proxy host2 8847 parent 1G 1G 5% bro
worker-1-1 worker host1 27320 parent 3G 3G 97% bro
worker-1-1 worker host1 27421 child 377M 18M 9% bro
worker-1-10 worker host1 27323 parent 3G 3G 99% bro
worker-1-10 worker host1 27414 child 373M 13M 10% bro
worker-1-11 worker host1 27324 parent 3G 3G 96% bro
worker-1-11 worker host1 27399 child 369M 10M 10% bro
worker-1-12 worker host1 27325 parent 3G 3G 100% bro
worker-1-12 worker host1 27410 child 369M 7M 12% bro
worker-1-2 worker host1 27328 parent 3G 3G 98% bro
worker-1-2 worker host1 27446 child 369M 7M 10% bro
worker-1-3 worker host1 27330 parent 3G 3G 99% bro
worker-1-3 worker host1 27427 child 369M 9M 10% bro
worker-1-4 worker host1 27331 parent 3G 3G 100% bro
worker-1-4 worker host1 27411 child 369M 9M 7% bro
worker-1-5 worker host1 27332 parent 3G 3G 95% bro
worker-1-5 worker host1 27430 child 369M 7M 10% bro
worker-1-6 worker host1 27333 parent 3G 3G 97% bro
worker-1-6 worker host1 27413 child 369M 8M 10% bro
worker-1-7 worker host1 27335 parent 3G 3G 97% bro
worker-1-7 worker host1 27426 child 369M 8M 8% bro
worker-1-8 worker host1 27334 parent 3G 3G 99% bro
worker-1-8 worker host1 27433 child 369M 9M 9% bro
worker-1-9 worker host1 27336 parent 3G 3G 100% bro
worker-1-9 worker host1 27425 child 369M 10M 9% bro
worker-2-1 worker host2 9495 parent 3G 3G 95% bro
worker-2-1 worker host2 9533 child 369M 30M 5% bro
worker-2-10 worker host2 9494 parent 3G 3G 98% bro
worker-2-10 worker host2 9582 child 369M 30M 7% bro
worker-2-11 worker host2 9496 parent 3G 3G 99% bro
worker-2-11 worker host2 9586 child 369M 30M 10% bro
worker-2-12 worker host2 9492 parent 3G 3G 99% bro
worker-2-12 worker host2 9585 child 369M 30M 10% bro
worker-2-2 worker host2 9502 parent 3G 3G 98% bro
worker-2-2 worker host2 9512 child 369M 29M 10% bro
worker-2-3 worker host2 9493 parent 3G 3G 99% bro
worker-2-3 worker host2 9511 child 370M 31M 7% bro
worker-2-4 worker host2 9498 parent 3G 3G 97% bro
worker-2-4 worker host2 9576 child 369M 30M 10% bro
worker-2-5 worker host2 9503 parent 3G 3G 98% bro
worker-2-5 worker host2 9517 child 369M 30M 9% bro
worker-2-6 worker host2 9500 parent 3G 3G 100% bro
worker-2-6 worker host2 9506 child 369M 31M 9% bro
worker-2-7 worker host2 9499 parent 3G 3G 99% bro
worker-2-7 worker host2 9538 child 369M 29M 9% bro
worker-2-8 worker host2 9497 parent 3G 3G 98% bro
worker-2-8 worker host2 9587 child 369M 29M 7% bro
worker-2-9 worker host2 9501 parent 3G 3G 100% bro
worker-2-9 worker host2 9519 child 369M 30M 7% bro
broctl capstats:
=============
11:15AM:
Interface kpps mbps (10s average)
------------------------------
host1/dnacluster:21 460.3 2993.6
host2/dnacluster:22 497.4 3291.3
Total 957.7 6284.9
11:30AM:
Interface kpps mbps (10s average)
------------------------------
host1/dnacluster:21 509.0 3301.1
host2/dnacluster:22 469.3 2933.3
Total 978.3 6234.4
12:15PM:
Interface kpps mbps (10s average)
------------------------------
host1/dnacluster:21 565.3 3741.6
host2/dnacluster:22 522.6 3358.8
Total 1087.9 7100.4
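For sizing purposes it can help to turn the capstats readings into a per-worker load and a mean packet size. This is a rough sketch in Python; it assumes the traffic is split evenly across the workers on a host, which the capstats output does not confirm, and the function name is hypothetical.

```python
def per_worker_load(kpps, mbps, workers):
    """Split one interface's capstats reading evenly across its workers."""
    return {
        "kpps_per_worker": kpps / workers,
        "mbps_per_worker": mbps / workers,
        # mean packet size in bytes: (Mbit/s * 1e6 / 8) / (kpkt/s * 1e3)
        "mean_packet_bytes": (mbps * 1e6 / 8) / (kpps * 1e3),
    }

# host1 at 12:15PM with 12 workers (even split is an assumption)
load = per_worker_load(kpps=565.3, mbps=3741.6, workers=12)
for name, value in load.items():
    print(f"{name}: {value:.1f}")
```

For the 12:15PM host1 reading this works out to about 312 Mbps per worker.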
free -m on host 1 (manager + 2 proxies + 12 workers) 11:15AM:
=======================================================
             total       used       free     shared    buffers     cached
Mem:         64377      63670        707          0         71      19091
-/+ buffers/cache:      44506      19871
Swap:         1023        650        373
free -m on host 1 (manager + 2 proxies + 12 workers) 12:15PM:
=======================================================
             total       used       free     shared    buffers     cached
Mem:         64377      64108        269          0          0       8245
-/+ buffers/cache:      55862       8515
Swap:         1023       1023          0
free -m on host 2 (2 proxies + 12 workers) 11:15AM:
==============================================
             total       used       free     shared    buffers     cached
Mem:         64377      34827      29550          0        104       2184
-/+ buffers/cache:      32538      31839
Swap:         1023         17       1006
free -m on host 2 (2 proxies + 12 workers) 12:15PM:
==============================================
             total       used       free     shared    buffers     cached
Mem:         64377      46118      18259          0        104       2186
-/+ buffers/cache:      43827      20550
Swap:         1023         17       1006
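The "-/+ buffers/cache" row is the one that reflects memory held by processes: it subtracts buffers and page cache from "used" and adds them to "free". A small Python sketch of that arithmetic, using the host 1 values from the Mem: rows above (the function name is hypothetical):

```python
def app_memory(used, free, buffers, cached):
    """Return (used_by_processes, available) in MB, as in free's '-/+ buffers/cache' row."""
    used_by_processes = used - buffers - cached
    available = free + buffers + cached
    return used_by_processes, available

# host 1, values copied from the Mem: rows
rows = {
    "11:15AM": dict(used=63670, free=707, buffers=71, cached=19091),
    "12:15PM": dict(used=64108, free=269, buffers=0, cached=8245),
}
for label, row in rows.items():
    in_use, avail = app_memory(**row)
    print(label, "used by processes:", in_use, "MB; available:", avail, "MB")
```

The computed values agree with the reported rows to within a couple of MB, since free -m rounds each field to whole megabytes. Going by the reported rows, process memory grew by 11356 MB on host 1 (44506 to 55862) and by 11289 MB on host 2 (32538 to 43827) between the two samples.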
Log rate: (/current)
=================
11:15AM
cat * | wc -l ; sleep 1m ; cat * | wc -l
22006062
23762376
diff=1,756,314/min
11:30AM
cat * | wc -l ; sleep 1m ; cat * | wc -l
7618833
9873332
diff=2,254,499/min
Bro failed log rotation at 11:40AM, so I had to manually rotate logs and
restart.
12:28PM:
cat * | wc -l ; sleep 1m ; cat * | wc -l
14526373
16633887
diff=2,107,514/min
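The measurement above counts lines across the current log directory twice, one minute apart, and takes the difference. A minimal Python sketch of the same idea follows; the function names are hypothetical, and unlike `cat *` it also counts hidden files. If a log rotation happens between the two counts, the difference is not meaningful.

```python
import os
import time

def count_lines(log_dir):
    """Count newline characters across all regular files in log_dir."""
    total = 0
    for name in os.listdir(log_dir):
        path = os.path.join(log_dir, name)
        if os.path.isfile(path):
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large logs are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    total += chunk.count(b"\n")
    return total

def log_rate(log_dir, interval_s=60.0):
    """Return lines written per second over interval_s seconds."""
    before = count_lines(log_dir)
    time.sleep(interval_s)
    after = count_lines(log_dir)
    return (after - before) / interval_s

# The three samples above as (before, after) line counts, one minute apart
samples = {
    "11:15AM": (22006062, 23762376),
    "11:30AM": (7618833, 9873332),
    "12:28PM": (14526373, 16633887),
}
for label, (before, after) in samples.items():
    per_min = after - before
    print(label, per_min, "lines/min,", round(per_min / 60), "lines/s")
```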
Regards,
Gary Faulkner
UW Madison
Office of Campus Information Security
608-262-8591
On 2/4/2014 9:01 AM, Justin Azoff wrote:
> On Mon, Feb 03, 2014 at 08:31:45PM -0600, Gary Faulkner wrote:
>> I've been thinking I may need more than 64G of RAM per node (16 core /
>> 3-5G traffic, & 12 workers each). I seem to run with 100% of the RAM
>> allocated, but 20-30% of my RAM cached before something happens to cause
>> a sudden drop in cached memory (as seen on Orca graphs) resulting in
>> OOM-killer dropping one or more Bro processes.
> You should be fine with those specs.. 12 workers should be using closer
> to 12G of ram, not anywhere near 64G.
>
> Can you post the output of
>
> free -m # on one of the worker nodes
> broctl top # on the manager
>
> and to get an idea of your msg log rate:
>
> cat bro/logs/current/* | wc -l ; sleep 1m ; cat bro/logs/current/* | wc -l
>
> Can you also share the memory graph from this system over time,
> particularly after a fresh restart of bro?
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: host1hourlymemuse4FEB2014.png
Type: image/png
Size: 6604 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20140204/3736c6e3/attachment.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: host1monthlymemuse4FEB2014.png
Type: image/png
Size: 7972 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20140204/3736c6e3/attachment-0001.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: host1weeklymemuse4FEB2014.png
Type: image/png
Size: 9472 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20140204/3736c6e3/attachment-0002.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: host1dailymemuse4FEB2014.png
Type: image/png
Size: 10299 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20140204/3736c6e3/attachment-0003.bin