[Zeek] Worker being "killed nohup"
Jeffry Lang
jefflang at vt.edu
Wed Nov 6 16:23:14 PST 2019
Can you expand on "tons of state"? It sounds like something I'm familiar
with, but I want to make sure. I work with Mark and we've been talking a
lot about this.
On Wed, Nov 6, 2019, 6:52 PM Michał Purzyński <michalpurzynski1 at gmail.com>
wrote:
> Let's dig into the root cause.
> Your OOM most likely happens because either
> - you have tons of state and hit something that Justin neatly described as
> "unbound state growth"
> - or your logger threads cannot write messages quickly enough
>
> Given the OOM output it looks to me like workers are to blame. Do you have
> any custom scripts? Do you use the stock scan.bro or any other SumStats
> scripts?
>
> Can you observe the system for a while, with something like htop, or top
> with threads, to see which processes consume the most memory, and report
> back?
>
>
> On Wed, Nov 6, 2019 at 11:13 AM Mark Gardner <mkg at vt.edu> wrote:
>
>> On Wed, Nov 6, 2019 at 1:13 PM Jeff Barber <jsbarber60 at gmail.com> wrote:
>>
>>> > I'm running Zeek 2.6.4 and I have been seeing occasional error
>>> messages of the form:
>>> >
>>> > run-bro: line 110: 42574 Killed nohup ${pin_command}
>>> $pin_cpu "$mybro" "$@"
>>> >
>>> > The workers seem to be restarted fine and other than the error
>>> message, I haven't noticed any ill behavior. What should I do about the
>>> error messages?
>>>
>>> I would check your syslog. Assuming you are running linux, if your
>>> system runs out of memory, the kernel will go find the biggest process
>>> and kill it. This can often be a zeek process as they tend to grow large as
>>> more connections are tracked (depending on many factors: what scripts you
>>> are running, what you're logging, what kind of traffic is being seen,
>>> etc.). If that's happening, you should see something in syslog containing
>>> the string "invoked oom-killer:". If you look at the surrounding lines,
>>> there should be some info on process sizes showing why it was selected.
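[For reference, a sketch of that search. On a systemd host the kernel
journal is the easiest place to look; the grep below is demonstrated
against a hypothetical sample line rather than a live log, since log paths
vary by distro:]

```shell
# On a systemd host, search the kernel journal directly:
#   journalctl -k | grep -B 2 -A 30 "invoked oom-killer"
# Plain syslog works too (/var/log/syslog on Debian/Ubuntu,
# /var/log/messages on RHEL-family). Demonstrated on a sample line:
printf '%s\n' 'Nov 06 10:07:05 sensor1 kernel: bro invoked oom-killer: gfp_mask=0x6280ca' > /tmp/sample_syslog
grep "invoked oom-killer" /tmp/sample_syslog
```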
>>>
>>
>> Good catch Jeff. A crash email arrived at 10:10 and the journal entry
>> starting around 10:07 contained the following:
>>
>> Nov 06 10:07:08 sensor1 kernel: oom_reaper: reaped process 43437 (bro),
>> now anon-rss:0kB, file-rss:1048576kB, shmem-rss:0kB
>> Nov 06 10:07:05 sensor1 kernel: Killed process 43437 (bro)
>> total-vm:67116888kB, anon-rss:57339476kB, file-rss:1042296kB, shmem-rss:0kB
>> Nov 06 10:07:05 sensor1 kernel: Out of memory: Kill process 43437 (bro)
>> score 444 or sacrifice child
>> Nov 06 10:07:05 sensor1 kernel: [ 54824] 47 54824 1516502 1221530
>> 11284480 80 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 54817] 47 54817 1661 56
>> 49152 16 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51543] 47 51543 1516502 1281056
>> 11886592 3616 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51542] 47 51542 1516502 1285841
>> 11964416 2296 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51539] 47 51539 1516502 1277957
>> 11890688 2817 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51537] 47 51537 1516502 1282107
>> 11829248 2543 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51533] 47 51533 1516502 1276417
>> 11939840 7591 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51532] 47 51532 1516502 1290264
>> 12042240 8006 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51523] 47 51523 1516502 1281160
>> 11894784 7479 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51521] 47 51521 1516502 1285423
>> 11984896 7957 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51520] 47 51520 1745878 1370574
>> 12668928 6908 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51501] 47 51501 1516502 1286290
>> 11788288 6967 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51499] 47 51499 1661 22
>> 49152 49 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51498] 47 51498 1516502 1276891
>> 11882496 7233 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51496] 47 51496 1516502 1287064
>> 11948032 7685 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51491] 47 51491 1661 18
>> 53248 55 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51487] 47 51487 1661 18
>> 57344 54 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51474] 47 51474 1661 21
>> 49152 52 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51471] 47 51471 1516502 1277820
>> 11890688 7964 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51465] 47 51465 1661 20
>> 53248 53 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51455] 47 51455 1661 23
>> 57344 49 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51440] 47 51440 1661 18
>> 49152 54 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51438] 47 51438 1661 23
>> 49152 49 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51428] 47 51428 1661 18
>> 49152 54 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51414] 47 51414 1661 16
>> 53248 56 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51405] 47 51405 1661 20
>> 53248 52 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51397] 47 51397 1661 14
>> 45056 59 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 51375] 47 51375 1661 15
>> 49152 58 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 43437] 47 43437 16779222 14595446
>> 122212352 129972 0 bro
>> Nov 06 10:07:05 sensor1 kernel: [ 43343] 47 43343 1661 2
>> 49152 70 0 run-bro
>> Nov 06 10:07:05 sensor1 kernel: [ 43183] 47 43183 42625 4
>> 106496 608 0 (sd-pam)
>> Nov 06 10:07:05 sensor1 kernel: [ 43181] 47 43181 5290 0
>> 77824 342 0 systemd
>> Nov 06 10:07:05 sensor1 kernel: [ 996] 0 996 1346 0
>> 49152 33 0 agetty
>> Nov 06 10:07:05 sensor1 kernel: [ 995] 0 995 1403 0
>> 45056 32 0 agetty
>> Nov 06 10:07:05 sensor1 kernel: [ 994] 0 994 3963 38
>> 69632 178 -1000 sshd
>> Nov 06 10:07:05 sensor1 kernel: [ 991] 0 991 2119 20
>> 53248 35 0 cron
>> Nov 06 10:07:05 sensor1 kernel: [ 990] 105 990 2210 97
>> 57344 55 -900 dbus-daemon
>> Nov 06 10:07:05 sensor1 kernel: [ 977] 0 977 56456 90
>> 86016 378 0 rsyslogd
>> Nov 06 10:07:05 sensor1 kernel: [ 968] 0 968 4875 78
>> 77824 189 0 systemd-logind
>> Nov 06 10:07:05 sensor1 kernel: [ 944] 0 944 2021 1
>> 57344 799 0 haveged
>> Nov 06 10:07:05 sensor1 kernel: [ 931] 100 931 23270 26
>> 86016 207 0 systemd-timesyn
>> Nov 06 10:07:05 sensor1 kernel: [ 640] 0 640 5579 14
>> 69632 309 -1000 systemd-udevd
>> Nov 06 10:07:05 sensor1 kernel: [ 615] 0 615 74225 2624
>> 638976 568 0 systemd-journal
>> Nov 06 10:07:05 sensor1 kernel: [ pid ] uid tgid total_vm rss
>> pgtables_bytes swapents oom_score_adj name
>> Nov 06 10:07:05 sensor1 kernel: Tasks state (memory values in pages):
>> Nov 06 10:07:05 sensor1 kernel: 0 pages hwpoisoned
>> Nov 06 10:07:05 sensor1 kernel: 551979 pages reserved
>> Nov 06 10:07:05 sensor1 kernel: 0 pages HighMem/MovableOnly
>> Nov 06 10:07:05 sensor1 kernel: 33528566 pages RAM
>> Nov 06 10:07:05 sensor1 kernel: Total swap = 976892kB
>> Nov 06 10:07:05 sensor1 kernel: Free swap = 0kB
>> Nov 06 10:07:05 sensor1 kernel: Swap cache stats: add 759161, delete
>> 711025, find 25951/48832
>> Nov 06 10:07:05 sensor1 kernel: 48128 pages in swap cache
>> Nov 06 10:07:05 sensor1 kernel: 52774 total pagecache pages
>> Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0 hugepages_free=0
>> hugepages_surp=0 hugepages_size=2048kB
>> Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0 hugepages_free=0
>> hugepages_surp=0 hugepages_size=1048576kB
>> Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0 hugepages_free=0
>> hugepages_surp=0 hugepages_size=2048kB
>> Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0 hugepages_free=0
>> hugepages_surp=0 hugepages_size=1048576kB
>> Nov 06 10:07:05 sensor1 kernel: Node 3 Normal: 73*4kB (U) 150*8kB (UH)
>> 59*16kB (UH) 1325*32kB (UH) 5*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB
>> 0*2048kB 0*4096kB = 45412kB
>> Nov 06 10:07:05 sensor1 kernel: Node 1 Normal: 78*4kB (UMEH) 213*8kB
>> (UEH) 70*16kB (UH) 1232*32kB (UMH) 7*64kB (H) 0*128kB 0*256kB 0*512kB
>> 0*1024kB 0*2048kB 0*4096kB = 43008kB
>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32: 697*4kB (UME) 647*8kB (UME)
>> 392*16kB (UME) 190*32kB (UME) 94*64kB (UE) 51*128kB (UE) 63*256kB (UME)
>> 240*512kB (UME) 72*1024kB (UME) 0*2048kB 0*4096kB = 245596kB
>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA: 1*4kB (U) 1*8kB (U) 2*16kB
>> (U) 1*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB
>> (M) 3*4096kB (M) = 15884kB
>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
>> Nov 06 10:07:05 sensor1 kernel: Node 3 Normal free:44844kB min:45116kB
>> low:111144kB high:177172kB active_anon:48629496kB inactive_anon:3250728kB
>> active_file:64kB inactive_file:668kB unevictable:0kB writepending:0kB
>> present:67094528kB managed:66031300kB mlocked:0kB kernel_stack:5352kB
>> pagetables:146976kB bounce
>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
>> Nov 06 10:07:05 sensor1 kernel: Node 1 Normal free:41932kB min:42604kB
>> low:104956kB high:167308kB active_anon:55633904kB inactive_anon:4082748kB
>> active_file:560kB inactive_file:0kB unevictable:0kB writepending:0kB
>> present:63438848kB managed:62359848kB mlocked:0kB kernel_stack:4776kB
>> pagetables:133008kB bounce:
>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 60894 60894 60894
>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32 free:245356kB min:2372kB
>> low:5844kB high:9316kB active_anon:3199852kB inactive_anon:44556kB
>> active_file:100kB inactive_file:0kB unevictable:0kB writepending:0kB
>> present:3564892kB managed:3499316kB mlocked:0kB kernel_stack:48kB
>> pagetables:3092kB bounce:0kB free_pcp:4
>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 3392 64286 64286 64286
>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA free:15884kB min:8kB low:20kB
>> high:32kB active_anon:0kB inactive_anon:0kB active_file:0kB
>> inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB
>> managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB
>> free_pcp:0kB local_pcp:0kB free_cma:0kB
>> Nov 06 10:07:05 sensor1 kernel: Node 3 active_anon:48629496kB
>> inactive_anon:3250728kB active_file:108kB inactive_file:700kB
>> unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:13636088kB
>> dirty:0kB writeback:0kB shmem:9996kB shmem_thp: 0kB shmem_pmdmapped: 0kB
>> anon_thp: 1337344kB writeback_tmp:0kB unst
>> Nov 06 10:07:05 sensor1 kernel: Node 1 active_anon:58833756kB
>> inactive_anon:4127304kB active_file:548kB inactive_file:0kB unevictable:0kB
>> isolated(anon):0kB isolated(file):0kB mapped:2101152kB dirty:0kB
>> writeback:0kB shmem:7828kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp:
>> 13608960kB writeback_tmp:0kB unstab
>> Nov 06 10:07:05 sensor1 kernel: active_anon:26865813
>> inactive_anon:1844508 isolated_anon:0
>> active_file:163 inactive_file:0
>> isolated_file:0
>> unevictable:0 dirty:0 writeback:0
>> unstable:0
>> slab_reclaimable:25057
>> slab_unreclaimable:65745
>> mapped:3934289 shmem:4456
>> pagetables:70769 bounce:0
>> free:87073 free_pcp:3080 free_cma:0
>> Nov 06 10:07:05 sensor1 kernel: Mem-Info:
>> Nov 06 10:07:05 sensor1 kernel: R13: 0000000000001800 R14:
>> 00007f375bd4f780 R15: 0000000000000022
>> Nov 06 10:07:05 sensor1 kernel: R10: 00007f3618971000 R11:
>> 00007f36189727e0 R12: 00007f375bd4f940
>> Nov 06 10:07:05 sensor1 kernel: RBP: 00007f361424b000 R08:
>> 0000000000000000 R09: 00007f36189727e0
>> Nov 06 10:07:05 sensor1 kernel: RDX: 0000000000001780 RSI:
>> 00007f361424b000 RDI: 00007f3618971000
>> Nov 06 10:07:05 sensor1 kernel: RAX: 00007f3618971000 RBX:
>> 0000000000001800 RCX: 00007f361424c760
>> Nov 06 10:07:05 sensor1 kernel: RSP: 002b:00007ffd609fbc58 EFLAGS:
>> 00010202
>> Nov 06 10:07:05 sensor1 kernel: Code: Bad RIP value.
>> Nov 06 10:07:05 sensor1 kernel: RIP: 0033:0x7f375bf1f925
>> Nov 06 10:07:05 sensor1 kernel: page_fault+0x1e/0x30
>> Nov 06 10:07:05 sensor1 kernel: ? page_fault+0x8/0x30
>> Nov 06 10:07:05 sensor1 kernel: __do_page_fault+0x249/0x4f0
>> Nov 06 10:07:05 sensor1 kernel: handle_mm_fault+0xd6/0x200
>> Nov 06 10:07:05 sensor1 kernel: ? __switch_to_asm+0x41/0x70
>> Nov 06 10:07:05 sensor1 kernel: __handle_mm_fault+0x958/0x1270
>> Nov 06 10:07:05 sensor1 kernel: alloc_pages_vma+0x74/0x1c0
>> Nov 06 10:07:05 sensor1 kernel: __alloc_pages_nodemask+0x28b/0x2b0
>> Nov 06 10:07:05 sensor1 kernel: __alloc_pages_slowpath+0xbd8/0xcb0
>> Nov 06 10:07:05 sensor1 kernel: out_of_memory+0x1a5/0x430
>> Nov 06 10:07:05 sensor1 kernel: oom_kill_process.cold.30+0xb/0x1cf
>> Nov 06 10:07:05 sensor1 kernel: dump_header+0x6b/0x283
>> Nov 06 10:07:05 sensor1 kernel: dump_stack+0x5c/0x80
>> Nov 06 10:07:05 sensor1 kernel: Call Trace:
>> Nov 06 10:07:05 sensor1 kernel: Hardware name: Supermicro Super
>> Server/H11SSL-i, BIOS 1.0b 04/27/2018
>> Nov 06 10:07:05 sensor1 kernel: CPU: 3 PID: 51496 Comm: bro Not tainted
>> 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u1
>> Nov 06 10:07:05 sensor1 kernel: bro cpuset=/ mems_allowed=1,3
>> Nov 06 10:07:05 sensor1 kernel: bro invoked oom-killer:
>> gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),
>> order=0, oom_score_adj=0
>>
>> It looks like bro was using about 64 GB of RAM (summing the per-process
>> RSS values), even though the system has 128 GB installed. (It also looks
>> like there was no free swap.) Searching the Internet for ways to handle
>> OOM, I found suggestions to turn off over-commit and to configure the
>> overcommit ratio a bit higher. Currently, the values are:
>>
>> # sysctl vm.overcommit_memory
>> vm.overcommit_memory = 0
>> # sysctl vm.overcommit_ratio
>> vm.overcommit_ratio = 50
>>
>> In one case the suggestion was to change those to vm.overcommit_memory =
>> 1 and vm.overcommit_ratio = 80 which seems reasonable based on the
>> description at
>> https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
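[A sketch of reading and changing those knobs on a Linux host; the lines
that actually change system behavior are left commented out:]

```shell
# Inspect the current policy (0 = heuristic overcommit, 1 = always allow,
# 2 = strict accounting against swap + overcommit_ratio% of RAM):
cat /proc/sys/vm/overcommit_memory
cat /proc/sys/vm/overcommit_ratio
# To apply the suggested change at runtime (root required, not persistent):
#   sysctl -w vm.overcommit_memory=1
# To persist across reboots, drop it into sysctl.d:
#   echo 'vm.overcommit_memory = 1' > /etc/sysctl.d/90-overcommit.conf
```

One caveat: vm.overcommit_ratio only has an effect in mode 2 (strict
accounting); with vm.overcommit_memory = 1 the ratio is ignored, so
setting both as suggested is harmless but the ratio change does nothing.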
>>
>> Or is there a better way to handle OOM?
>>
>> Mark
>> _______________________________________________
>> Zeek mailing list
>> zeek at zeek.org
>> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/zeek
>