[Zeek] Worker being "killed nohup"
Michał Purzyński
michalpurzynski1 at gmail.com
Wed Nov 6 19:21:39 PST 2019
BTW, is your Zeek built with tcmalloc? I've heard horror stories about that lately.
I just use stock glibc malloc or jemalloc.
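A quick way to check which allocator a binary ended up with is to look at its shared-library dependencies. This is just a sketch; the binary path below is an assumption, so point it at your own install (e.g. the output of `which bro` or `which zeek`):

```shell
# Sketch: report which allocator a Zeek/Bro binary is linked against.
# The default path is an assumption; override via ZEEK_BIN for your install.
bin="${ZEEK_BIN:-/usr/local/bro/bin/bro}"
if ldd "$bin" 2>/dev/null | grep -qE 'tcmalloc|jemalloc'; then
  ldd "$bin" | grep -E 'tcmalloc|jemalloc'
else
  echo "no tcmalloc/jemalloc linked; likely stock glibc malloc (or no such binary)"
fi
```

Statically linked allocators won't show up in `ldd` output, so treat a negative result as "probably glibc" rather than proof.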
On Wed, Nov 6, 2019 at 4:23 PM Jeffry Lang <jefflang at vt.edu> wrote:
> Can you expand on "tons of state"? It sounds like something I'm familiar
> with but want to make sure. I work with Mark and we've been talking a lot
> about this.
>
> On Wed, Nov 6, 2019, 6:52 PM Michał Purzyński <michalpurzynski1 at gmail.com>
> wrote:
>
>> Let's dig into the root cause.
>> Your OOM most likely happens because either
>> - you have tons of state and hit something that Justin neatly described
>> as "unbound state growth"
>> - or your logger threads cannot write messages quickly enough
>>
>> Given the OOM output it looks to me like workers are to blame. Do you
>> have any custom scripts? Do you use the stock scan.bro or any other
>> SumStats scripts?
>>
>> Can you observe the system for a while, in something like htop, or top
>> with threads, to see what kind of processes eat most of memory and report
>> back?
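For a quick non-interactive snapshot (easier to paste into a reply than htop), something like this works on Linux with procps `ps`:

```shell
# Snapshot the ten biggest processes by resident memory (RSS, in KiB).
# Run it repeatedly under load (e.g. via `watch -n 5`) to see which
# workers keep growing rather than plateauing.
ps -eo rss,pid,comm --sort=-rss | head -n 10
```

A worker whose RSS climbs steadily and never levels off is the usual signature of unbounded state growth.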
>>
>>
>> On Wed, Nov 6, 2019 at 11:13 AM Mark Gardner <mkg at vt.edu> wrote:
>>
>>> On Wed, Nov 6, 2019 at 1:13 PM Jeff Barber <jsbarber60 at gmail.com> wrote:
>>>
>>>> > I'm running Zeek 2.6.4 and I have been seeing occasional error
>>>> messages of the form:
>>>> >
>>>> > run-bro: line 110: 42574 Killed nohup
>>>> ${pin_command} $pin_cpu "$mybro" "$@"
>>>> >
>>>> > The workers seem to be restarted fine and other than the error
>>>> message, I haven't noticed any ill behavior. What should I do about the
>>>> error messages?
>>>>
>>>> I would check your syslog. Assuming you are running linux, if your
>>>> system runs out of memory, the kernel will go find the biggest process
>>>> and kill it. This can often be a zeek process as they tend to grow large as
>>>> more connections are tracked (depending on many factors: what scripts you
>>>> are running, what you're logging, what kind of traffic is being seen,
>>>> etc.). If that's happening, you should see something in syslog containing
>>>> the string "invoked oom-killer:". If you look at the surrounding lines,
>>>> there should be some info on process sizes showing why it was selected.
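On a systemd box, something along these lines pulls the whole OOM report out of the kernel log (the exact context-line count is a guess; the report is usually a few dozen lines). To keep the example self-contained, the same grep is demonstrated here against a captured sample line rather than a live journal:

```shell
# On a live system you would run something like:
#   journalctl -k | grep -A 40 'invoked oom-killer:'
# Self-contained demonstration of the match against a sample log line:
sample='Nov 06 10:07:05 sensor1 kernel: bro invoked oom-killer: gfp_mask=0x6280ca'
printf '%s\n' "$sample" | grep 'invoked oom-killer:'
```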
>>>>
>>>
>>> Good catch Jeff. A crash email arrived at 10:10 and the journal entry
>>> starting around 10:07 contained the following:
>>>
>>> Nov 06 10:07:08 sensor1 kernel: oom_reaper: reaped process 43437 (bro),
>>> now anon-rss:0kB, file-rss:1048576kB, shmem-rss:0kB
>>> Nov 06 10:07:05 sensor1 kernel: Killed process 43437 (bro)
>>> total-vm:67116888kB, anon-rss:57339476kB, file-rss:1042296kB, shmem-rss:0kB
>>> Nov 06 10:07:05 sensor1 kernel: Out of memory: Kill process 43437 (bro)
>>> score 444 or sacrifice child
>>> Nov 06 10:07:05 sensor1 kernel: [ 54824] 47 54824 1516502 1221530
>>> 11284480 80 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 54817] 47 54817 1661 56
>>> 49152 16 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51543] 47 51543 1516502 1281056
>>> 11886592 3616 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51542] 47 51542 1516502 1285841
>>> 11964416 2296 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51539] 47 51539 1516502 1277957
>>> 11890688 2817 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51537] 47 51537 1516502 1282107
>>> 11829248 2543 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51533] 47 51533 1516502 1276417
>>> 11939840 7591 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51532] 47 51532 1516502 1290264
>>> 12042240 8006 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51523] 47 51523 1516502 1281160
>>> 11894784 7479 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51521] 47 51521 1516502 1285423
>>> 11984896 7957 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51520] 47 51520 1745878 1370574
>>> 12668928 6908 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51501] 47 51501 1516502 1286290
>>> 11788288 6967 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51499] 47 51499 1661 22
>>> 49152 49 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51498] 47 51498 1516502 1276891
>>> 11882496 7233 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51496] 47 51496 1516502 1287064
>>> 11948032 7685 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51491] 47 51491 1661 18
>>> 53248 55 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51487] 47 51487 1661 18
>>> 57344 54 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51474] 47 51474 1661 21
>>> 49152 52 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51471] 47 51471 1516502 1277820
>>> 11890688 7964 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51465] 47 51465 1661 20
>>> 53248 53 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51455] 47 51455 1661 23
>>> 57344 49 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51440] 47 51440 1661 18
>>> 49152 54 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51438] 47 51438 1661 23
>>> 49152 49 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51428] 47 51428 1661 18
>>> 49152 54 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51414] 47 51414 1661 16
>>> 53248 56 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51405] 47 51405 1661 20
>>> 53248 52 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51397] 47 51397 1661 14
>>> 45056 59 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51375] 47 51375 1661 15
>>> 49152 58 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 43437] 47 43437 16779222 14595446
>>> 122212352 129972 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 43343] 47 43343 1661 2
>>> 49152 70 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 43183] 47 43183 42625 4
>>> 106496 608 0 (sd-pam)
>>> Nov 06 10:07:05 sensor1 kernel: [ 43181] 47 43181 5290 0
>>> 77824 342 0 systemd
>>> Nov 06 10:07:05 sensor1 kernel: [ 996] 0 996 1346 0
>>> 49152 33 0 agetty
>>> Nov 06 10:07:05 sensor1 kernel: [ 995] 0 995 1403 0
>>> 45056 32 0 agetty
>>> Nov 06 10:07:05 sensor1 kernel: [ 994] 0 994 3963 38
>>> 69632 178 -1000 sshd
>>> Nov 06 10:07:05 sensor1 kernel: [ 991] 0 991 2119 20
>>> 53248 35 0 cron
>>> Nov 06 10:07:05 sensor1 kernel: [ 990] 105 990 2210 97
>>> 57344 55 -900 dbus-daemon
>>> Nov 06 10:07:05 sensor1 kernel: [ 977] 0 977 56456 90
>>> 86016 378 0 rsyslogd
>>> Nov 06 10:07:05 sensor1 kernel: [ 968] 0 968 4875 78
>>> 77824 189 0 systemd-logind
>>> Nov 06 10:07:05 sensor1 kernel: [ 944] 0 944 2021 1
>>> 57344 799 0 haveged
>>> Nov 06 10:07:05 sensor1 kernel: [ 931] 100 931 23270 26
>>> 86016 207 0 systemd-timesyn
>>> Nov 06 10:07:05 sensor1 kernel: [ 640] 0 640 5579 14
>>> 69632 309 -1000 systemd-udevd
>>> Nov 06 10:07:05 sensor1 kernel: [ 615] 0 615 74225 2624
>>> 638976 568 0 systemd-journal
>>> Nov 06 10:07:05 sensor1 kernel: [ pid ] uid tgid total_vm rss
>>> pgtables_bytes swapents oom_score_adj name
>>> Nov 06 10:07:05 sensor1 kernel: Tasks state (memory values in pages):
>>> Nov 06 10:07:05 sensor1 kernel: 0 pages hwpoisoned
>>> Nov 06 10:07:05 sensor1 kernel: 551979 pages reserved
>>> Nov 06 10:07:05 sensor1 kernel: 0 pages HighMem/MovableOnly
>>> Nov 06 10:07:05 sensor1 kernel: 33528566 pages RAM
>>> Nov 06 10:07:05 sensor1 kernel: Total swap = 976892kB
>>> Nov 06 10:07:05 sensor1 kernel: Free swap = 0kB
>>> Nov 06 10:07:05 sensor1 kernel: Swap cache stats: add 759161, delete
>>> 711025, find 25951/48832
>>> Nov 06 10:07:05 sensor1 kernel: 48128 pages in swap cache
>>> Nov 06 10:07:05 sensor1 kernel: 52774 total pagecache pages
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0
>>> hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0
>>> hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0
>>> hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0
>>> hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 Normal: 73*4kB (U) 150*8kB (UH)
>>> 59*16kB (UH) 1325*32kB (UH) 5*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB
>>> 0*2048kB 0*4096kB = 45412kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 Normal: 78*4kB (UMEH) 213*8kB
>>> (UEH) 70*16kB (UH) 1232*32kB (UMH) 7*64kB (H) 0*128kB 0*256kB 0*512kB
>>> 0*1024kB 0*2048kB 0*4096kB = 43008kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32: 697*4kB (UME) 647*8kB
>>> (UME) 392*16kB (UME) 190*32kB (UME) 94*64kB (UE) 51*128kB (UE) 63*256kB
>>> (UME) 240*512kB (UME) 72*1024kB (UME) 0*2048kB 0*4096kB = 245596kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA: 1*4kB (U) 1*8kB (U) 2*16kB
>>> (U) 1*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB
>>> (M) 3*4096kB (M) = 15884kB
>>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 Normal free:44844kB min:45116kB
>>> low:111144kB high:177172kB active_anon:48629496kB inactive_anon:3250728kB
>>> active_file:64kB inactive_file:668kB unevictable:0kB writepending:0kB
>>> present:67094528kB managed:66031300kB mlocked:0kB kernel_stack:5352kB
>>> pagetables:146976kB bounce
>>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 Normal free:41932kB min:42604kB
>>> low:104956kB high:167308kB active_anon:55633904kB inactive_anon:4082748kB
>>> active_file:560kB inactive_file:0kB unevictable:0kB writepending:0kB
>>> present:63438848kB managed:62359848kB mlocked:0kB kernel_stack:4776kB
>>> pagetables:133008kB bounce:
>>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 60894 60894 60894
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32 free:245356kB min:2372kB
>>> low:5844kB high:9316kB active_anon:3199852kB inactive_anon:44556kB
>>> active_file:100kB inactive_file:0kB unevictable:0kB writepending:0kB
>>> present:3564892kB managed:3499316kB mlocked:0kB kernel_stack:48kB
>>> pagetables:3092kB bounce:0kB free_pcp:4
>>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 3392 64286 64286
>>> 64286
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA free:15884kB min:8kB low:20kB
>>> high:32kB active_anon:0kB inactive_anon:0kB active_file:0kB
>>> inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB
>>> managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB
>>> free_pcp:0kB local_pcp:0kB free_cma:0kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 active_anon:48629496kB
>>> inactive_anon:3250728kB active_file:108kB inactive_file:700kB
>>> unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:13636088kB
>>> dirty:0kB writeback:0kB shmem:9996kB shmem_thp: 0kB shmem_pmdmapped: 0kB
>>> anon_thp: 1337344kB writeback_tmp:0kB unst
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 active_anon:58833756kB
>>> inactive_anon:4127304kB active_file:548kB inactive_file:0kB unevictable:0kB
>>> isolated(anon):0kB isolated(file):0kB mapped:2101152kB dirty:0kB
>>> writeback:0kB shmem:7828kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp:
>>> 13608960kB writeback_tmp:0kB unstab
>>> Nov 06 10:07:05 sensor1 kernel: active_anon:26865813
>>> inactive_anon:1844508 isolated_anon:0
>>> active_file:163 inactive_file:0
>>> isolated_file:0
>>> unevictable:0 dirty:0 writeback:0
>>> unstable:0
>>> slab_reclaimable:25057
>>> slab_unreclaimable:65745
>>> mapped:3934289 shmem:4456
>>> pagetables:70769 bounce:0
>>> free:87073 free_pcp:3080 free_cma:0
>>> Nov 06 10:07:05 sensor1 kernel: Mem-Info:
>>> Nov 06 10:07:05 sensor1 kernel: R13: 0000000000001800 R14:
>>> 00007f375bd4f780 R15: 0000000000000022
>>> Nov 06 10:07:05 sensor1 kernel: R10: 00007f3618971000 R11:
>>> 00007f36189727e0 R12: 00007f375bd4f940
>>> Nov 06 10:07:05 sensor1 kernel: RBP: 00007f361424b000 R08:
>>> 0000000000000000 R09: 00007f36189727e0
>>> Nov 06 10:07:05 sensor1 kernel: RDX: 0000000000001780 RSI:
>>> 00007f361424b000 RDI: 00007f3618971000
>>> Nov 06 10:07:05 sensor1 kernel: RAX: 00007f3618971000 RBX:
>>> 0000000000001800 RCX: 00007f361424c760
>>> Nov 06 10:07:05 sensor1 kernel: RSP: 002b:00007ffd609fbc58 EFLAGS:
>>> 00010202
>>> Nov 06 10:07:05 sensor1 kernel: Code: Bad RIP value.
>>> Nov 06 10:07:05 sensor1 kernel: RIP: 0033:0x7f375bf1f925
>>> Nov 06 10:07:05 sensor1 kernel: page_fault+0x1e/0x30
>>> Nov 06 10:07:05 sensor1 kernel: ? page_fault+0x8/0x30
>>> Nov 06 10:07:05 sensor1 kernel: __do_page_fault+0x249/0x4f0
>>> Nov 06 10:07:05 sensor1 kernel: handle_mm_fault+0xd6/0x200
>>> Nov 06 10:07:05 sensor1 kernel: ? __switch_to_asm+0x41/0x70
>>> Nov 06 10:07:05 sensor1 kernel: __handle_mm_fault+0x958/0x1270
>>> Nov 06 10:07:05 sensor1 kernel: alloc_pages_vma+0x74/0x1c0
>>> Nov 06 10:07:05 sensor1 kernel: __alloc_pages_nodemask+0x28b/0x2b0
>>> Nov 06 10:07:05 sensor1 kernel: __alloc_pages_slowpath+0xbd8/0xcb0
>>> Nov 06 10:07:05 sensor1 kernel: out_of_memory+0x1a5/0x430
>>> Nov 06 10:07:05 sensor1 kernel: oom_kill_process.cold.30+0xb/0x1cf
>>> Nov 06 10:07:05 sensor1 kernel: dump_header+0x6b/0x283
>>> Nov 06 10:07:05 sensor1 kernel: dump_stack+0x5c/0x80
>>> Nov 06 10:07:05 sensor1 kernel: Call Trace:
>>> Nov 06 10:07:05 sensor1 kernel: Hardware name: Supermicro Super
>>> Server/H11SSL-i, BIOS 1.0b 04/27/2018
>>> Nov 06 10:07:05 sensor1 kernel: CPU: 3 PID: 51496 Comm: bro Not tainted
>>> 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u1
>>> Nov 06 10:07:05 sensor1 kernel: bro cpuset=/ mems_allowed=1,3
>>> Nov 06 10:07:05 sensor1 kernel: bro invoked oom-killer:
>>> gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),
>>> order=0, oom_score_adj=0
>>>
>>> It looks like bro was using about 64 GB of RAM (the killed process's
>>> total-vm of 67116888 kB), even though the system has 128 GB installed.
>>> (It also looks like there was no free swap.)
>>> Searching the Internet for ways to handle OOM, people suggested turning
>>> off over-commit and setting the overcommit ratio a bit higher. Currently,
>>> the values are:
>>>
>>> # sysctl vm.overcommit_memory
>>> vm.overcommit_memory = 0
>>> # sysctl vm.overcommit_ratio
>>> vm.overcommit_ratio = 50
>>>
>>> In one case the suggestion was to change those to vm.overcommit_memory
>>> = 1 and vm.overcommit_ratio = 80, which seems reasonable based on the
>>> description at
>>> https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
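One caveat worth flagging, going by that same overcommit-accounting document: vm.overcommit_ratio only takes effect under vm.overcommit_memory=2 (strict accounting), where the kernel enforces CommitLimit = swap + ratio% of RAM. Mode 1 never refuses an allocation, so the OOM killer can still fire, and the suggested memory=1/ratio=80 combination wouldn't actually cap anything. A quick sketch of what strict mode would allow on this box, using the swap figure from the OOM report:

```shell
# Worked numbers for this sensor: 128 GB RAM, ~1 GB swap ("Total swap"
# in the OOM report). CommitLimit applies only in overcommit mode 2.
ram_kib=$((128 * 1024 * 1024))
swap_kib=976892
for ratio in 50 80; do
  limit=$(( swap_kib + ram_kib * ratio / 100 ))
  echo "ratio=${ratio}: CommitLimit = $(( limit / 1024 / 1024 )) GiB"
done
```

So strict mode at the default ratio of 50 would cap commit at roughly 64 GiB, and ratio=80 at roughly 103 GiB; whether refusing allocations up front is better than letting the OOM killer pick a victim depends on how Zeek handles a failed allocation.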
>>>
>>> Or is there a better way to handle OOM?
>>>
>>> Mark
>>> _______________________________________________
>>> Zeek mailing list
>>> zeek at zeek.org
>>> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/zeek
>>
>
>