[Zeek] Worker being "killed nohup"
Mark Gardner
mkg at vt.edu
Wed Nov 6 11:10:06 PST 2019
On Wed, Nov 6, 2019 at 1:13 PM Jeff Barber <jsbarber60 at gmail.com> wrote:
> > I'm running Zeek 2.6.4 and I have been seeing occasional error messages
> of the form:
> >
> > run-bro: line 110: 42574 Killed nohup ${pin_command}
> $pin_cpu "$mybro" "$@"
> >
> > The workers seem to be restarted fine and other than the error message,
> I haven't noticed any ill behavior. What should I do about the error
> messages?
>
> I would check your syslog. Assuming you are running linux, if your system
> runs out of memory, the kernel will go find the biggest process and kill
> it. This can often be a zeek process as they tend to grow large as more
> connections are tracked (depending on many factors: what scripts you are
> running, what you're logging, what kind of traffic is being seen, etc.). If
> that's happening, you should see something in syslog containing the string "invoked
> oom-killer:" If you look at the surrounding lines, there should be some
> info on process sizes showing why it was selected.
>
Good catch Jeff. A crash email arrived at 10:10 and the journal entry
starting around 10:07 contained the following:
Nov 06 10:07:08 sensor1 kernel: oom_reaper: reaped process 43437 (bro), now
anon-rss:0kB, file-rss:1048576kB, shmem-rss:0kB
Nov 06 10:07:05 sensor1 kernel: Killed process 43437 (bro)
total-vm:67116888kB, anon-rss:57339476kB, file-rss:1042296kB, shmem-rss:0kB
Nov 06 10:07:05 sensor1 kernel: Out of memory: Kill process 43437 (bro)
score 444 or sacrifice child
Nov 06 10:07:05 sensor1 kernel: [ 54824] 47 54824 1516502 1221530
11284480 80 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 54817] 47 54817 1661 56
49152 16 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51543] 47 51543 1516502 1281056
11886592 3616 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51542] 47 51542 1516502 1285841
11964416 2296 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51539] 47 51539 1516502 1277957
11890688 2817 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51537] 47 51537 1516502 1282107
11829248 2543 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51533] 47 51533 1516502 1276417
11939840 7591 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51532] 47 51532 1516502 1290264
12042240 8006 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51523] 47 51523 1516502 1281160
11894784 7479 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51521] 47 51521 1516502 1285423
11984896 7957 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51520] 47 51520 1745878 1370574
12668928 6908 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51501] 47 51501 1516502 1286290
11788288 6967 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51499] 47 51499 1661 22
49152 49 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51498] 47 51498 1516502 1276891
11882496 7233 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51496] 47 51496 1516502 1287064
11948032 7685 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51491] 47 51491 1661 18
53248 55 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51487] 47 51487 1661 18
57344 54 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51474] 47 51474 1661 21
49152 52 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51471] 47 51471 1516502 1277820
11890688 7964 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 51465] 47 51465 1661 20
53248 53 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51455] 47 51455 1661 23
57344 49 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51440] 47 51440 1661 18
49152 54 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51438] 47 51438 1661 23
49152 49 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51428] 47 51428 1661 18
49152 54 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51414] 47 51414 1661 16
53248 56 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51405] 47 51405 1661 20
53248 52 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51397] 47 51397 1661 14
45056 59 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 51375] 47 51375 1661 15
49152 58 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 43437] 47 43437 16779222 14595446
122212352 129972 0 bro
Nov 06 10:07:05 sensor1 kernel: [ 43343] 47 43343 1661 2
49152 70 0 run-bro
Nov 06 10:07:05 sensor1 kernel: [ 43183] 47 43183 42625 4
106496 608 0 (sd-pam)
Nov 06 10:07:05 sensor1 kernel: [ 43181] 47 43181 5290 0
77824 342 0 systemd
Nov 06 10:07:05 sensor1 kernel: [ 996] 0 996 1346 0
49152 33 0 agetty
Nov 06 10:07:05 sensor1 kernel: [ 995] 0 995 1403 0
45056 32 0 agetty
Nov 06 10:07:05 sensor1 kernel: [ 994] 0 994 3963 38
69632 178 -1000 sshd
Nov 06 10:07:05 sensor1 kernel: [ 991] 0 991 2119 20
53248 35 0 cron
Nov 06 10:07:05 sensor1 kernel: [ 990] 105 990 2210 97
57344 55 -900 dbus-daemon
Nov 06 10:07:05 sensor1 kernel: [ 977] 0 977 56456 90
86016 378 0 rsyslogd
Nov 06 10:07:05 sensor1 kernel: [ 968] 0 968 4875 78
77824 189 0 systemd-logind
Nov 06 10:07:05 sensor1 kernel: [ 944] 0 944 2021 1
57344 799 0 haveged
Nov 06 10:07:05 sensor1 kernel: [ 931] 100 931 23270 26
86016 207 0 systemd-timesyn
Nov 06 10:07:05 sensor1 kernel: [ 640] 0 640 5579 14
69632 309 -1000 systemd-udevd
Nov 06 10:07:05 sensor1 kernel: [ 615] 0 615 74225 2624
638976 568 0 systemd-journal
Nov 06 10:07:05 sensor1 kernel: [ pid ] uid tgid total_vm rss
pgtables_bytes swapents oom_score_adj name
Nov 06 10:07:05 sensor1 kernel: Tasks state (memory values in pages):
Nov 06 10:07:05 sensor1 kernel: 0 pages hwpoisoned
Nov 06 10:07:05 sensor1 kernel: 551979 pages reserved
Nov 06 10:07:05 sensor1 kernel: 0 pages HighMem/MovableOnly
Nov 06 10:07:05 sensor1 kernel: 33528566 pages RAM
Nov 06 10:07:05 sensor1 kernel: Total swap = 976892kB
Nov 06 10:07:05 sensor1 kernel: Free swap = 0kB
Nov 06 10:07:05 sensor1 kernel: Swap cache stats: add 759161, delete
711025, find 25951/48832
Nov 06 10:07:05 sensor1 kernel: 48128 pages in swap cache
Nov 06 10:07:05 sensor1 kernel: 52774 total pagecache pages
Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=1048576kB
Nov 06 10:07:05 sensor1 kernel: Node 3 Normal: 73*4kB (U) 150*8kB (UH)
59*16kB (UH) 1325*32kB (UH) 5*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 45412kB
Nov 06 10:07:05 sensor1 kernel: Node 1 Normal: 78*4kB (UMEH) 213*8kB (UEH)
70*16kB (UH) 1232*32kB (UMH) 7*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB
0*2048kB 0*4096kB = 43008kB
Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32: 697*4kB (UME) 647*8kB (UME)
392*16kB (UME) 190*32kB (UME) 94*64kB (UE) 51*128kB (UE) 63*256kB (UME)
240*512kB (UME) 72*1024kB (UME) 0*2048kB 0*4096kB = 245596kB
Nov 06 10:07:05 sensor1 kernel: Node 1 DMA: 1*4kB (U) 1*8kB (U) 2*16kB (U)
1*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M)
3*4096kB (M) = 15884kB
Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
Nov 06 10:07:05 sensor1 kernel: Node 3 Normal free:44844kB min:45116kB
low:111144kB high:177172kB active_anon:48629496kB inactive_anon:3250728kB
active_file:64kB inactive_file:668kB unevictable:0kB writepending:0kB
present:67094528kB managed:66031300kB mlocked:0kB kernel_stack:5352kB
pagetables:146976kB bounce
Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
Nov 06 10:07:05 sensor1 kernel: Node 1 Normal free:41932kB min:42604kB
low:104956kB high:167308kB active_anon:55633904kB inactive_anon:4082748kB
active_file:560kB inactive_file:0kB unevictable:0kB writepending:0kB
present:63438848kB managed:62359848kB mlocked:0kB kernel_stack:4776kB
pagetables:133008kB bounce:
Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 60894 60894 60894
Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32 free:245356kB min:2372kB
low:5844kB high:9316kB active_anon:3199852kB inactive_anon:44556kB
active_file:100kB inactive_file:0kB unevictable:0kB writepending:0kB
present:3564892kB managed:3499316kB mlocked:0kB kernel_stack:48kB
pagetables:3092kB bounce:0kB free_pcp:4
Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 3392 64286 64286 64286
Nov 06 10:07:05 sensor1 kernel: Node 1 DMA free:15884kB min:8kB low:20kB
high:32kB active_anon:0kB inactive_anon:0kB active_file:0kB
inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB
managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB
free_pcp:0kB local_pcp:0kB free_cma:0kB
Nov 06 10:07:05 sensor1 kernel: Node 3 active_anon:48629496kB
inactive_anon:3250728kB active_file:108kB inactive_file:700kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:13636088kB
dirty:0kB writeback:0kB shmem:9996kB shmem_thp: 0kB shmem_pmdmapped: 0kB
anon_thp: 1337344kB writeback_tmp:0kB unst
Nov 06 10:07:05 sensor1 kernel: Node 1 active_anon:58833756kB
inactive_anon:4127304kB active_file:548kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB mapped:2101152kB dirty:0kB
writeback:0kB shmem:7828kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp:
13608960kB writeback_tmp:0kB unstab
Nov 06 10:07:05 sensor1 kernel: active_anon:26865813 inactive_anon:1844508
isolated_anon:0
active_file:163 inactive_file:0
isolated_file:0
unevictable:0 dirty:0 writeback:0
unstable:0
slab_reclaimable:25057
slab_unreclaimable:65745
mapped:3934289 shmem:4456 pagetables:70769
bounce:0
free:87073 free_pcp:3080 free_cma:0
Nov 06 10:07:05 sensor1 kernel: Mem-Info:
Nov 06 10:07:05 sensor1 kernel: R13: 0000000000001800 R14: 00007f375bd4f780
R15: 0000000000000022
Nov 06 10:07:05 sensor1 kernel: R10: 00007f3618971000 R11: 00007f36189727e0
R12: 00007f375bd4f940
Nov 06 10:07:05 sensor1 kernel: RBP: 00007f361424b000 R08: 0000000000000000
R09: 00007f36189727e0
Nov 06 10:07:05 sensor1 kernel: RDX: 0000000000001780 RSI: 00007f361424b000
RDI: 00007f3618971000
Nov 06 10:07:05 sensor1 kernel: RAX: 00007f3618971000 RBX: 0000000000001800
RCX: 00007f361424c760
Nov 06 10:07:05 sensor1 kernel: RSP: 002b:00007ffd609fbc58 EFLAGS: 00010202
Nov 06 10:07:05 sensor1 kernel: Code: Bad RIP value.
Nov 06 10:07:05 sensor1 kernel: RIP: 0033:0x7f375bf1f925
Nov 06 10:07:05 sensor1 kernel: page_fault+0x1e/0x30
Nov 06 10:07:05 sensor1 kernel: ? page_fault+0x8/0x30
Nov 06 10:07:05 sensor1 kernel: __do_page_fault+0x249/0x4f0
Nov 06 10:07:05 sensor1 kernel: handle_mm_fault+0xd6/0x200
Nov 06 10:07:05 sensor1 kernel: ? __switch_to_asm+0x41/0x70
Nov 06 10:07:05 sensor1 kernel: __handle_mm_fault+0x958/0x1270
Nov 06 10:07:05 sensor1 kernel: alloc_pages_vma+0x74/0x1c0
Nov 06 10:07:05 sensor1 kernel: __alloc_pages_nodemask+0x28b/0x2b0
Nov 06 10:07:05 sensor1 kernel: __alloc_pages_slowpath+0xbd8/0xcb0
Nov 06 10:07:05 sensor1 kernel: out_of_memory+0x1a5/0x430
Nov 06 10:07:05 sensor1 kernel: oom_kill_process.cold.30+0xb/0x1cf
Nov 06 10:07:05 sensor1 kernel: dump_header+0x6b/0x283
Nov 06 10:07:05 sensor1 kernel: dump_stack+0x5c/0x80
Nov 06 10:07:05 sensor1 kernel: Call Trace:
Nov 06 10:07:05 sensor1 kernel: Hardware name: Supermicro Super
Server/H11SSL-i, BIOS 1.0b 04/27/2018
Nov 06 10:07:05 sensor1 kernel: CPU: 3 PID: 51496 Comm: bro Not tainted
4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u1
Nov 06 10:07:05 sensor1 kernel: bro cpuset=/ mems_allowed=1,3
Nov 06 10:07:05 sensor1 kernel: bro invoked oom-killer:
gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),
order=0, oom_score_adj=0
It looks like bro was using 64 GB of RAM (summation of even though the
system has 128 GB installed. (It also looks like there was no free swap.)
Searching the Internet for ways to solve OOM, people suggested turning off
over-commit, and configuring the overcommit ratio a bit higher. Currently,
the values are:
# sysctl vm.overcommit_memory
vm.overcommit_memory = 0
# sysctl vm.overcommit_ratio
vm.overcommit_ratio = 50
In one case the suggestion was to change those to vm.overcommit_memory = 1
and vm.overcommit_ratio = 80 which seems reasonable based on the
description at
https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
Or is there a better way to handle OOM?
Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mailman.ICSI.Berkeley.EDU/pipermail/zeek/attachments/20191106/0410889d/attachment-0001.html
More information about the Zeek
mailing list