[Zeek] Worker being "killed nohup"
Michał Purzyński
michalpurzynski1 at gmail.com
Wed Nov 6 19:18:12 PST 2019
Sure!
Zeek internally keeps state for each connection - and quite possibly
information in other data structures that your scripts create. If you have
lots of events and each of them adds some data to those data structures, it
can eventually eat all your RAM.
To give you an example: a while ago, an intern of mine designed a brilliant
script:
event http_reply(c: connection, version: string, code: count,
                 reason: string) &priority=3
    {
    (...)
    if ( code >= 400 )
        {
        add c$http$tags[HTTP_ERROR];
        SumStats::observe("http.excessive_errors.attacker_badhits",
                          [$host=c$id$orig_h],
                          [$str=fmt("%d %s%s", code, c$http$host, c$http$uri)]);
        SumStats::observe("http.excessive_errors.victim_badhits",
                          [$host=c$id$resp_h],
                          [$str=fmt("%d %s%s", code, c$http$host, c$http$uri)]);
        }
    else
        {
        SumStats::observe("http.excessive_errors.attacker_goodhits",
                          [$host=c$id$orig_h], []);
        SumStats::observe("http.excessive_errors.victim_goodhits",
                          [$host=c$id$resp_h], []);
        }
    }
This script stores, per client host, the ratio of good vs. bad HTTP
transactions, which can be used for behavioural profiling of a client: a
scanner will mostly produce "bad hits" with return codes >= 400.
This all worked well for 15 minutes, and then a cluster of 11 nodes with
64 GB of RAM each ran out of memory. See, this cluster monitored (among
other things) the Firefox update system, with 500,000,000 clients talking
to it.
Why did it crash? Because the SumStats framework was adding data to its
internal data structures for every connection.
Notice that there is technically no memory leak; given 2 TB of RAM, this
might even have worked.
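The scale alone explains the crash. As a back-of-envelope sketch (the
2,000-bytes-per-host figure is an illustrative assumption, not a measured
Zeek number):

```python
# Rough, illustrative arithmetic for per-host SumStats state at that scale.
# Both numbers below are assumptions for the sketch, not measured Zeek values.

clients = 500_000_000      # distinct client IPs seen by the cluster
bytes_per_key = 2_000      # assumed: key, counters, plus a few stored URL strings

total_gb = clients * bytes_per_key / 1024**3
per_node_gb = total_gb / 11    # 11 worker nodes, 64 GB RAM each

print(f"total state: ~{total_gb:.0f} GB, per node: ~{per_node_gb:.0f} GB")
```

Even under these rough assumptions the per-node state exceeds the 64 GB
each worker had, with no leak required.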
Basically, if you have tons of connections (not simply high bandwidth or
packet rates), the amount of memory needed to keep track of all of them
can simply be too much.
Now, this state does expire (unless you have a script that prevents it),
but the amount of state may grow too quickly for expiration to free up
memory fast enough.
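The race between growth and expiration can be sketched with a toy model
(the connection rate, TTL, and per-entry size here are made-up assumptions,
not Zeek internals):

```python
# Toy model of state that expires: new entries arrive at a fixed rate and
# each lives for a fixed TTL, so the steady-state table size is rate * TTL.
# The rate, TTL, and per-entry size below are made up for illustration.

def steady_state_entries(new_per_sec: int, ttl_sec: int) -> int:
    return new_per_sec * ttl_sec

# 50,000 new connections/sec with a 5-minute expiration:
entries = steady_state_entries(50_000, 300)
mem_gb = entries * 500 / 1024**3   # assume ~500 bytes of state per entry

print(f"{entries} live entries, ~{mem_gb:.1f} GB")
```

Steady-state size is simply rate times TTL, so doubling either one doubles
the memory footprint even though every entry eventually expires.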
My first suspect would be the old scan.bro / scan.zeek script that comes
bundled with Zeek. If you have it enabled, disable it and see if you're
still crashing.
You can then take a look at your own scripts and check whether some data
structure grows per connection over time - and how quickly you purge data
from it.
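In Zeek script the usual guard is an expiration attribute such as
&create_expire on the table; here is a minimal Python sketch of that purge
pattern (the TTL is an arbitrary example value):

```python
import time

# Minimal sketch of the purge pattern that Zeek's &create_expire attribute
# gives you on a table: drop entries older than a TTL so per-connection
# state cannot grow without bound.

class ExpiringTable:
    def __init__(self, ttl_sec: float):
        self.ttl = ttl_sec
        self.data = {}                   # key -> (value, creation time)

    def put(self, key, value):
        self.data[key] = (value, time.monotonic())

    def get(self, key):
        item = self.data.get(key)
        return item[0] if item else None

    def purge(self):
        now = time.monotonic()
        expired = [k for k, (_, t) in self.data.items() if now - t > self.ttl]
        for k in expired:
            del self.data[k]

t = ExpiringTable(ttl_sec=0.05)
t.put("conn1", "state")
t.purge()
print(t.get("conn1"))                    # still fresh, so the value survives
time.sleep(0.1)
t.purge()
print(t.get("conn1"))                    # past the TTL, so it has been purged
```

The point of the sketch: without the purge step the dictionary only ever
grows, which is exactly the "unbound state growth" failure mode.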
On Wed, Nov 6, 2019 at 4:23 PM Jeffry Lang <jefflang at vt.edu> wrote:
> Can you expand on "tons of state"? It sounds like something I'm familiar
> with but want to make sure. I work with Mark and we've been talking a lot
> about this.
>
> On Wed, Nov 6, 2019, 6:52 PM Michał Purzyński <michalpurzynski1 at gmail.com>
> wrote:
>
>> Let's dig into the root cause.
>> Your OOM most likely happens because either
>> - you have tons of state and hit something that Justin neatly described
>> as "unbound state growth"
>> - or your logger threads cannot write messages quickly enough
>>
>> Given the OOM output it looks to me like workers are to blame. Do you
>> have any custom scripts? Do you use the stock scan.bro or any other
>> SumStats scripts?
>>
>> Can you observe the system for a while, in something like htop (or top
>> with threads enabled), to see which processes eat the most memory, and
>> report back?
>>
>>
>> On Wed, Nov 6, 2019 at 11:13 AM Mark Gardner <mkg at vt.edu> wrote:
>>
>>> On Wed, Nov 6, 2019 at 1:13 PM Jeff Barber <jsbarber60 at gmail.com> wrote:
>>>
>>>> > I'm running Zeek 2.6.4 and I have been seeing occasional error
>>>> messages of the form:
>>>> >
>>>> > run-bro: line 110: 42574 Killed nohup
>>>> ${pin_command} $pin_cpu "$mybro" "$@"
>>>> >
>>>> > The workers seem to be restarted fine and other than the error
>>>> message, I haven't noticed any ill behavior. What should I do about the
>>>> error messages?
>>>>
>>>> I would check your syslog. Assuming you are running linux, if your
>>>> system runs out of memory, the kernel will go find the biggest process
>>>> and kill it. This can often be a zeek process as they tend to grow large as
>>>> more connections are tracked (depending on many factors: what scripts you
>>>> are running, what you're logging, what kind of traffic is being seen,
>>>> etc.). If that's happening, you should see something in syslog containing
>>>> the string "invoked oom-killer:". If you look at the surrounding lines,
>>>> there should be some info on process sizes showing why it was selected.
>>>>
>>>
>>> Good catch Jeff. A crash email arrived at 10:10 and the journal entry
>>> starting around 10:07 contained the following:
>>>
>>> Nov 06 10:07:08 sensor1 kernel: oom_reaper: reaped process 43437 (bro),
>>> now anon-rss:0kB, file-rss:1048576kB, shmem-rss:0kB
>>> Nov 06 10:07:05 sensor1 kernel: Killed process 43437 (bro)
>>> total-vm:67116888kB, anon-rss:57339476kB, file-rss:1042296kB, shmem-rss:0kB
>>> Nov 06 10:07:05 sensor1 kernel: Out of memory: Kill process 43437 (bro)
>>> score 444 or sacrifice child
>>> Nov 06 10:07:05 sensor1 kernel: [ 54824] 47 54824 1516502 1221530
>>> 11284480 80 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 54817] 47 54817 1661 56
>>> 49152 16 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51543] 47 51543 1516502 1281056
>>> 11886592 3616 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51542] 47 51542 1516502 1285841
>>> 11964416 2296 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51539] 47 51539 1516502 1277957
>>> 11890688 2817 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51537] 47 51537 1516502 1282107
>>> 11829248 2543 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51533] 47 51533 1516502 1276417
>>> 11939840 7591 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51532] 47 51532 1516502 1290264
>>> 12042240 8006 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51523] 47 51523 1516502 1281160
>>> 11894784 7479 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51521] 47 51521 1516502 1285423
>>> 11984896 7957 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51520] 47 51520 1745878 1370574
>>> 12668928 6908 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51501] 47 51501 1516502 1286290
>>> 11788288 6967 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51499] 47 51499 1661 22
>>> 49152 49 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51498] 47 51498 1516502 1276891
>>> 11882496 7233 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51496] 47 51496 1516502 1287064
>>> 11948032 7685 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51491] 47 51491 1661 18
>>> 53248 55 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51487] 47 51487 1661 18
>>> 57344 54 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51474] 47 51474 1661 21
>>> 49152 52 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51471] 47 51471 1516502 1277820
>>> 11890688 7964 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51465] 47 51465 1661 20
>>> 53248 53 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51455] 47 51455 1661 23
>>> 57344 49 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51440] 47 51440 1661 18
>>> 49152 54 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51438] 47 51438 1661 23
>>> 49152 49 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51428] 47 51428 1661 18
>>> 49152 54 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51414] 47 51414 1661 16
>>> 53248 56 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51405] 47 51405 1661 20
>>> 53248 52 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51397] 47 51397 1661 14
>>> 45056 59 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 51375] 47 51375 1661 15
>>> 49152 58 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 43437] 47 43437 16779222 14595446
>>> 122212352 129972 0 bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 43343] 47 43343 1661 2
>>> 49152 70 0 run-bro
>>> Nov 06 10:07:05 sensor1 kernel: [ 43183] 47 43183 42625 4
>>> 106496 608 0 (sd-pam)
>>> Nov 06 10:07:05 sensor1 kernel: [ 43181] 47 43181 5290 0
>>> 77824 342 0 systemd
>>> Nov 06 10:07:05 sensor1 kernel: [ 996] 0 996 1346 0
>>> 49152 33 0 agetty
>>> Nov 06 10:07:05 sensor1 kernel: [ 995] 0 995 1403 0
>>> 45056 32 0 agetty
>>> Nov 06 10:07:05 sensor1 kernel: [ 994] 0 994 3963 38
>>> 69632 178 -1000 sshd
>>> Nov 06 10:07:05 sensor1 kernel: [ 991] 0 991 2119 20
>>> 53248 35 0 cron
>>> Nov 06 10:07:05 sensor1 kernel: [ 990] 105 990 2210 97
>>> 57344 55 -900 dbus-daemon
>>> Nov 06 10:07:05 sensor1 kernel: [ 977] 0 977 56456 90
>>> 86016 378 0 rsyslogd
>>> Nov 06 10:07:05 sensor1 kernel: [ 968] 0 968 4875 78
>>> 77824 189 0 systemd-logind
>>> Nov 06 10:07:05 sensor1 kernel: [ 944] 0 944 2021 1
>>> 57344 799 0 haveged
>>> Nov 06 10:07:05 sensor1 kernel: [ 931] 100 931 23270 26
>>> 86016 207 0 systemd-timesyn
>>> Nov 06 10:07:05 sensor1 kernel: [ 640] 0 640 5579 14
>>> 69632 309 -1000 systemd-udevd
>>> Nov 06 10:07:05 sensor1 kernel: [ 615] 0 615 74225 2624
>>> 638976 568 0 systemd-journal
>>> Nov 06 10:07:05 sensor1 kernel: [ pid ] uid tgid total_vm rss
>>> pgtables_bytes swapents oom_score_adj name
>>> Nov 06 10:07:05 sensor1 kernel: Tasks state (memory values in pages):
>>> Nov 06 10:07:05 sensor1 kernel: 0 pages hwpoisoned
>>> Nov 06 10:07:05 sensor1 kernel: 551979 pages reserved
>>> Nov 06 10:07:05 sensor1 kernel: 0 pages HighMem/MovableOnly
>>> Nov 06 10:07:05 sensor1 kernel: 33528566 pages RAM
>>> Nov 06 10:07:05 sensor1 kernel: Total swap = 976892kB
>>> Nov 06 10:07:05 sensor1 kernel: Free swap = 0kB
>>> Nov 06 10:07:05 sensor1 kernel: Swap cache stats: add 759161, delete
>>> 711025, find 25951/48832
>>> Nov 06 10:07:05 sensor1 kernel: 48128 pages in swap cache
>>> Nov 06 10:07:05 sensor1 kernel: 52774 total pagecache pages
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0
>>> hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 hugepages_total=0
>>> hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0
>>> hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 hugepages_total=0
>>> hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 Normal: 73*4kB (U) 150*8kB (UH)
>>> 59*16kB (UH) 1325*32kB (UH) 5*64kB (H) 2*128kB (H) 0*256kB 0*512kB 0*1024kB
>>> 0*2048kB 0*4096kB = 45412kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 Normal: 78*4kB (UMEH) 213*8kB
>>> (UEH) 70*16kB (UH) 1232*32kB (UMH) 7*64kB (H) 0*128kB 0*256kB 0*512kB
>>> 0*1024kB 0*2048kB 0*4096kB = 43008kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32: 697*4kB (UME) 647*8kB
>>> (UME) 392*16kB (UME) 190*32kB (UME) 94*64kB (UE) 51*128kB (UE) 63*256kB
>>> (UME) 240*512kB (UME) 72*1024kB (UME) 0*2048kB 0*4096kB = 245596kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA: 1*4kB (U) 1*8kB (U) 2*16kB
>>> (U) 1*32kB (U) 3*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB
>>> (M) 3*4096kB (M) = 15884kB
>>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 Normal free:44844kB min:45116kB
>>> low:111144kB high:177172kB active_anon:48629496kB inactive_anon:3250728kB
>>> active_file:64kB inactive_file:668kB unevictable:0kB writepending:0kB
>>> present:67094528kB managed:66031300kB mlocked:0kB kernel_stack:5352kB
>>> pagetables:146976kB bounce
>>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 0 0 0
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 Normal free:41932kB min:42604kB
>>> low:104956kB high:167308kB active_anon:55633904kB inactive_anon:4082748kB
>>> active_file:560kB inactive_file:0kB unevictable:0kB writepending:0kB
>>> present:63438848kB managed:62359848kB mlocked:0kB kernel_stack:4776kB
>>> pagetables:133008kB bounce:
>>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 0 60894 60894 60894
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA32 free:245356kB min:2372kB
>>> low:5844kB high:9316kB active_anon:3199852kB inactive_anon:44556kB
>>> active_file:100kB inactive_file:0kB unevictable:0kB writepending:0kB
>>> present:3564892kB managed:3499316kB mlocked:0kB kernel_stack:48kB
>>> pagetables:3092kB bounce:0kB free_pcp:4
>>> Nov 06 10:07:05 sensor1 kernel: lowmem_reserve[]: 0 3392 64286 64286
>>> 64286
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 DMA free:15884kB min:8kB low:20kB
>>> high:32kB active_anon:0kB inactive_anon:0kB active_file:0kB
>>> inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB
>>> managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB
>>> free_pcp:0kB local_pcp:0kB free_cma:0kB
>>> Nov 06 10:07:05 sensor1 kernel: Node 3 active_anon:48629496kB
>>> inactive_anon:3250728kB active_file:108kB inactive_file:700kB
>>> unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:13636088kB
>>> dirty:0kB writeback:0kB shmem:9996kB shmem_thp: 0kB shmem_pmdmapped: 0kB
>>> anon_thp: 1337344kB writeback_tmp:0kB unst
>>> Nov 06 10:07:05 sensor1 kernel: Node 1 active_anon:58833756kB
>>> inactive_anon:4127304kB active_file:548kB inactive_file:0kB unevictable:0kB
>>> isolated(anon):0kB isolated(file):0kB mapped:2101152kB dirty:0kB
>>> writeback:0kB shmem:7828kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp:
>>> 13608960kB writeback_tmp:0kB unstab
>>> Nov 06 10:07:05 sensor1 kernel: active_anon:26865813
>>> inactive_anon:1844508 isolated_anon:0
>>> active_file:163 inactive_file:0
>>> isolated_file:0
>>> unevictable:0 dirty:0 writeback:0
>>> unstable:0
>>> slab_reclaimable:25057
>>> slab_unreclaimable:65745
>>> mapped:3934289 shmem:4456
>>> pagetables:70769 bounce:0
>>> free:87073 free_pcp:3080 free_cma:0
>>> Nov 06 10:07:05 sensor1 kernel: Mem-Info:
>>> Nov 06 10:07:05 sensor1 kernel: R13: 0000000000001800 R14:
>>> 00007f375bd4f780 R15: 0000000000000022
>>> Nov 06 10:07:05 sensor1 kernel: R10: 00007f3618971000 R11:
>>> 00007f36189727e0 R12: 00007f375bd4f940
>>> Nov 06 10:07:05 sensor1 kernel: RBP: 00007f361424b000 R08:
>>> 0000000000000000 R09: 00007f36189727e0
>>> Nov 06 10:07:05 sensor1 kernel: RDX: 0000000000001780 RSI:
>>> 00007f361424b000 RDI: 00007f3618971000
>>> Nov 06 10:07:05 sensor1 kernel: RAX: 00007f3618971000 RBX:
>>> 0000000000001800 RCX: 00007f361424c760
>>> Nov 06 10:07:05 sensor1 kernel: RSP: 002b:00007ffd609fbc58 EFLAGS:
>>> 00010202
>>> Nov 06 10:07:05 sensor1 kernel: Code: Bad RIP value.
>>> Nov 06 10:07:05 sensor1 kernel: RIP: 0033:0x7f375bf1f925
>>> Nov 06 10:07:05 sensor1 kernel: page_fault+0x1e/0x30
>>> Nov 06 10:07:05 sensor1 kernel: ? page_fault+0x8/0x30
>>> Nov 06 10:07:05 sensor1 kernel: __do_page_fault+0x249/0x4f0
>>> Nov 06 10:07:05 sensor1 kernel: handle_mm_fault+0xd6/0x200
>>> Nov 06 10:07:05 sensor1 kernel: ? __switch_to_asm+0x41/0x70
>>> Nov 06 10:07:05 sensor1 kernel: __handle_mm_fault+0x958/0x1270
>>> Nov 06 10:07:05 sensor1 kernel: alloc_pages_vma+0x74/0x1c0
>>> Nov 06 10:07:05 sensor1 kernel: __alloc_pages_nodemask+0x28b/0x2b0
>>> Nov 06 10:07:05 sensor1 kernel: __alloc_pages_slowpath+0xbd8/0xcb0
>>> Nov 06 10:07:05 sensor1 kernel: out_of_memory+0x1a5/0x430
>>> Nov 06 10:07:05 sensor1 kernel: oom_kill_process.cold.30+0xb/0x1cf
>>> Nov 06 10:07:05 sensor1 kernel: dump_header+0x6b/0x283
>>> Nov 06 10:07:05 sensor1 kernel: dump_stack+0x5c/0x80
>>> Nov 06 10:07:05 sensor1 kernel: Call Trace:
>>> Nov 06 10:07:05 sensor1 kernel: Hardware name: Supermicro Super
>>> Server/H11SSL-i, BIOS 1.0b 04/27/2018
>>> Nov 06 10:07:05 sensor1 kernel: CPU: 3 PID: 51496 Comm: bro Not tainted
>>> 4.19.0-6-amd64 #1 Debian 4.19.67-2+deb10u1
>>> Nov 06 10:07:05 sensor1 kernel: bro cpuset=/ mems_allowed=1,3
>>> Nov 06 10:07:05 sensor1 kernel: bro invoked oom-killer:
>>> gfp_mask=0x6280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),
>>> order=0, oom_score_adj=0
>>>
>>> It looks like bro was using 64 GB of RAM (summing the processes listed
>>> above) even though the system has 128 GB installed. (It also looks like
>>> there was no free swap.) Searching the Internet for ways to solve OOM,
>>> people suggested turning off over-commit and configuring the overcommit
>>> ratio a bit higher. Currently, the values are:
>>>
>>> # sysctl vm.overcommit_memory
>>> vm.overcommit_memory = 0
>>> # sysctl vm.overcommit_ratio
>>> vm.overcommit_ratio = 50
>>>
>>> In one case the suggestion was to change those to vm.overcommit_memory
>>> = 1 and vm.overcommit_ratio = 80 which seems reasonable based on the
>>> description at
>>> https://www.kernel.org/doc/Documentation/vm/overcommit-accounting.
>>>
>>> Or is there a better way to handle OOM?
>>>
>>> Mark
>>> _______________________________________________
>>> Zeek mailing list
>>> zeek at zeek.org
>>> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/zeek
>>