[Zeek] Workers dying with "out of memory in new"

Fri Oct 18 08:26:13 PDT 2019

For additional reference:

Linux snout 4.9.0-9-amd64 #1 SMP Debian 4.9.168-1+deb9u5 (2019-08-11) x86_64 GNU/Linux

On 10-11 I patched libssl and libc.
On 10-17 I upgraded sudo (about 30 minutes after the first worker crashed).

[Bro] Crash report from worker-1-12 email received at 16:00

Log output from dpkg for reference:

# less /var/log/dpkg.log |grep "status installed"

2019-10-11 14:59:23 status installed telegraf:amd64 1.12.3-1
2019-10-11 14:59:23 status installed libssl1.0.2:amd64 1.0.2t-1~deb9u1
2019-10-11 14:59:23 status installed libc-bin:amd64 2.24-11+deb9u4
2019-10-11 14:59:23 status installed libssl1.1:amd64 1.1.0l-1~deb9u1
2019-10-11 14:59:23 status installed openssl:amd64 1.1.0l-1~deb9u1
2019-10-11 14:59:24 status installed man-db:amd64 2.7.6.1-2
2019-10-11 14:59:24 status installed libssl1.0-dev:amd64 1.0.2t-1~deb9u1
2019-10-11 14:59:24 status installed libc-bin:amd64 2.24-11+deb9u4
2019-10-17 16:25:47 status installed sudo:amd64 1.8.19p1-2.1+deb9u1
2019-10-17 16:25:47 status installed apache2-utils:amd64 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed apache2-bin:amd64 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed apache2-data:all 2.4.25-3+deb9u9
2019-10-17 16:25:47 status installed systemd:amd64 232-25+deb9u12
2019-10-17 16:25:47 status installed man-db:amd64 2.7.6.1-2
2019-10-17 16:25:48 status installed apache2:amd64 2.4.25-3+deb9u9
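
In case it helps anyone else line up package changes against crash times, here is a rough Python sketch that pulls the "status installed" entries falling inside a time window. The function name installed_between and the window size are my own choices for illustration, and the crash time assumes the 16:00 crash report was on 10-17. The sample lines are copied from the dpkg output above.

```python
from datetime import datetime, timedelta

# A few lines copied from the dpkg.log output above.
SAMPLE = [
    "2019-10-11 14:59:23 status installed libc-bin:amd64 2.24-11+deb9u4",
    "2019-10-17 16:25:47 status installed sudo:amd64 1.8.19p1-2.1+deb9u1",
    "2019-10-17 16:25:47 status installed systemd:amd64 232-25+deb9u12",
]


def installed_between(lines, start, end):
    """Return (timestamp, package, version) for 'status installed'
    entries whose timestamp falls within [start, end]."""
    hits = []
    for line in lines:
        parts = line.split()
        # Expected shape: DATE TIME status installed PACKAGE VERSION
        if len(parts) != 6 or parts[2:4] != ["status", "installed"]:
            continue
        stamp = datetime.strptime(parts[0] + " " + parts[1],
                                  "%Y-%m-%d %H:%M:%S")
        if start <= stamp <= end:
            hits.append((stamp, parts[4], parts[5]))
    return hits


# Assumption: the crash report email arrived at 16:00 on 2019-10-17.
crash = datetime(2019, 10, 17, 16, 0)

# Window (arbitrary): one day before the crash report to one hour after.
for stamp, pkg, version in installed_between(
        SAMPLE, crash - timedelta(days=1), crash + timedelta(hours=1)):
    print(stamp, pkg, version)
```

To run it against the real log, pass the lines of /var/log/dpkg.log instead of SAMPLE. Anything installed only after the first crash (like sudo here) is less likely to be the trigger, which is the main thing this kind of filter makes visible.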

On Fri, Oct 18, 2019 at 11:12 AM Munroe Sollog <mus3 at lehigh.edu> wrote:

> Interestingly enough, we started suffering the same problem at the same
> time.
>
> - 1 node with 44 cores, 256GB of RAM
> - Zeek 2.5.5
> - node.cfg:
>   [worker-1]
>   type=worker
>   host=localhost
>   interface=af_packet::ens4f0
>   lb_method=custom
>   lb_procs=25
>   pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24
>
> - broctl.cfg:
>   MemLimit = 100000000 #100GB
>   setcap.enabled=1
>
> On Fri, Oct 18, 2019 at 10:48 AM Mark Gardner <mkg at vt.edu> wrote:
>
>> We must have crossed some threshold yesterday. Suddenly we are suffering
>> an epidemic of workers dying with "out of memory in new" even though we
>> made no changes. Previously, we would have a few die each day. Now we have
>> had 250 alerts of workers dying and being restarted from 00:00 to 10:00. I
>> have no idea where to start debugging the problem. Any suggestions?
>>
>> What causes a worker to die by running out of memory? The sensors have
>> lots of memory (see below) so I would not expect to have any out of memory
>> deaths. (To monitor the problem, I am in the process of setting up collectd
>> and Grafana.)
>>
>> Some details:
>> - 5 sensors, each with 16-core, AMD Epyc 7351P, 128 GB RAM, Intel X520-T2
>> - Zeek 2.6.1
>> - node.cfg: lb_procs=15, pin_cpus=1-15,
>> af_packet_buffer_size=1*1024*1024*1024
>> - broctl.cfg: setcap enabled
>> - Not shunting any traffic
>>
>> Mark
>> --
>> Mark Gardner
>> --
>> _______________________________________________
>> Zeek mailing list
>> zeek at zeek.org
>> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/zeek
>
>
>
> --
> Munroe Sollog
> Senior Network Engineer
> munroe at lehigh.edu
>

-- 
Munroe Sollog
Senior Network Engineer
munroe at lehigh.edu