[Zeek] Workers dying with "out of memory in new"

Munroe Sollog mus3 at lehigh.edu
Fri Oct 18 08:12:20 PDT 2019


Interestingly enough, we started suffering the same problem at the same
time.

- 1 node with 44 cores, 256GB of RAM
- Zeek 2.5.5
- node.cfg:

  [worker-1]
  type=worker
  host=localhost
  interface=af_packet::ens4f0
  lb_method=custom
  lb_procs=25
  pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24


- broctl.cfg:

  MemLimit = 100000000  # 100 GB
  setcap.enabled=1


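To see which workers are actually growing before they hit the limit, one option is to poll each worker's resident set size. A minimal sketch in Python, using `ps` output; the process-name match `"bro"` is an assumption for Zeek 2.x installs and should be adjusted locally:

```python
import subprocess

def worker_rss_mb(pattern="bro"):
    """Return {pid: rss_mb} for processes whose command name matches `pattern`.

    Parses `ps -eo pid=,rss=,comm=` output. The default pattern "bro" is an
    assumption for Zeek 2.x workers; change it to match your install.
    """
    out = subprocess.run(
        ["ps", "-eo", "pid=,rss=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    result = {}
    for line in out.splitlines():
        pid, rss, comm = line.split(None, 2)
        if pattern in comm:
            result[int(pid)] = int(rss) / 1024  # ps reports RSS in KiB
    return result
```

Run on a cron or watch interval, logging the output per worker makes it easy to spot whether one worker leaks steadily or all of them balloon together, which points at a traffic change rather than a script bug.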

On Fri, Oct 18, 2019 at 10:48 AM Mark Gardner <mkg at vt.edu> wrote:

> We must have crossed some threshold yesterday. Suddenly we are suffering
> an epidemic of workers dying with "out of memory in new" even though we
> made no changes. Previously, we would have a few die each day. Now we have
> had 250 alerts of workers dying and being restarted from 00:00 to 10:00. I
> have no idea where to start debugging the problem. Any suggestions?
>
> What causes a worker to die by running out of memory? The sensors have
> lots of memory (see below) so I would not expect to have any out of memory
> deaths. (To monitor the problem, I am in the process of setting up collectd
> and Grafana.)
>
> Some details:
> - 5 sensors, each with 16-core, AMD Epyc 7351P, 128 GB RAM, Intel X520-T2
> - Zeek 2.6.1
> - node.cfg: lb_procs=15, pin_cpus=1-15,
> af_packet_buffer_size=1*1024*1024*1024
> - broctl.cfg: setcap enabled
> - Not shunting any traffic
>
> Mark
> --
> Mark Gardner
> --
> _______________________________________________
> Zeek mailing list
> zeek at zeek.org
> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/zeek



-- 
Munroe Sollog
Senior Network Engineer
munroe at lehigh.edu