[Zeek] Worker being "killed nohup"

Thu Nov 7 07:37:18 PST 2019

Your comments have been most helpful, Michał. I am very appreciative.

On Wed, Nov 6, 2019 at 10:18 PM Michał Purzyński <michalpurzynski1 at gmail.com>
wrote:

> Now, this data expires (unless you have a script that never does that),
> but it might be the amount of state grows too quickly and the expiration is
> not quick enough, to free up some memory.
>

One hypothesis is that we have lots of very long-running connections.
Looking through the first 1M connections in the largest one-hour conn.log for
2019-10-21 (the biggest log in the past month), the min, max, and average
connection durations are 0.0 s, 16750 s, and 37.4 s respectively (with a
stdev of 285.5 s). 66.9% of the connections last less than a second, and
86.4% last no longer than the mean duration.

So it doesn't seem like there are tons of long-running connections. We just
have many connections (73,403,294 during the single hour mentioned above).
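
For anyone wanting to reproduce this kind of summary, here is a minimal Python sketch of how such statistics could be computed from a JSON-formatted conn.log. It is not necessarily how the numbers above were produced. The function name duration_stats, the 1M-line limit parameter, and the choice to skip records that have no duration field are all assumptions made for illustration.

```python
import json
import statistics
from itertools import islice


def duration_stats(lines, limit=1_000_000):
    """Summarize the `duration` field over the first `limit` lines of a
    JSON conn.log (one JSON object per line).

    Assumption: records without a `duration` field are skipped rather
    than counted as zero-length connections.
    """
    durations = []
    for line in islice(lines, limit):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "duration" in record:
            durations.append(float(record["duration"]))

    if not durations:
        raise ValueError("no records with a duration field")

    mean = statistics.fmean(durations)
    count = len(durations)
    return {
        "count": count,
        "min": min(durations),
        "max": max(durations),
        "mean": mean,
        # Population standard deviation over the sampled records.
        "stdev": statistics.pstdev(durations),
        "pct_under_1s": 100.0 * sum(d < 1.0 for d in durations) / count,
        "pct_at_or_below_mean": 100.0 * sum(d <= mean for d in durations) / count,
    }
```

Passing an open file object for an uncompressed conn.log as lines should work, since the function only iterates over it line by line.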

> My quick suspect would be the scan.bro / scan.zeek old script that comes
> bundled with Zeek. If you have it enabled, disable and see if you're still
> crashing.
>

We commented out @load misc/scan in bro-2.6.4/share/bro/site/local.bro. Is
that what you meant?

> You can then take a look at your scripts and see if there is some data
> structure that will grow per connection, over time - and how quickly you
> purge data from it.
>

The only extra configuration we added to local.bro was @load
tuning/json-logs. But I wouldn't think that would cause a large increase in
memory use (even though it does cause an increase in written file size).

One of the things I did yesterday was add 128 GB of swap file on NVMe to
the workers to augment the 1 GB swap partition already in place. That seems
to be helping. The one sensor I checked this morning was using 6.5 GB of
swap. Yesterday it would have crashed after exceeding 1 GB of swap. So maybe I don't
need to worry about long running connections if I have enough swap.

I need to set up monitoring on the cluster to make it easier to diagnose
these kinds of problems.
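
As a starting point for that monitoring, here is a rough Python sketch of a swap check that parses /proc/meminfo-style text on Linux. The function names swap_usage_kib and swap_alert and the 50% default threshold are hypothetical, chosen only for illustration; a real setup would more likely feed these values into an existing monitoring system.

```python
def swap_usage_kib(meminfo_text):
    """Return (total, used) swap in KiB from /proc/meminfo-style text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts:
            # The first token is the numeric value; a trailing "kB" unit,
            # if present, is ignored.
            fields[key.strip()] = int(parts[0])
    total = fields["SwapTotal"]
    free = fields["SwapFree"]
    return total, total - free


def swap_alert(meminfo_text, threshold_fraction=0.5):
    """Return True when used swap is at or above the given fraction of total."""
    total, used = swap_usage_kib(meminfo_text)
    return total > 0 and used / total >= threshold_fraction
```

On a worker, the text would come from reading /proc/meminfo, and the result of swap_alert could be checked periodically, for example from cron.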

Mark