[Zeek] Workers occasionally using 102% CPU

Pete Nelson petiepooo at gmail.com
Tue Feb 25 08:39:02 PST 2020


Justin,

We agree that a Zeek worker process using 10-20% of the CPU when
essentially idle is a performance issue.  Thank you for your work on
that.  I'm looking forward to the day it's rolled out in
SecurityOnion.

That said...

A Zeek worker process taking 100% of the CPU when essentially idle,
but only sometimes and only after running for an indeterminate amount
of time, is a bug, not merely poor performance.  The strace output
demonstrates this: select() always returns with a file descriptor
ready for IO, the main thread never clears it with a read, and so the
main thread never sleeps.  That behavior is very different from the
baseline, which shows only the performance issue.
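
For anyone who wants to check their own workers, attaching strace to
a worker's PID while it's spinning should show the same pattern; I'd
expect something along these lines to be enough (the PID is a
placeholder):

  strace -f -tt -e trace=select,read,write,epoll_ctl,futex -p <worker pid>

A healthy worker blocks in select() between packets; a spinning one
returns from select() immediately, over and over, without ever
reading the pipe fd.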

To bolster that conclusion: while doing my best to understand the
strace output I'd captured earlier, sifting through the series of
select(), epoll_ctl(), futex(), rt_sigprocmask() and rt_sigtimedwait()
syscalls reminded me just how HARD multi-threaded programming can be,
and how subtle race conditions can be, with no discernible pattern in
when they manifest.  So I tried something: pinning each worker
process to a CPU.

None of the threads have jumped up to 100% since.
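
If anyone else wants to try the same workaround: if I'm reading the
zeekctl docs right, it's just a matter of adding pin_cpus to each
worker entry in node.cfg and redeploying.  The interface and CPU
number below are placeholders for your own topology; with lb_procs,
list one CPU per worker process instead:

  [worker-1]
  type=worker
  host=localhost
  interface=eth0
  pin_cpus=2

Then zeekctl deploy (or however SecurityOnion pushes out node.cfg)
applies it.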

FWIW, I've found another SecurityOnion user who is seeing the same
issue.  He reported that load averages skyrocketed after the latest
upgrade, and saw that 9 out of his 11 Zeek worker processes were
pegged at 100% CPU until he restarted Zeek.  I'm going to suggest he
pin workers to specific CPUs to see if it goes away for him too.
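
If he'd rather not edit node.cfg and restart right away, the affinity
can also be set on the already-running workers from the shell, one
worker PID at a time (PID and CPU number are placeholders):

  taskset -cp 2 <worker pid>

That won't survive a restart, but it should be enough to tell whether
pinning alone makes the spinning stop.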

From the strace output, it appears that the main thread and one of
the child threads get out of sync: there's an extra character in the
pipe used for inter-thread signalling that the main thread never
clears with a read.  Even when a new packet comes in, I think I see
something else (e.g. a signal being unmasked?) trigger the main
thread to read from the pipe, but it still reads only what it thinks
is the right number of characters rather than everything available,
leaving the extra character in the pipe, so select() always returns
that fd as ready for IO.
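
To make that failure mode concrete, here is a tiny standalone sketch
of the pattern I suspect.  It is not Zeek's code, just the self-pipe
idiom with a reader that drains a counted number of wakeup bytes
instead of everything available:

  /* Minimal sketch (not Zeek's code) of a self-pipe select() loop
   * that spins when the reader drains a counted number of wakeup
   * bytes instead of everything available in the pipe. */
  #include <unistd.h>
  #include <sys/select.h>

  int main(void) {
      int p[2];
      if (pipe(p) != 0)
          return 1;

      /* The race: the child thread signals twice, but the main
       * thread has only accounted for one pending wakeup. */
      int pending = 1;
      if (write(p[1], "xx", 2) != 2)
          return 1;

      for (;;) {
          fd_set rfds;
          FD_ZERO(&rfds);
          FD_SET(p[0], &rfds);

          select(p[0] + 1, &rfds, NULL, NULL, NULL);

          /* Drain only as many bytes as we think were signalled.
           * The stray second byte is never read, so select() keeps
           * reporting p[0] as ready and the loop never blocks. */
          while (pending > 0 && FD_ISSET(p[0], &rfds)) {
              char c;
              if (read(p[0], &c, 1) == 1)
                  pending--;
          }
          /* No packets, nothing to do, yet CPU sits at 100%. */
      }
  }

If that is roughly what's happening inside Zeek, draining the pipe
with non-blocking reads until EAGAIN, rather than a counted read,
would mask the symptom even if the underlying accounting race stayed.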

This may be specific to SecurityOnion's configuration or set of Zeek
scripts, or it may be happening on all instances where Zeek workers
have no CPU affinity.  perf wouldn't help there, as we need the
specific sequence that triggers the race condition rather than a
heatmap of function calls.  I'd still like to get a trace that shows
the function call sequence at the moment this bug is triggered, but
haven't had any luck so far.
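
The closest thing I can think of, short of a full function trace, is
grabbing stack snapshots from a worker once it starts spinning, e.g.
with gdb (PID is a placeholder):

  gdb -batch -ex 'thread apply all bt' -p <worker pid>

That only gives a point-in-time picture rather than the sequence that
led up to it, but it might at least show which loop the main thread
is stuck in.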

Does this make sense to you?  Does it trigger any memories of changes
that went into 3.0 that might have altered behavior in that area?  Do
you run your tests without pinning CPUs?

I saw in a ZeekWeek2019 slide that 3.0.x is considered a long-term
stable release.  Please let me know if there's anything I can do to
help diagnose this further and make sure it's stable for everyone.
--
Pete
