[Zeek] Dropping packets

Justin Azoff justin at corelight.com
Tue Feb 18 11:07:05 PST 2020


On Tue, Feb 18, 2020 at 1:57 PM Joseph Fischetti <Joseph.Fischetti at marist.edu> wrote:

> Hmm.. are you running zeek as root or a regular user? You may need to use
> sudo, or tweak the permissions on the /dev/myri.. (I think?) files.
>
> - We’re running zeek as an unprivileged user. I was able to go onto each
> worker as root and issue the commands to clear the counters.
>

I see.. I bet we could fix the myricom plugin to work around this problem..
basically just treat the initial drop/link numbers as zero. I worked around
it long ago by always clearing the counters when the cluster is restarted..
but never looked into a permanent fix.
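
Roughly what I have in mind, sketched in Python for illustration (the real
plugin is C++, and these names are made up):

    # Sketch: remember the first counter values we see and subtract them,
    # so stale drop/link counts from before a restart read as zero.
    class CounterBaseline:
        def __init__(self):
            self.baseline = {}

        def adjusted(self, name, raw):
            # The first read of each counter becomes the zero point.
            if name not in self.baseline:
                self.baseline[name] = raw
            return raw - self.baseline[name]

    # e.g. if the card reports drops=523424 at startup, adjusted("drops",
    # 523424) is 0, and later polls report only drops since zeek started.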


> That doesn't make a lot of sense.. It's just the name; it doesn't affect
> anything else... Maybe it was just a coincidence?
>
> - When we named the workers “worker-1-A”, etc., the processes that were
> started up were “worker-1-A-1” through “worker-1-A-N”. Is it possible
> that zeekctl doesn’t parse something correctly when it checks to make
> sure the processes are running? Zeekctl stop failed to stop the workers,
> but zeekctl status showed them as stopped. They were still running and
> the logger was still working. The only way to kill them was to do a
> “killall zeek” from the workers.
>
Yeah.. that's the weird part.. it doesn't parse anything or really care
about the names at all; it just keeps track of the pids. You stopped the
old workers before renaming and starting the new ones, right? Not doing so
can cause similar-sounding problems, but I'm not sure if that's what you
ran into.
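
If it happens again, something like this quick sketch (not a zeekctl
feature, just a standalone check) can confirm whether stray zeek processes
are still alive on a worker box:

    # Quick sketch: list zeek processes still running, to compare against
    # what "zeekctl status" claims.
    import subprocess

    def running_zeek_pids():
        # pgrep -x matches the process name exactly.
        out = subprocess.run(["pgrep", "-x", "zeek"],
                             capture_output=True, text=True)
        return [int(p) for p in out.stdout.split()]

    print("zeek pids still running:", running_zeek_pids())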

> That also doesn't make a lot of sense. Pinning the workers just causes
> them to be started via cpuset locked to a core; everything else is the
> same.
>
> How exactly are the workers crashing? Are you getting a crash report
> from zeekctl, or in dmesg?
>
> - No crash reports that I recall. I can try and force it to happen again
> tomorrow. I’d like to let things go for 24 hours and see where we end up.
> We’ve been running for 5 hours and our stats currently look like this
> [1]. Note: I modified the script that you provided to work with more than
> 9 threads. (host = re.sub(r'\-[0-9]*$','', node))
>
Nice :-)  Looks like your overall stats are pretty good with the cleared
counters and the extra workers.
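
For anyone following along, that tweak just strips the trailing thread
index so each process rolls up under its worker name.. roughly:

    import re

    def host_for(node):
        # worker-1-A-10 -> worker-1-A (handles 2+ digit thread indexes)
        return re.sub(r'-[0-9]+$', '', node)

    assert host_for("worker-1-A-10") == "worker-1-A"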

> - We’re currently running 10 threads per worker *unpinned*
>
> - Median load average on both boxes is around 6
>
> - Memory usage on both boxes is around 75G.
>

https://github.com/corelight/zeek-ssl-clear-state and
https://github.com/corelight/zeek-smb-clear-state may help a bit with the
memory usage.. 75G is a bit high for only 10 workers.. though some of that
reported number is the myricom buffer, I think.
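
(Both should install with the Zeek package manager, e.g. "zkg install
zeek-ssl-clear-state", assuming you have zkg set up and the packages are
in the package index.)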


> -          I’m going to find out if there’s something we can do in the
> Aristas so they send some more traffic to the secondary interface.  I need
> to find out how they’re configured and why.
>

It does look like something is set up oddly if the traffic is not evenly
balanced.. Maybe you have two different capture groups set up instead of
just dumping everything into one group split across 4 NICs? The closer you
look, the more problems you find :-)
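
One quick sanity check is to watch per-interface packet counts for a bit..
a sketch, with placeholder interface names (and note Myricom ports in
sniffer mode may not update the standard kernel counters):

    # Sketch: compare rx packet deltas across capture NICs over 10 seconds.
    from pathlib import Path
    import time

    NICS = ["p1p1", "p1p2"]  # placeholder capture interface names

    def rx_packets(nic):
        return int(Path(f"/sys/class/net/{nic}/statistics/rx_packets")
                   .read_text())

    before = {n: rx_packets(n) for n in NICS}
    time.sleep(10)
    for n in NICS:
        print(n, rx_packets(n) - before[n], "packets in 10s")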

-- 
Justin