[Zeek-Dev] Zeek Supervisor Command-Line Client

Thu Jun 18 07:45:33 PDT 2020

Thanks Robin, that helps.

On Thu, Jun 18, 2020 at 2:11 AM Robin Sommer <robin at corelight.com> wrote:

>
> There are two parts here: (1) deploying the Zeek installation itself,
> and (2) deploying any configuration changes (incl. new Zeek scripts).
>
> For (1), the above applies: we'll rely on standard sysadmin processes
> for updating. That means you'd use "zeekcl" to shut down the cluster
> processes, then run "yum update" (or whatever), then use "zeekcl"
> again to start things up again. (The Zeek supervisor will be running
> already at that point, managed through systemd or whatever you're
> using).
>
> (2) is still a bit up in the air. With 3.2, there won't be any support
> for distributing configurations automatically, but we could add that
> so that config files/scripts/packages do get copied around over
> Broker. Feedback would be appreciated here: What's better, having
> zeekcl manage that, or leaving it to standard sysadmin processes as well?
>

I re-read the design doc, and I think the part I missed the first time
through was suicide on orphaning. (Side note: given the much-needed trend
towards bias-free terminology in technology, perhaps there's a better term
here.) My main concern was Broker version incompatibilities between the
newly installed zcl and the running cluster, which I think that behavior
would address (i.e. to stop a cluster, you stop the supervisor service
on the manager, and then the other services lose their connection and
also stop).

I'm still a bit unclear on how to start a cluster. In my mind, where simply
using the standard process/job control falls short is the need to operate
across multiple physical systems. So, would that be a job for zcl? Or would
the goal be that I have my, say, systemd unit set to keep restarting Zeek
on my worker systems? If a worker can't connect to the manager, it would
presumably die immediately, given its orphaned state.
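
To make the restart-loop option concrete, here is a rough sketch of the
schedule I'd want from whatever keeps restarting an orphaned worker: back
off exponentially up to a cap rather than respawning as fast as possible.
The function name and the base/factor/cap values are made up for
illustration; this is not how the supervisor or any particular service
manager behaves.

```python
import itertools


def restart_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield the wait (in seconds) before each successive restart attempt.

    Hypothetical helper: all three parameters are illustrative assumptions.
    """
    delay = base
    while True:
        yield min(delay, cap)
        # Grow the delay geometrically, but never beyond the cap.
        delay = min(delay * factor, cap)


# Waits before the first eight restart attempts.
first_eight = list(itertools.islice(restart_delays(), 8))
```

With these values, a worker that can't reach the manager ends up retrying
about once a minute instead of spinning.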

The more tightly we couple the nodes together, the more quickly the cluster
will detect failures, but the more sensitive it will be to flapping and
unnecessary restarts. The cluster is relatively fragile right now (e.g. a
manager node going away even for a brief period tends to lead to a crash
on even a relatively busy system, as the backlog won't clear while timers
and other events stack up). So I think that if we're moving cluster
supervision out of a parallel process in `zeekctl cron` and into Zeek
itself, we'll need to improve error detection and graceful recovery where
possible.
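
One way to picture the trade-off between fast failure detection and
flapping is a grace period: a node only treats itself as orphaned once the
connection has been down for longer than some threshold. The sketch below
is purely illustrative. The class, its method names, and the 30-second
value are my assumptions, not anything from the design doc or from Zeek.

```python
class OrphanDetector:
    """Decide when a node should exit after losing its parent connection.

    Hypothetical sketch: a short grace period tolerates brief disconnects,
    at the cost of detecting real failures more slowly.
    """

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.lost_at = None  # time the connection dropped; None while connected

    def connection_lost(self, now):
        # Remember only the start of the outage, not repeated notifications.
        if self.lost_at is None:
            self.lost_at = now

    def connection_restored(self):
        self.lost_at = None

    def should_exit(self, now):
        return self.lost_at is not None and now - self.lost_at >= self.grace_seconds


detector = OrphanDetector(grace_seconds=30)

detector.connection_lost(now=100)
brief_blip = detector.should_exit(now=110)   # 10s outage, inside the grace period
detector.connection_restored()

detector.connection_lost(now=200)
long_outage = detector.should_exit(now=240)  # 40s outage, past the grace period
```

A larger grace period means fewer unnecessary restarts but slower reaction
to a manager that is really gone, which is the tuning question I'm getting
at above.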

  --Vlad