<div dir="ltr"><div dir="ltr"><div>Thanks Robin, that helps.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 18, 2020 at 2:11 AM Robin Sommer <<a href="mailto:robin@corelight.com">robin@corelight.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
There are two parts here: (1) deploying the Zeek installation itself,<br>
and (2) deploying any configuration changes (incl. new Zeek scripts).<br>
<br>
For (1), the above applies: we'll rely on standard sysadmin processes<br>
for updating. That means you'd use "zeekcl" to shut down the cluster<br>
processes, then run "yum update" (or whatever), then use "zeekcl"<br>
again to start things up again. (The Zeek supervisor will be running<br>
already at that point, managed through systemd or whatever you're<br>
using).<br>
<br>
(2) is still a bit up in the air. With 3.2, there won't be any support<br>
for distributing configurations automatically, but we could add that<br>
so that config files/scripts/packages do get copied around over<br>
Broker. Feedback would be appreciated here: What's better, having<br>
zeekcl manage that, or leave it to standard sysadmin process as well?<br></blockquote><div><br></div><div>I re-read the design doc, and I think that the part I missed the first time through was suicide on orphaning. (Side-note: Given the much-needed trend towards bias-free terminology in technology, perhaps there's a better term here). My main concern was Broker version incompatibilities between the newly installed zcl and the running cluster, which I think that behavior would address (i.e. to stop a cluster, you stop the supervisor service on the manager, and then the other services will lose their connection and also stop).<br></div><div><br></div><div>I'm still a bit unclear on how to start a cluster. In my mind, where simply using the standard process/job control falls short is the need to operate across multiple physical systems. So, would that be a job for zcl? Or would the desired goal be that I have my, say, systemd unit set to constantly restart Zeek on my worker systems? If a worker can't connect to the manager, it would presumably die immediately given the orphaned state.<br></div><div><br></div>The more tightly we couple the nodes together, the more quickly the cluster will detect failures, but the more sensitive it will be to flapping and unnecessary restarts. The cluster is relatively fragile right now (e.g. a manager node going away even for a brief period tends to lead to a crash, because on even a relatively busy system the backlog won't clear as timers and other events stack up). So I think that if we're moving cluster supervision out of a parallel process in `zeekctl cron` and into Zeek itself, we'll need to improve error detection and graceful recovery where possible.</div><div class="gmail_quote"><br></div><div class="gmail_quote"> --Vlad<br></div><div class="gmail_quote"><div><br></div></div></div>