<div dir="ltr"><div dir="ltr"><div>Thanks Robin, that helps.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jun 18, 2020 at 2:11 AM Robin Sommer &lt;<a href="mailto:robin@corelight.com">robin@corelight.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

There are two parts here: (1) deploying the Zeek installation itself,<br>

and (2) deploying any configuration changes (incl. new Zeek scripts).<br>

<br>

For (1), the above applies: we&#39;ll rely on standard sysadmin processes<br>

for updating. That means you&#39;d use &quot;zeekcl&quot; to shut down the cluster<br>

processes, then run &quot;yum update&quot; (or whatever), then use &quot;zeekcl&quot;<br>

to start things back up. (The Zeek supervisor will already be running<br>

at that point, managed through systemd or whatever you&#39;re<br>

using.)<br>

<br>

(2) is still a bit up in the air. With 3.2, there won&#39;t be any support<br>

for distributing configurations automatically, but we could add that<br>

so that config files/scripts/packages do get copied around over<br>

Broker. Feedback would be appreciated here: What&#39;s better, having<br>

zeekcl manage that, or leaving it to standard sysadmin processes as well?<br></blockquote><div><br></div><div>I re-read the design doc, and I think the part I missed the first time through was suicide on orphaning. (Side note: given the much-needed trend toward bias-free terminology in technology, perhaps there&#39;s a better term here.) My main concern was Broker version incompatibilities between the newly installed zcl and the running cluster, which I think that mechanism addresses (i.e., to stop a cluster, you stop the supervisor service on the manager; the other services then lose their connection and stop as well).<br></div><div><br></div><div>I&#39;m still a bit unclear on how to start a cluster. In my mind, where standard process/job control falls short is the need to operate across multiple physical systems. So, would that be a job for zcl? Or would the intended approach be to have, say, a systemd unit that keeps restarting Zeek on my worker systems? If a worker can&#39;t connect to the manager, it would presumably die immediately given the orphaned state.<br></div><div><br></div>The more tightly we couple the nodes together, the faster failures will be detected, but the more sensitive the cluster will be to flapping and unnecessary restarts. The cluster is relatively fragile right now (e.g., a manager node going away even briefly tends to lead to a crash on even a relatively busy system, because the backlog doesn&#39;t clear as timers and other events stack up). So I think that if we&#39;re moving cluster supervision out of a parallel process in `zeekctl cron` and into Zeek itself, we&#39;ll need to improve error detection and graceful recovery where possible.</div><div class="gmail_quote"><br></div><div class="gmail_quote">  --Vlad<br></div><div class="gmail_quote"><div><br></div></div></div>