<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;">


<br class="">


<blockquote type="cite" class="">On May 22, 2018, at 10:26 AM, Robin Sommer &lt;<a href="mailto:robin@icir.org" class="">robin@icir.org</a>&gt; wrote:<br class="">


<br class="">


With such a large change, I'm sure there'll be some more kinks to iron<br class="">


out still; that's where we need everybody's help. If you have an<br class="">


environment where you can test drive new Bro versions, please give<br class="">


this a try. We're interested in any feedback you have, both specific<br class="">


issues you encounter (best to file tickets) and general experiences<br class="">


with the new version, including in particular any observations about<br class="">


performance (best to send to this list).<br class="">


</blockquote>


<div class=""><br class="">


</div>


<div class="">Have this running on a few clusters now, and so far it's been really good. &nbsp;This graph shows how stable it has been on one of our clusters.</div>


<div class=""><br class="">


</div>


<div class=""><img apple-inline="yes" id="D90A83AC-B43C-47A8-B2D7-EF55449860D0" src="cid:57EB6001-3898-4F59-A4D9-BF080307A62F@home" class=""></div>


<div class=""><br class="">


</div>


<div class="">On that cluster's manager node we were seeing random CPU&#43;traffic&#43;memory spikes on one of the proxies, which would eventually be killed by the OOM killer. Then it would restart and get killed again shortly after that. &nbsp;A larger cluster would see the same spikes, but it had 4x the RAM and wouldn't OOM.</div>


<div class=""><br class="">


</div>


<div class="">Since switching to the Broker version around 5/25, that has completely stopped. &nbsp;The base CPU usage is a bit higher, but all the random spikes are gone. &nbsp;The base memory usage is also lower.</div>


<div class=""><br class="">


</div>


<div class="">I could never figure out what was causing the problem, and it's possible that &amp;synchronized not doing anything anymore is why it's better now. &nbsp;I'm mostly using &amp;synchronized for syncing input files across all the workers, and one of them does have 300k entries in it. &nbsp;That file is fairly constant though, with only a few thousand changes every 5 minutes and nothing that should use 20G of RAM.</div>


<div class=""><br class="">


</div>


<div class="">I still need to replace all of our uses of &amp;synchronized. The config framework may work for most cases once the cluster bits are done, but probably not for syncing the 300k item set.</div>


<div class=""><br class="">


</div>


<div class="">Another GREAT thing I noticed that we may want to add to NEWS is that it looks like the 'file descriptors used per worker' is down from 5 (1 socket &#43; 2&#43;2 from pipes) to just 1 socket (no more flares?). &nbsp;This means that even though select() is


 still not gone, the limitation of ~175 workers for 2.5.x will go away and people would be able to run 500&#43; worker clusters if they wanted to.</div>
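<div class="">As a rough sanity check on those numbers, here is a minimal sketch of the arithmetic. It assumes select() is capped at FD_SETSIZE = 1024 descriptors (the usual value on Linux, but an assumption here), and the reserved_fds parameter is purely hypothetical: it stands in for whatever descriptors the process uses for logs, other sockets and so on. The value 150 is only my guess at what would make the ~175 figure work out, not something measured.</div>

```python
# Sketch only: estimates how many workers fit under a select() descriptor cap.
# FD_SETSIZE = 1024 is an assumed (typical Linux) value, not taken from Bro.
FD_SETSIZE = 1024


def max_workers(fds_per_worker, reserved_fds=0, fd_setsize=FD_SETSIZE):
    """Upper bound on workers whose descriptors all fit below fd_setsize.

    reserved_fds is a hypothetical allowance for descriptors the process
    needs for anything other than worker connections.
    """
    usable = max(fd_setsize - reserved_fds, 0)
    return usable // fds_per_worker


if __name__ == "__main__":
    print(max_workers(5))                    # 2.5.x: 1 socket + 4 pipe fds
    print(max_workers(5, reserved_fds=150))  # same, with a guessed overhead
    print(max_workers(1))                    # Broker version: 1 socket
```

<div class="">With 5 descriptors per worker the hard ceiling is 204 workers, and a guessed overhead of 150 descriptors brings that down to 174, which is in line with the ~175 seen in practice. With 1 descriptor per worker the ceiling moves to roughly 1000, so 500&#43; workers fits comfortably.</div>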


<div class=""><br class="">


</div>




<br class="">


—&nbsp;<br class="">


Justin Azoff<br class="">


<br class="">


</body>


</html>