[Bro] Sizing a Bro Cluster [was out of memory after a couple days]

Gary Faulkner gary at doit.wisc.edu
Fri Dec 6 18:40:33 PST 2013


Bob,

So, probably closer to 10-12 workers per host in my case, since the
proxies and manager are there? It was suggested to me early on to try to
acquire another box so the manager and proxies could be separated from
the workers, but I don't have one quite yet, so I've been trying to make
it work as is. The worker child processes didn't seem to need much CPU
time, so I thought I could push the worker count higher, and it also
seemed like I got less loss per broctl netstats. Others have suggested,
though, that broctl netstats may not be the most reliable way to judge
that. I ended up at 4 proxies mostly because two didn't seem stable and
I like symmetry, so I jumped straight to 4.
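
A rough sketch of the arithmetic behind that 10-12 figure, assuming Bob's
rule of thumb quoted below: one worker per physical core, minus one core
for the manager and for each proxy on the same box, minus a couple of cores
kept free for system tasks. The function name and the default reserve of
two cores are only illustrative and are not part of broctl.

```python
def suggested_workers(physical_cores, managers=0, proxies=0, reserved_for_system=2):
    """Estimate workers per host from physical cores (rule of thumb only).

    One core is subtracted per manager and per proxy on the host, plus a
    small reserve for the operating system. Never returns a negative count.
    """
    workers = physical_cores - managers - proxies - reserved_for_system
    return max(workers, 0)


# Host layout taken from the thread: 16 physical cores per box,
# 2 proxies on each box, and the manager on bro-1 only.
bro1_workers = suggested_workers(16, managers=1, proxies=2)
bro2_workers = suggested_workers(16, managers=0, proxies=2)

print("bro-1:", bro1_workers)
print("bro-2:", bro2_workers)
```

With 16 physical cores per box this comes out to 11 workers on the box
that also runs the manager and 12 on the other one, compared with the 20
per box currently configured.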

Thanks again,
Gary

On 12/6/2013 7:57 PM, Bob wrote:
> Gary,
>
> Ah, my apologies for my misreading of your specs.  Two hosts seeing 2-4Gb/s each should be just fine.
>
> As far as my recommendation, I definitely meant real cores.  I find, based on my own experimentation, that taking your number of workers higher than the number of physical cores in the box tends to hurt the performance instead of helping.  In the case where a box was also doing manager/proxy duty, I'd subtract a physical core for each of those as well (I've broken those out into a separate box to avoid that issue on mine).  I don't recall seeing anything from Vlad suggesting going higher, but I can't rule out the possibility that I missed it.
>
> Also to answer a question that I overlooked earlier, I'm running 3 proxies on that workload.  I got to that number by just increasing by one until it stabilized.
>
> Bob
>
> Gary Faulkner <gary at doit.wisc.edu> wrote:
>> Bob,
>>
>> Thank you for the feedback. To clarify, I am currently using two physical
>> hosts clustered together (using broctl), so each box ends up with 64G of
>> RAM and 16 cores/32 threads and Intel IXGBE 10G cards +
>> pf_ring/DNA/libzero for distributing packets on the host. Each physical
>> host then sees between 2-4Gbps and has 20 workers + 2 proxies. I recall
>> reading a blog entry by Martin Holste where he mentioned allocating only
>> half as many workers as you had logical cores/threads, but also seem to
>> recall others (Vlad G.) pushing nearly as many workers as logical cores,
>> but I could have read too much into it. Are you referring to physical
>> cores or logical cores/threads? If it is the former, I think what you are
>> saying is in line with what Martin suggested, although I had hoped I
>> could push the worker count a bit higher based on what I thought I had
>> read elsewhere.
>>
>> Regards,
>> Gary
>>
>>   On 12/6/2013 7:08 PM, Bob wrote:
>>> Gary,
>>>
>>> Realistically, for that load I would recommend looking into a cluster.
>>> My personal sizing criterion is 4-5Gb/s max per box. That's for a box
>>> that's roughly the same as yours and that uses up about 40-50GB of RAM
>>> per box (although for as cheap as it is, I recommend to always guess
>>> high on the RAM). For this size of box, you can probably improve your
>>> performance by reducing the number of workers (one per real core is a
>>> good benchmark). I am conservative, so I like to keep a couple of cores
>>> free for system tasks to ensure reliable performance (setting the number
>>> of workers to something like 12 or 14), but that's up to you.
>>>
>>> As a caveat, the aforementioned recommendations assume that you're using
>>> a network card that's designed to do this work (like an Intel ixgbe card
>>> with pf_ring or a Myricom). If your card can't bypass the kernel's
>>> interfaces, then you're going to need a lot more hardware to get the
>>> same performance because you're spending CPU time shoving the packets
>>> through the kernel instead of just accessing them directly on the NIC.
>>>
>>> That's my two cents.
>>>
>>> Bob
>>>
>>> Gary Faulkner <gary at doit.wisc.edu> wrote:
>>>> I've had some proxy crashes in the past and it was suggested that I
>>>> increase my number of proxies -- which I did until my environment
>>>> appeared stable for about a week. After being stable for about a week I
>>>> started to run out of memory, and in subsequent restarts have been
>>>> running out of memory after about 24 hours of operation, typically
>>>> during non-peak times (50% of normal traffic). Naturally I'm wondering
>>>> if I'm just doing it wrong and if my set-up is appropriately sized and
>>>> configured to handle the load I'm asking it to deal with.
>>>>
>>>> I think I've seen folks on the list that were running Bro on similar
>>>> hardware that might be able to tell me if my configuration is anything
>>>> close to what works for them. I'm also curious how other folks determine
>>>> how many proxies they need, how many workers per host, etc.
>>>>
>>>> I'm mostly running Bro 2.2 stock with default scripts, and only minor
>>>> edits to local.bro to test out email notices. I'm only using these
>>>> systems for Bro, although they were originally from another project, so
>>>> they weren't necessarily ordered with Bro specs in mind.
>>>>
>>>> Here's how I've got things allocated currently:
>>>>
>>>> Bro-1 Host:
>>>> 2ea Xeon E5-2670 at 2.6GHz (32 combined Logical Cores / 16 Physical)
>>>> 64G RAM
>>>> manager
>>>> 2 proxies
>>>> 20 workers
>>>> 2-4Gbps of traffic
>>>>
>>>> Bro-2 Host:
>>>> 2ea Xeon E5-2670 at 2.6GHz (32 combined Logical Cores / 16 Physical)
>>>> 64G RAM
>>>> 2 proxies
>>>> 20 workers
>>>> 2-4Gbps of traffic
>>>>
>>>> The following is a relatively light traffic load (late on a Friday) for
>>>> my install (4Gbps vs 8Gbps):
>>>>
>>>> bro-1 $ ./broctl capstats
>>>>
>>>> Interface            kpps       mbps       (10s average)
>>>> ------------------------------
>>>> 192.168.0.10/dnacluster:21 338.6      2327.4
>>>> 192.168.0.11/dnacluster:22 324.8      2264.7
>>>>
>>>> Total                663.4      4592.1
>>>>
>>>> bro-1 $ ./broctl top
>>>> Name        Type    Node         Pid   Proc   VSize Rss  Cpu  Cmd
>>>> manager     manager 192.168.0.10 14816 parent 2G    736M 88%  bro
>>>> manager     manager 192.168.0.10 14817 child  169M  93M  44%  bro
>>>> proxy-1     proxy   192.168.0.10 14863 child  102M  26M  23%  bro
>>>> proxy-1     proxy   192.168.0.10 14860 parent 1G    1G   3%   bro
>>>> proxy-2     proxy   192.168.0.10 14862 child  102M  28M  27%  bro
>>>> proxy-2     proxy   192.168.0.10 14861 parent 1G    1G   3%   bro
>>>> proxy-3     proxy   192.168.0.11 28900 child  102M  46M  20%  bro
>>>> proxy-3     proxy   192.168.0.11 28898 parent 1G    1G   1%   bro
>>>> proxy-4     proxy   192.168.0.11 28899 child  102M  45M  21%  bro
>>>> proxy-4     proxy   192.168.0.11 28897 parent 1G    1G   1%   bro
>>>> worker-1-1  worker  192.168.0.10 15228 parent 2G    2G   65%  bro
>>>> worker-1-1  worker  192.168.0.10 15398 child  514M  11M  10%  bro
>>>> worker-1-10 worker  192.168.0.10 15230 parent 2G    2G   53%  bro
>>>> worker-1-10 worker  192.168.0.10 15407 child  514M  12M  8%   bro
>>>> worker-1-11 worker  192.168.0.10 15234 parent 2G    2G   78%  bro
>>>> worker-1-11 worker  192.168.0.10 15286 child  514M  9M   11%  bro
>>>> worker-1-12 worker  192.168.0.10 15235 parent 2G    2G   67%  bro
>>>> worker-1-12 worker  192.168.0.10 15267 child  514M  8M   12%  bro
>>>> worker-1-13 worker  192.168.0.10 15237 parent 2G    2G   82%  bro
>>>> worker-1-13 worker  192.168.0.10 15392 child  514M  9M   12%  bro
>>>> worker-1-14 worker  192.168.0.10 15238 parent 2G    2G   43%  bro
>>>> worker-1-14 worker  192.168.0.10 15264 child  514M  11M  8%   bro
>>>> worker-1-15 worker  192.168.0.10 15240 parent 2G    2G   76%  bro
>>>> worker-1-15 worker  192.168.0.10 15300 child  514M  7M   9%   bro
>>>> worker-1-16 worker  192.168.0.10 15243 parent 2G    2G   94%  bro
>>>> worker-1-16 worker  192.168.0.10 15404 child  514M  11M  9%   bro
>>>> worker-1-17 worker  192.168.0.10 15244 parent 2G    2G   67%  bro
>>>> worker-1-17 worker  192.168.0.10 15383 child  514M  8M   8%   bro
>>>> worker-1-18 worker  192.168.0.10 15246 parent 2G    2G   80%  bro
>>>> worker-1-18 worker  192.168.0.10 15372 child  514M  12M  11%  bro
>>>> worker-1-19 worker  192.168.0.10 15248 parent 2G    2G   76%  bro
>>>> worker-1-19 worker  192.168.0.10 15376 child  514M  8M   8%   bro
>>>> worker-1-2  worker  192.168.0.10 15251 parent 2G    2G   83%  bro
>>>> worker-1-2  worker  192.168.0.10 15414 child  514M  11M  10%  bro
>>>> worker-1-20 worker  192.168.0.10 15254 parent 2G    2G   86%  bro
>>>> worker-1-20 worker  192.168.0.10 15417 child  514M  12M  11%  bro
>>>> worker-1-3  worker  192.168.0.10 15253 parent 2G    2G   55%  bro
>>>> worker-1-3  worker  192.168.0.10 15375 child  514M  8M   12%  bro
>>>> worker-1-4  worker  192.168.0.10 15256 parent 2G    2G   87%  bro
>>>> worker-1-4  worker  192.168.0.10 15388 child  515M  8M   10%  bro
>>>> worker-1-5  worker  192.168.0.10 15257 parent 2G    2G   58%  bro
>>>> worker-1-5  worker  192.168.0.10 15395 child  515M  11M  10%  bro
>>>> worker-1-6  worker  192.168.0.10 15258 parent 2G    2G   96%  bro
>>>> worker-1-6  worker  192.168.0.10 15394 child  514M  11M  8%   bro
>>>> worker-1-7  worker  192.168.0.10 15259 parent 2G    2G   65%  bro
>>>> worker-1-7  worker  192.168.0.10 15413 child  514M  12M  6%   bro
>>>> worker-1-8  worker  192.168.0.10 15260 parent 2G    2G   99%  bro
>>>> worker-1-8  worker  192.168.0.10 15401 child  514M  11M  8%   bro
>>>> worker-1-9  worker  192.168.0.10 15261 parent 2G    2G   61%  bro
>>>> worker-1-9  worker  192.168.0.10 15408 child  514M  11M  8%   bro
>>>> worker-2-1  worker  192.168.0.11 29961 parent 2G    2G   85%  bro
>>>> worker-2-1  worker  192.168.0.11 29984 child  514M  31M  9%   bro
>>>> worker-2-10 worker  192.168.0.11 29959 parent 2G    2G   52%  bro
>>>> worker-2-10 worker  192.168.0.11 30085 child  515M  31M  8%   bro
>>>> worker-2-11 worker  192.168.0.11 29960 parent 2G    2G   96%  bro
>>>> worker-2-11 worker  192.168.0.11 30112 child  514M  31M  10%  bro
>>>> worker-2-12 worker  192.168.0.11 29973 parent 2G    2G   54%  bro
>>>> worker-2-12 worker  192.168.0.11 30082 child  514M  30M  8%   bro
>>>> worker-2-13 worker  192.168.0.11 29967 parent 2G    2G   93%  bro
>>>> worker-2-13 worker  192.168.0.11 30111 child  514M  31M  10%  bro
>>>> worker-2-14 worker  192.168.0.11 29962 parent 2G    2G   100% bro
>>>> worker-2-14 worker  192.168.0.11 30076 child  514M  30M  8%   bro
>>>> worker-2-15 worker  192.168.0.11 29975 parent 2G    2G   55%  bro
>>>> worker-2-15 worker  192.168.0.11 30138 child  514M  31M  10%  bro
>>>> worker-2-16 worker  192.168.0.11 29965 parent 2G    2G   85%  bro
>>>> worker-2-16 worker  192.168.0.11 29994 child  514M  31M  8%   bro
>>>> worker-2-17 worker  192.168.0.11 29968 parent 2G    2G   76%  bro
>>>> worker-2-17 worker  192.168.0.11 30097 child  514M  31M  8%   bro
>>>> worker-2-18 worker  192.168.0.11 29972 parent 2G    2G   95%  bro
>>>> worker-2-18 worker  192.168.0.11 30115 child  514M  30M  10%  bro
>>>> worker-2-19 worker  192.168.0.11 29964 parent 2G    2G   68%  bro
>>>> worker-2-19 worker  192.168.0.11 30092 child  514M  31M  7%   bro
>>>> worker-2-2  worker  192.168.0.11 29974 parent 2G    2G   51%  bro
>>>> worker-2-2  worker  192.168.0.11 30133 child  514M  31M  7%   bro
>>>> worker-2-20 worker  192.168.0.11 29966 parent 2G    2G   59%  bro
>>>> worker-2-20 worker  192.168.0.11 29981 child  514M  30M  10%  bro
>>>> worker-2-3  worker  192.168.0.11 29969 parent 2G    2G   95%  bro
>>>> worker-2-3  worker  192.168.0.11 30095 child  514M  31M  8%   bro
>>>> worker-2-4  worker  192.168.0.11 29970 parent 2G    2G   95%  bro
>>>> worker-2-4  worker  192.168.0.11 30137 child  514M  30M  8%   bro
>>>> worker-2-5  worker  192.168.0.11 29977 parent 2G    2G   84%  bro
>>>> worker-2-5  worker  192.168.0.11 30100 child  514M  31M  10%  bro
>>>> worker-2-6  worker  192.168.0.11 29978 parent 2G    2G   73%  bro
>>>> worker-2-6  worker  192.168.0.11 29990 child  514M  30M  8%   bro
>>>> worker-2-7  worker  192.168.0.11 29976 parent 2G    2G   76%  bro
>>>> worker-2-7  worker  192.168.0.11 30081 child  514M  31M  10%  bro
>>>> worker-2-8  worker  192.168.0.11 29963 parent 2G    2G   57%  bro
>>>> worker-2-8  worker  192.168.0.11 29987 child  514M  30M  8%   bro
>>>> worker-2-9  worker  192.168.0.11 29971 parent 2G    2G   52%  bro
>>>> worker-2-9  worker  192.168.0.11 30096 child  514M  31M  10%  bro
>>>>
>>>> bro-1 $ free -g
>>>>              total       used       free     shared    buffers     cached
>>>> Mem:            62         62          0          0          0         17
>>>> -/+ buffers/cache:         44         17
>>>> Swap:            0          0          0
>>>>
>>>> bro-2 $ free -g
>>>>              total       used       free     shared    buffers     cached
>>>> Mem:            62         45         17          0          0          1
>>>> -/+ buffers/cache:         44         18
>>>> Swap:            0          0          0
>>>>
>>>> What do you guys think?
>>>>
>>>> Regards,
>>>> Gary
>>>>
>>>> PS ~
>>>>
>>>> I've been reading the mailing list archives and it seems that folks with
>>>> the older Xeons with higher clock rates (3.4GHz-ish) but fewer cores
>>>> were able to handle upwards of 400-500Mbps per worker process. I've also
>>>> seen it hinted, I think by Vlad G., that he was fitting in 28 workers on
>>>> boxes with similar core counts to my own, but slightly faster
>>>> processors. Based on some of those remarks in previous threads I was
>>>> thinking I should be able to handle a little over 300Mbps per process
>>>> with these processors, but I've only had the traffic to push about
>>>> 200Mbps per worker so far.
>>>>
>>>> I know some folks also like to put the manager and possibly the proxies
>>>> on separate boxes from the workers, but I haven't gotten a good sense as
>>>> to what kind of workload a proxy can handle. As far as proxies, I've
>>>> mostly seen comments such as "I probably have way more proxies than I
>>>> need" or "Just keep adding proxies until they stop crashing". I don't
>>>> currently have a spare box for the manager and proxy, but would be
>>>> curious to know if folks feel it is a necessity. My observations on my
>>>> own setup are that my Bro workers typically are using 99% of a logical
>>>> core at peak network times, and my manager 150-175% (multi-threaded). My
>>>> workers seem to use about 1-2G of memory normally.


