[Bro] Sizing a Bro Cluster [was out of memory after a couple days]

Gary Faulkner gary at doit.wisc.edu
Sat Dec 7 11:08:32 PST 2013


Bob,

Thank you. I'll try out your suggestions and monitor performance.

Regards,
Gary

On 12/6/2013 9:12 PM, Bob wrote:
> Twelve per box would be my plan.  And to be honest, once you lower the number of workers, you could probably reduce the number of proxies as well.  I'm running 3 for a cluster with 72 active workers, so you could probably do 24 with just one (and you could even put it on the non-manager box to maintain some symmetry).  As always though, feel free to experiment with that to find what works best for you.
>
> Bob
>
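A rough way to turn the proxy numbers in the message above into a starting point: 3 proxies were reported stable for 72 workers, which works out to about 24 workers per proxy. The sketch below assumes that ratio scales linearly, which the thread does not claim; the function name and the fixed ratio are illustrative only, and the advice to experiment still applies.

```python
import math

# Ratio reported in the thread: 3 proxies were stable for 72 workers.
# Treating this as a fixed workers-per-proxy ratio is an assumption.
WORKERS_PER_PROXY = 72 / 3  # 24.0


def estimate_proxies(total_workers, workers_per_proxy=WORKERS_PER_PROXY):
    """Return a starting proxy count for a cluster; never less than one."""
    return max(1, math.ceil(total_workers / workers_per_proxy))


if __name__ == "__main__":
    for workers in (24, 40, 72):
        print(f"{workers} workers -> start with {estimate_proxies(workers)} proxies")
```

For the 24-worker layout discussed above this gives one proxy; for the original 40-worker layout it gives two.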
> Gary Faulkner <gary at doit.wisc.edu> wrote:
>> Bob,
>>
>> So, probably closer to 10-12 workers per host in my case, since the
>> proxies and manager are there too? It was suggested to me early on to
>> try to acquire another box so the manager and proxies could be separated
>> from the workers, but I don't have one quite yet, so I've been trying to
>> make it work as is. It didn't seem like the worker child processes needed
>> much CPU time, so I thought I could push the worker count higher. It also
>> seemed like I got less loss according to broctl netstats, but others have
>> suggested that the broctl netstats command may not be the most reliable
>> way to judge that. I ended up at 4 proxies mostly because two didn't seem
>> stable and I like symmetry, so I jumped straight to 4.
>>
>> Thanks again,
>> Gary
>>
>> On 12/6/2013 7:57 PM, Bob wrote:
>>> Gary,
>>>
>>> Ah, my apologies for my misreading of your specs.  Two hosts seeing
>>> 2-4Gb/s each should be just fine.
>>>
>>> As far as my recommendation, I definitely meant real cores.  I find,
>>> based on my own experimentation, that taking your number of workers
>>> higher than the number of physical cores in the box tends to hurt the
>>> performance instead of helping.  In the case where a box was also doing
>>> manager/proxy duty, I'd subtract a physical core for each of those as
>>> well (I've broken those out into a separate box to avoid that issue on
>>> mine).  I don't recall seeing anything from Vlad suggesting going
>>> higher, but I can't rule out the possibility that I missed it.
>>>
>>> Also to answer a question that I overlooked earlier, I'm running 3
>>> proxies on that workload.  I got to that number by just increasing by
>>> one until it stabilized.
>>>
>>> Bob
>>>
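The rule of thumb in the message above (at most one worker per physical core, minus one core for each manager or proxy process on the same box) can be sketched as below. The function and parameter names are made up for illustration, and `reserved_cores` models the optional headroom for system tasks that Bob mentions in his first reply; none of this is a broctl feature.

```python
def suggest_workers(physical_cores, manager_procs=0, proxy_procs=0, reserved_cores=0):
    """Upper bound on workers for one host under the one-per-physical-core rule.

    Hypothetical helper: subtracts one core per manager/proxy process and
    any cores deliberately kept free for the operating system.
    """
    workers = physical_cores - manager_procs - proxy_procs - reserved_cores
    if workers < 1:
        raise ValueError("no cores left for workers with these settings")
    return workers


if __name__ == "__main__":
    # Host with 16 physical cores that also runs the manager and 2 proxies.
    print(suggest_workers(16, manager_procs=1, proxy_procs=2))
    # Same host, additionally keeping 2 cores free for system tasks.
    print(suggest_workers(16, manager_procs=1, proxy_procs=2, reserved_cores=2))
```

With 16 physical cores, a manager and two proxies, this gives 13 workers, or 11 with two cores held back, compared with the 20 workers per host in the original configuration.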
>>> Gary Faulkner <gary at doit.wisc.edu> wrote:
>>>> Bob,
>>>>
>>>> Thank you for the feedback. To clarify, I am currently using two
>>>> physical hosts clustered together (using broctl), so each box ends up
>>>> with 64G of RAM and 16 cores/32 threads, plus Intel ixgbe 10G cards with
>>>> pf_ring/DNA/libzero for distributing packets on the host. Each physical
>>>> host then sees between 2-4Gbps and has 20 workers + 2 proxies. I recall
>>>> reading a blog entry by Martin Holste where he mentioned allocating only
>>>> half as many workers as you have logical cores/threads, but I also seem
>>>> to recall others (Vlad G.) pushing nearly as many workers as logical
>>>> cores, though I could have read too much into it. Are you referring to
>>>> physical cores or logical cores/threads? If it is the former, I think
>>>> what you are saying is in line with what Martin suggested, although I
>>>> had hoped I could push the worker count a bit higher based on what I
>>>> thought I had read elsewhere.
>>>>
>>>> Regards,
>>>> Gary
>>>>
>>>> On 12/6/2013 7:08 PM, Bob wrote:
>>>>> Gary,
>>>>>
>>>>> Realistically, for that load I would recommend looking into a
>>>>> cluster.  My personal sizing criterion is 4-5Gb/s max per box.  That's
>>>>> for a box that's roughly the same as yours and that uses up about
>>>>> 40-50GB of RAM per box (although for as cheap as it is, I recommend
>>>>> always guessing high on the RAM).  For this size of box, you can
>>>>> probably improve your performance by reducing the number of workers
>>>>> (one per real core is a good benchmark).  I am conservative, so I like
>>>>> to keep a couple of cores free for system tasks to ensure reliable
>>>>> performance (setting the number of workers to something like 12 or
>>>>> 14), but that's up to you.
>>>>>
>>>>> As a caveat, the aforementioned recommendations assume that you're
>>>>> using a network card that's designed to do this work (like an Intel
>>>>> ixgbe card with pf_ring or a Myricom).  If your card can't bypass the
>>>>> kernel's interfaces, then you're going to need a lot more hardware to
>>>>> get the same performance because you're spending CPU time shoving the
>>>>> packets through the kernel instead of just accessing them directly on
>>>>> the NIC.
>>>>>
>>>>> That's my two cents.
>>>>>
>>>>> Bob
>>>>>
>>>>> Gary Faulkner <gary at doit.wisc.edu> wrote:
>>>>>> I've had some proxy crashes in the past and it was suggested that I
>>>>>> increase my number of proxies -- which I did until my environment
>>>>>> appeared stable for about a week. After being stable for about a week
>>>>>> I started to run out of memory, and in subsequent restarts have been
>>>>>> running out of memory after about 24 hours of operation, typically
>>>>>> during non-peak times (50% of normal traffic). Naturally I'm wondering
>>>>>> if I'm just doing it wrong and if my set-up is appropriately sized and
>>>>>> configured to handle the load I'm asking it to deal with.
>>>>>>
>>>>>> I think I've seen folks on the list that were running Bro on similar
>>>>>> hardware that might be able to tell me if my configuration is anything
>>>>>> close to what works for them. I'm also curious how other folks
>>>>>> determine how many proxies they need, how many workers per host etc.
>>>>>>
>>>>>> I'm mostly running Bro 2.2 stock with default scripts, and only minor
>>>>>> edits to local.bro to test out email notices. I'm only using these
>>>>>> systems for Bro, although they were originally from another project so
>>>>>> they weren't necessarily ordered with Bro specs in mind.
>>>>>>
>>>>>> Here's how I've got things allocated currently:
>>>>>>
>>>>>> Bro-1 Host:
>>>>>> 2x Xeon E5-2670 at 2.6GHz (32 combined logical cores / 16 physical)
>>>>>> 64G RAM
>>>>>> manager
>>>>>> 2 proxies
>>>>>> 20 workers
>>>>>> 2-4Gbps of traffic
>>>>>>
>>>>>> Bro-2 Host:
>>>>>> 2x Xeon E5-2670 at 2.6GHz (32 combined logical cores / 16 physical)
>>>>>> 64G RAM
>>>>>> 2 proxies
>>>>>> 20 workers
>>>>>> 2-4Gbps of traffic
>>>>>>
>>>>>> The following is a relatively light traffic load (late on a Friday)
>>>>>> for my install (4Gbps vs 8Gbps):
>>>>>>
>>>>>> bro-1 $ ./broctl capstats
>>>>>>
>>>>>> Interface            kpps       mbps       (10s average)
>>>>>> ------------------------------
>>>>>> 192.168.0.10/dnacluster:21 338.6      2327.4
>>>>>> 192.168.0.11/dnacluster:22 324.8      2264.7
>>>>>>
>>>>>> Total                663.4      4592.1
>>>>>>
>>>>>> bro-1 $ ./broctl top
>>>>>> Name        Type    Node         Pid   Proc   VSize  Rss  Cpu Cmd
>>>>>> manager     manager 192.168.0.10 14816 parent    2G 736M  88% bro
>>>>>> manager     manager 192.168.0.10 14817 child   169M  93M  44% bro
>>>>>> proxy-1     proxy   192.168.0.10 14863 child   102M  26M  23% bro
>>>>>> proxy-1     proxy   192.168.0.10 14860 parent    1G   1G   3% bro
>>>>>> proxy-2     proxy   192.168.0.10 14862 child   102M  28M  27% bro
>>>>>> proxy-2     proxy   192.168.0.10 14861 parent    1G   1G   3% bro
>>>>>> proxy-3     proxy   192.168.0.11 28900 child   102M  46M  20% bro
>>>>>> proxy-3     proxy   192.168.0.11 28898 parent    1G   1G   1% bro
>>>>>> proxy-4     proxy   192.168.0.11 28899 child   102M  45M  21% bro
>>>>>> proxy-4     proxy   192.168.0.11 28897 parent    1G   1G   1% bro
>>>>>> worker-1-1  worker  192.168.0.10 15228 parent    2G   2G  65% bro
>>>>>> worker-1-1  worker  192.168.0.10 15398 child   514M  11M  10% bro
>>>>>> worker-1-10 worker  192.168.0.10 15230 parent    2G   2G  53% bro
>>>>>> worker-1-10 worker  192.168.0.10 15407 child   514M  12M   8% bro
>>>>>> worker-1-11 worker  192.168.0.10 15234 parent    2G   2G  78% bro
>>>>>> worker-1-11 worker  192.168.0.10 15286 child   514M   9M  11% bro
>>>>>> worker-1-12 worker  192.168.0.10 15235 parent    2G   2G  67% bro
>>>>>> worker-1-12 worker  192.168.0.10 15267 child   514M   8M  12% bro
>>>>>> worker-1-13 worker  192.168.0.10 15237 parent    2G   2G  82% bro
>>>>>> worker-1-13 worker  192.168.0.10 15392 child   514M   9M  12% bro
>>>>>> worker-1-14 worker  192.168.0.10 15238 parent    2G   2G  43% bro
>>>>>> worker-1-14 worker  192.168.0.10 15264 child   514M  11M   8% bro
>>>>>> worker-1-15 worker  192.168.0.10 15240 parent    2G   2G  76% bro
>>>>>> worker-1-15 worker  192.168.0.10 15300 child   514M   7M   9% bro
>>>>>> worker-1-16 worker  192.168.0.10 15243 parent    2G   2G  94% bro
>>>>>> worker-1-16 worker  192.168.0.10 15404 child   514M  11M   9% bro
>>>>>> worker-1-17 worker  192.168.0.10 15244 parent    2G   2G  67% bro
>>>>>> worker-1-17 worker  192.168.0.10 15383 child   514M   8M   8% bro
>>>>>> worker-1-18 worker  192.168.0.10 15246 parent    2G   2G  80% bro
>>>>>> worker-1-18 worker  192.168.0.10 15372 child   514M  12M  11% bro
>>>>>> worker-1-19 worker  192.168.0.10 15248 parent    2G   2G  76% bro
>>>>>> worker-1-19 worker  192.168.0.10 15376 child   514M   8M   8% bro
>>>>>> worker-1-2  worker  192.168.0.10 15251 parent    2G   2G  83% bro
>>>>>> worker-1-2  worker  192.168.0.10 15414 child   514M  11M  10% bro
>>>>>> worker-1-20 worker  192.168.0.10 15254 parent    2G   2G  86% bro
>>>>>> worker-1-20 worker  192.168.0.10 15417 child   514M  12M  11% bro
>>>>>> worker-1-3  worker  192.168.0.10 15253 parent    2G   2G  55% bro
>>>>>> worker-1-3  worker  192.168.0.10 15375 child   514M   8M  12% bro
>>>>>> worker-1-4  worker  192.168.0.10 15256 parent    2G   2G  87% bro
>>>>>> worker-1-4  worker  192.168.0.10 15388 child   515M   8M  10% bro
>>>>>> worker-1-5  worker  192.168.0.10 15257 parent    2G   2G  58% bro
>>>>>> worker-1-5  worker  192.168.0.10 15395 child   515M  11M  10% bro
>>>>>> worker-1-6  worker  192.168.0.10 15258 parent    2G   2G  96% bro
>>>>>> worker-1-6  worker  192.168.0.10 15394 child   514M  11M   8% bro
>>>>>> worker-1-7  worker  192.168.0.10 15259 parent    2G   2G  65% bro
>>>>>> worker-1-7  worker  192.168.0.10 15413 child   514M  12M   6% bro
>>>>>> worker-1-8  worker  192.168.0.10 15260 parent    2G   2G  99% bro
>>>>>> worker-1-8  worker  192.168.0.10 15401 child   514M  11M   8% bro
>>>>>> worker-1-9  worker  192.168.0.10 15261 parent    2G   2G  61% bro
>>>>>> worker-1-9  worker  192.168.0.10 15408 child   514M  11M   8% bro
>>>>>> worker-2-1  worker  192.168.0.11 29961 parent    2G   2G  85% bro
>>>>>> worker-2-1  worker  192.168.0.11 29984 child   514M  31M   9% bro
>>>>>> worker-2-10 worker  192.168.0.11 29959 parent    2G   2G  52% bro
>>>>>> worker-2-10 worker  192.168.0.11 30085 child   515M  31M   8% bro
>>>>>> worker-2-11 worker  192.168.0.11 29960 parent    2G   2G  96% bro
>>>>>> worker-2-11 worker  192.168.0.11 30112 child   514M  31M  10% bro
>>>>>> worker-2-12 worker  192.168.0.11 29973 parent    2G   2G  54% bro
>>>>>> worker-2-12 worker  192.168.0.11 30082 child   514M  30M   8% bro
>>>>>> worker-2-13 worker  192.168.0.11 29967 parent    2G   2G  93% bro
>>>>>> worker-2-13 worker  192.168.0.11 30111 child   514M  31M  10% bro
>>>>>> worker-2-14 worker  192.168.0.11 29962 parent    2G   2G 100% bro
>>>>>> worker-2-14 worker  192.168.0.11 30076 child   514M  30M   8% bro
>>>>>> worker-2-15 worker  192.168.0.11 29975 parent    2G   2G  55% bro
>>>>>> worker-2-15 worker  192.168.0.11 30138 child   514M  31M  10% bro
>>>>>> worker-2-16 worker  192.168.0.11 29965 parent    2G   2G  85% bro
>>>>>> worker-2-16 worker  192.168.0.11 29994 child   514M  31M   8% bro
>>>>>> worker-2-17 worker  192.168.0.11 29968 parent    2G   2G  76% bro
>>>>>> worker-2-17 worker  192.168.0.11 30097 child   514M  31M   8% bro
>>>>>> worker-2-18 worker  192.168.0.11 29972 parent    2G   2G  95% bro
>>>>>> worker-2-18 worker  192.168.0.11 30115 child   514M  30M  10% bro
>>>>>> worker-2-19 worker  192.168.0.11 29964 parent    2G   2G  68% bro
>>>>>> worker-2-19 worker  192.168.0.11 30092 child   514M  31M   7% bro
>>>>>> worker-2-2  worker  192.168.0.11 29974 parent    2G   2G  51% bro
>>>>>> worker-2-2  worker  192.168.0.11 30133 child   514M  31M   7% bro
>>>>>> worker-2-20 worker  192.168.0.11 29966 parent    2G   2G  59% bro
>>>>>> worker-2-20 worker  192.168.0.11 29981 child   514M  30M  10% bro
>>>>>> worker-2-3  worker  192.168.0.11 29969 parent    2G   2G  95% bro
>>>>>> worker-2-3  worker  192.168.0.11 30095 child   514M  31M   8% bro
>>>>>> worker-2-4  worker  192.168.0.11 29970 parent    2G   2G  95% bro
>>>>>> worker-2-4  worker  192.168.0.11 30137 child   514M  30M   8% bro
>>>>>> worker-2-5  worker  192.168.0.11 29977 parent    2G   2G  84% bro
>>>>>> worker-2-5  worker  192.168.0.11 30100 child   514M  31M  10% bro
>>>>>> worker-2-6  worker  192.168.0.11 29978 parent    2G   2G  73% bro
>>>>>> worker-2-6  worker  192.168.0.11 29990 child   514M  30M   8% bro
>>>>>> worker-2-7  worker  192.168.0.11 29976 parent    2G   2G  76% bro
>>>>>> worker-2-7  worker  192.168.0.11 30081 child   514M  31M  10% bro
>>>>>> worker-2-8  worker  192.168.0.11 29963 parent    2G   2G  57% bro
>>>>>> worker-2-8  worker  192.168.0.11 29987 child   514M  30M   8% bro
>>>>>> worker-2-9  worker  192.168.0.11 29971 parent    2G   2G  52% bro
>>>>>> worker-2-9  worker  192.168.0.11 30096 child   514M  31M  10% bro
>>>>>>
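One way to relate the process listing above to the memory problem is to total the Rss column per node. The following is a minimal sketch, not part of broctl: it assumes each process row sits on a single line with the nine columns shown in the header, that the size suffixes are binary units, and it uses only a handful of rows from the listing as sample input. Because broctl top rounds sizes (for example "2G"), the totals are coarse.

```python
from collections import defaultdict

# Multipliers to megabytes; treating K/M/G as binary units is an assumption.
UNITS = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def size_to_mb(text):
    """Convert a rounded size such as '736M' or '2G' to megabytes."""
    return float(text[:-1]) * UNITS[text[-1].upper()]


def rss_by_node(top_output):
    """Sum the Rss column per node for rows shaped like the listing above."""
    totals = defaultdict(float)
    for line in top_output.splitlines():
        fields = line.split()
        # Expected columns: Name Type Node Pid Proc VSize Rss Cpu Cmd
        if len(fields) != 9 or not fields[3].isdigit():
            continue  # skips the header and anything malformed
        totals[fields[2]] += size_to_mb(fields[6])
    return dict(totals)


SAMPLE = """\
Name Type Node Pid Proc VSize Rss Cpu Cmd
manager manager 192.168.0.10 14816 parent 2G 736M 88% bro
manager manager 192.168.0.10 14817 child 169M 93M 44% bro
worker-1-1 worker 192.168.0.10 15228 parent 2G 2G 65% bro
worker-1-1 worker 192.168.0.10 15398 child 514M 11M 10% bro
worker-2-1 worker 192.168.0.11 29961 parent 2G 2G 85% bro
worker-2-1 worker 192.168.0.11 29984 child 514M 31M 9% bro
"""

if __name__ == "__main__":
    for node, mb in sorted(rss_by_node(SAMPLE).items()):
        print(f"{node}: {mb / 1024:.1f} GB resident")
```

Even without running it on the full listing, 20 worker parents at roughly 2G each account for about 40G per host, which is in line with the 44G that free reports as used once buffers and cache are excluded.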
>>>>>> bro-1 $ free -g
>>>>>>              total       used       free     shared    buffers     cached
>>>>>> Mem:            62         62          0          0          0         17
>>>>>> -/+ buffers/cache:         44         17
>>>>>> Swap:            0          0          0
>>>>>>
>>>>>> bro-2 $ free -g
>>>>>>              total       used       free     shared    buffers     cached
>>>>>> Mem:            62         45         17          0          0          1
>>>>>> -/+ buffers/cache:         44         18
>>>>>> Swap:            0          0          0
>>>>>>
>>>>>> What do you guys think?
>>>>>>
>>>>>> Regards,
>>>>>> Gary
>>>>>>
>>>>>> PS ~
>>>>>>
>>>>>> I've been reading the mailing list archives and it seems that folks
>>>>>> with the older Xeons with higher clock rates (around 3.4GHz), but
>>>>>> fewer cores, were able to handle upwards of 400-500Mbps per worker
>>>>>> process. I've also seen it hinted, I think by Vlad G., that he was
>>>>>> fitting in 28 workers on boxes with similar core counts to my own, but
>>>>>> slightly faster processors. Based on some of those remarks in previous
>>>>>> threads I was thinking I should be able to handle a little over
>>>>>> 300Mbps per process with these processors, but I've only had the
>>>>>> traffic to push about 200Mbps per worker so far.
>>>>>>
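The per-worker figures in the paragraph above are simple division, sketched here for reference. The inputs come from the thread (the 4592.1 Mbps capstats total, the 8Gbps peak, 40 workers today, and the 12 workers per host suggested later in the thread); the assumptions are that traffic is spread evenly across workers and that 1 Gbps is 1000 Mbps.

```python
def mbps_per_worker(total_mbps, workers):
    """Average load per worker, assuming perfectly even distribution."""
    if workers < 1:
        raise ValueError("need at least one worker")
    return total_mbps / workers


if __name__ == "__main__":
    scenarios = [
        ("capstats sample, 40 workers", 4592.1, 40),
        ("8Gbps peak, 40 workers", 8000.0, 40),
        ("8Gbps peak, 24 workers", 8000.0, 24),
    ]
    for label, mbps, workers in scenarios:
        print(f"{label}: {mbps_per_worker(mbps, workers):.0f} Mbps per worker")
```

At the 8Gbps peak, 40 workers see about 200Mbps each, matching the figure quoted above, while 24 workers would each need to handle roughly 333Mbps, slightly above the "little over 300Mbps" estimate.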
>>>>>> I know some folks also like to put the manager and possibly the
>>>>>> proxies on separate boxes from the workers, but I haven't gotten a
>>>>>> good sense as to what kind of workload a proxy can handle. As far as
>>>>>> proxies go, I've mostly seen comments such as "I probably have way
>>>>>> more proxies than I need" or "Just keep adding proxies until they stop
>>>>>> crashing".  I don't currently have a spare box for the manager and
>>>>>> proxy, but would be curious to know if folks feel it is a necessity.
>>>>>> My observations on my own setup are that my Bro workers typically are
>>>>>> using 99% of a logical core at peak network times, and my manager
>>>>>> 150-175% (multi-threaded). My workers seem to use about 1-2G of memory
>>>>>> normally.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bro mailing list
>>>>>> bro at bro-ids.org
>>>>>> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro
>>>
>


