[Bro] Sizing a Bro Cluster [was out of memory after a couple days]

Gary Faulkner gary at doit.wisc.edu
Fri Dec 6 17:37:12 PST 2013


Bob,

Thank you for the feedback. To clarify, I am currently using two physical 
hosts clustered together (using broctl), so each box has 64G of 
RAM, 16 cores/32 threads, and Intel ixgbe 10G cards with 
pf_ring/DNA/libzero for distributing packets on the host. Each physical 
host sees between 2-4Gbps and runs 20 workers + 2 proxies. I recall 
reading a blog entry by Martin Holste where he mentioned allocating only 
half as many workers as you had logical cores/threads, but I also seem to 
recall others (Vlad G.) pushing nearly as many workers as logical cores, 
though I could have read too much into it. Are you referring to physical 
cores or logical cores/threads? If it is the former, I think what you are 
saying is in line with what Martin suggested, although I had hoped I 
could push the worker count a bit higher based on what I thought I had 
read elsewhere.
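
To make the comparison concrete, here is a rough sketch of the worker-count rules of thumb mentioned in this thread (half of the logical cores, one per physical core, and physical cores minus a few held back for system tasks). The function name and the reserve_cores parameter are made up for illustration and are not part of Bro or broctl; the 2 threads per core is an assumption that matches hyper-threaded Xeons like mine.

```python
# Illustrative only: worker_counts() and reserve_cores are hypothetical
# names, not anything provided by Bro or broctl.

def worker_counts(physical_cores, threads_per_core=2, reserve_cores=2):
    """Return the worker count suggested by each rule of thumb."""
    logical_cores = physical_cores * threads_per_core
    return {
        # Half as many workers as logical cores/threads.
        "half_of_logical": logical_cores // 2,
        # One worker per physical ("real") core.
        "one_per_physical": physical_cores,
        # One per physical core, minus cores kept free for system tasks.
        "physical_minus_reserve": max(physical_cores - reserve_cores, 1),
    }

# 16 physical cores / 32 threads per host, as described above.
counts = worker_counts(physical_cores=16)
for rule, n in counts.items():
    print(f"{rule}: {n} workers")
```

With 2 threads per core, "half the logical cores" and "one per physical core" come out to the same number, which is why the two suggestions look equivalent on this hardware. Reserving 2 to 4 cores gives the 12 to 14 range.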

Regards,
Gary

  On 12/6/2013 7:08 PM, Bob wrote:
> Gary,
>
> Realistically, for that load I would recommend looking into a cluster.  My personal sizing criterion is 4-5Gb/s max per box.  That's for a box that's roughly the same as yours and that uses about 40-50GB of RAM (although, as cheap as RAM is, I recommend always guessing high).  For this size of box, you can probably improve your performance by reducing the number of workers (one per real core is a good benchmark).  I am conservative, so I like to keep a couple of cores free for system tasks to ensure reliable performance (setting the number of workers to something like 12 or 14), but that's up to you.
>
> As a caveat, the aforementioned recommendations assume that you're using a network card that's designed to do this work (like an Intel ixgbe card with pf_ring or a Myricom).  If your card can't bypass the kernel's interfaces, then you're going to need a lot more hardware to get the same performance because you're spending CPU time shoving the packets through the kernel instead of just accessing them directly on the NIC.
>
> That's my two cents.
>
> Bob
>
> Gary Faulkner <gary at doit.wisc.edu> wrote:
>> I've had some proxy crashes in the past and it was suggested that I
>> increase my number of proxies -- which I did until my environment
>> appeared stable for about a week. After that week of stability I
>> started to run out of memory, and in subsequent restarts have been
>> running out of memory after about 24 hours of operation, typically
>> during non-peak times (50% of normal traffic). Naturally I'm wondering
>> if I'm just doing it wrong and if my set-up is appropriately sized and
>> configured to handle the load I'm asking it to deal with.
>>
>> I think I've seen folks on the list that were running Bro on similar
>> hardware that might be able to tell me if my configuration is anything
>> close to what works for them. I'm also curious how other folks
>> determine how many proxies they need, how many workers per host, etc.
>>
>> I'm mostly running Bro 2.2 stock with default scripts, and only minor
>> edits to local.bro to test out email notices. I'm only using these
>> systems for Bro, although they were originally from another project so
>> they weren't necessarily ordered with Bro specs in mind.
>>
>> Here's how I've got things allocated currently:
>>
>> Bro-1 Host:
>> 2ea Xeon E5-2670 at 2.6GHz (32 combined Logical Cores / 16 Physical)
>> 64G RAM
>> manager
>> 2 proxies
>> 20 workers
>> 2-4Gbps of traffic
>>
>> Bro-2 Host:
>> 2ea Xeon E5-2670 at 2.6GHz (32 combined Logical Cores / 16 Physical)
>> 64G RAM
>> 2 proxies
>> 20 workers
>> 2-4Gbps of traffic
>>
>> The following is a relatively light traffic load (late on a Friday)
>> for my install (4Gbps vs 8Gbps):
>>
>> bro-1 $ ./broctl capstats
>>
>> Interface            kpps       mbps       (10s average)
>> ------------------------------
>> 192.168.0.10/dnacluster:21 338.6      2327.4
>> 192.168.0.11/dnacluster:22 324.8      2264.7
>>
>> Total                663.4      4592.1
>>
>> bro-1 $ ./broctl top
>> Name        Type     Node          Pid    Proc    VSize  Rss   Cpu   Cmd
>> manager     manager  192.168.0.10  14816  parent  2G     736M  88%   bro
>> manager     manager  192.168.0.10  14817  child   169M   93M   44%   bro
>> proxy-1     proxy    192.168.0.10  14863  child   102M   26M   23%   bro
>> proxy-1     proxy    192.168.0.10  14860  parent  1G     1G    3%    bro
>> proxy-2     proxy    192.168.0.10  14862  child   102M   28M   27%   bro
>> proxy-2     proxy    192.168.0.10  14861  parent  1G     1G    3%    bro
>> proxy-3     proxy    192.168.0.11  28900  child   102M   46M   20%   bro
>> proxy-3     proxy    192.168.0.11  28898  parent  1G     1G    1%    bro
>> proxy-4     proxy    192.168.0.11  28899  child   102M   45M   21%   bro
>> proxy-4     proxy    192.168.0.11  28897  parent  1G     1G    1%    bro
>> worker-1-1  worker   192.168.0.10  15228  parent  2G     2G    65%   bro
>> worker-1-1  worker   192.168.0.10  15398  child   514M   11M   10%   bro
>> worker-1-10 worker   192.168.0.10  15230  parent  2G     2G    53%   bro
>> worker-1-10 worker   192.168.0.10  15407  child   514M   12M   8%    bro
>> worker-1-11 worker   192.168.0.10  15234  parent  2G     2G    78%   bro
>> worker-1-11 worker   192.168.0.10  15286  child   514M   9M    11%   bro
>> worker-1-12 worker   192.168.0.10  15235  parent  2G     2G    67%   bro
>> worker-1-12 worker   192.168.0.10  15267  child   514M   8M    12%   bro
>> worker-1-13 worker   192.168.0.10  15237  parent  2G     2G    82%   bro
>> worker-1-13 worker   192.168.0.10  15392  child   514M   9M    12%   bro
>> worker-1-14 worker   192.168.0.10  15238  parent  2G     2G    43%   bro
>> worker-1-14 worker   192.168.0.10  15264  child   514M   11M   8%    bro
>> worker-1-15 worker   192.168.0.10  15240  parent  2G     2G    76%   bro
>> worker-1-15 worker   192.168.0.10  15300  child   514M   7M    9%    bro
>> worker-1-16 worker   192.168.0.10  15243  parent  2G     2G    94%   bro
>> worker-1-16 worker   192.168.0.10  15404  child   514M   11M   9%    bro
>> worker-1-17 worker   192.168.0.10  15244  parent  2G     2G    67%   bro
>> worker-1-17 worker   192.168.0.10  15383  child   514M   8M    8%    bro
>> worker-1-18 worker   192.168.0.10  15246  parent  2G     2G    80%   bro
>> worker-1-18 worker   192.168.0.10  15372  child   514M   12M   11%   bro
>> worker-1-19 worker   192.168.0.10  15248  parent  2G     2G    76%   bro
>> worker-1-19 worker   192.168.0.10  15376  child   514M   8M    8%    bro
>> worker-1-2  worker   192.168.0.10  15251  parent  2G     2G    83%   bro
>> worker-1-2  worker   192.168.0.10  15414  child   514M   11M   10%   bro
>> worker-1-20 worker   192.168.0.10  15254  parent  2G     2G    86%   bro
>> worker-1-20 worker   192.168.0.10  15417  child   514M   12M   11%   bro
>> worker-1-3  worker   192.168.0.10  15253  parent  2G     2G    55%   bro
>> worker-1-3  worker   192.168.0.10  15375  child   514M   8M    12%   bro
>> worker-1-4  worker   192.168.0.10  15256  parent  2G     2G    87%   bro
>> worker-1-4  worker   192.168.0.10  15388  child   515M   8M    10%   bro
>> worker-1-5  worker   192.168.0.10  15257  parent  2G     2G    58%   bro
>> worker-1-5  worker   192.168.0.10  15395  child   515M   11M   10%   bro
>> worker-1-6  worker   192.168.0.10  15258  parent  2G     2G    96%   bro
>> worker-1-6  worker   192.168.0.10  15394  child   514M   11M   8%    bro
>> worker-1-7  worker   192.168.0.10  15259  parent  2G     2G    65%   bro
>> worker-1-7  worker   192.168.0.10  15413  child   514M   12M   6%    bro
>> worker-1-8  worker   192.168.0.10  15260  parent  2G     2G    99%   bro
>> worker-1-8  worker   192.168.0.10  15401  child   514M   11M   8%    bro
>> worker-1-9  worker   192.168.0.10  15261  parent  2G     2G    61%   bro
>> worker-1-9  worker   192.168.0.10  15408  child   514M   11M   8%    bro
>> worker-2-1  worker   192.168.0.11  29961  parent  2G     2G    85%   bro
>> worker-2-1  worker   192.168.0.11  29984  child   514M   31M   9%    bro
>> worker-2-10 worker   192.168.0.11  29959  parent  2G     2G    52%   bro
>> worker-2-10 worker   192.168.0.11  30085  child   515M   31M   8%    bro
>> worker-2-11 worker   192.168.0.11  29960  parent  2G     2G    96%   bro
>> worker-2-11 worker   192.168.0.11  30112  child   514M   31M   10%   bro
>> worker-2-12 worker   192.168.0.11  29973  parent  2G     2G    54%   bro
>> worker-2-12 worker   192.168.0.11  30082  child   514M   30M   8%    bro
>> worker-2-13 worker   192.168.0.11  29967  parent  2G     2G    93%   bro
>> worker-2-13 worker   192.168.0.11  30111  child   514M   31M   10%   bro
>> worker-2-14 worker   192.168.0.11  29962  parent  2G     2G    100%  bro
>> worker-2-14 worker   192.168.0.11  30076  child   514M   30M   8%    bro
>> worker-2-15 worker   192.168.0.11  29975  parent  2G     2G    55%   bro
>> worker-2-15 worker   192.168.0.11  30138  child   514M   31M   10%   bro
>> worker-2-16 worker   192.168.0.11  29965  parent  2G     2G    85%   bro
>> worker-2-16 worker   192.168.0.11  29994  child   514M   31M   8%    bro
>> worker-2-17 worker   192.168.0.11  29968  parent  2G     2G    76%   bro
>> worker-2-17 worker   192.168.0.11  30097  child   514M   31M   8%    bro
>> worker-2-18 worker   192.168.0.11  29972  parent  2G     2G    95%   bro
>> worker-2-18 worker   192.168.0.11  30115  child   514M   30M   10%   bro
>> worker-2-19 worker   192.168.0.11  29964  parent  2G     2G    68%   bro
>> worker-2-19 worker   192.168.0.11  30092  child   514M   31M   7%    bro
>> worker-2-2  worker   192.168.0.11  29974  parent  2G     2G    51%   bro
>> worker-2-2  worker   192.168.0.11  30133  child   514M   31M   7%    bro
>> worker-2-20 worker   192.168.0.11  29966  parent  2G     2G    59%   bro
>> worker-2-20 worker   192.168.0.11  29981  child   514M   30M   10%   bro
>> worker-2-3  worker   192.168.0.11  29969  parent  2G     2G    95%   bro
>> worker-2-3  worker   192.168.0.11  30095  child   514M   31M   8%    bro
>> worker-2-4  worker   192.168.0.11  29970  parent  2G     2G    95%   bro
>> worker-2-4  worker   192.168.0.11  30137  child   514M   30M   8%    bro
>> worker-2-5  worker   192.168.0.11  29977  parent  2G     2G    84%   bro
>> worker-2-5  worker   192.168.0.11  30100  child   514M   31M   10%   bro
>> worker-2-6  worker   192.168.0.11  29978  parent  2G     2G    73%   bro
>> worker-2-6  worker   192.168.0.11  29990  child   514M   30M   8%    bro
>> worker-2-7  worker   192.168.0.11  29976  parent  2G     2G    76%   bro
>> worker-2-7  worker   192.168.0.11  30081  child   514M   31M   10%   bro
>> worker-2-8  worker   192.168.0.11  29963  parent  2G     2G    57%   bro
>> worker-2-8  worker   192.168.0.11  29987  child   514M   30M   8%    bro
>> worker-2-9  worker   192.168.0.11  29971  parent  2G     2G    52%   bro
>> worker-2-9  worker   192.168.0.11  30096  child   514M   31M   10%   bro
>>
>> bro-1 $ free -g
>>              total       used       free     shared    buffers     cached
>> Mem:            62         62          0          0          0         17
>> -/+ buffers/cache:         44         17
>> Swap:            0          0          0
>>
>> bro-2 $ free -g
>>              total       used       free     shared    buffers     cached
>> Mem:            62         45         17          0          0          1
>> -/+ buffers/cache:         44         18
>> Swap:            0          0          0
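
As a sanity check, the per-process Rss figures from `broctl top` can be added up and compared with the "-/+ buffers/cache" used value from `free -g` (44G on both hosts). The sketch below is only a back-of-the-envelope estimate: broctl top rounds to whole gigabytes, the small child processes are ignored, and the 0.8G manager figure is my rounding of 736M + 93M. The function names are hypothetical.

```python
# Rough estimate only: inputs are the rounded Rss values from broctl top,
# and estimate_rss_gb()/headroom_gb() are illustrative helper names.

def estimate_rss_gb(workers, worker_rss_gb, proxies, proxy_rss_gb,
                    manager_rss_gb=0.0):
    """Sum the resident memory of the parent processes on one host."""
    return workers * worker_rss_gb + proxies * proxy_rss_gb + manager_rss_gb

def headroom_gb(total_gb, used_gb):
    """Memory left over once the estimated Bro usage is subtracted."""
    return total_gb - used_gb

# bro-1 runs the manager in addition to 20 workers and 2 proxies.
bro1_gb = estimate_rss_gb(20, 2.0, 2, 1.0, manager_rss_gb=0.8)
# bro-2 runs 20 workers and 2 proxies only.
bro2_gb = estimate_rss_gb(20, 2.0, 2, 1.0)

for host, used in (("bro-1", bro1_gb), ("bro-2", bro2_gb)):
    print(f"{host}: ~{used:.1f}G in Bro processes, "
          f"~{headroom_gb(62, used):.1f}G headroom of 62G")
```

Both estimates land within a couple of gigabytes of what free reports as used, so at the time of this snapshot the Bro processes appear to account for nearly all of the non-cache memory.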
>>
>> What do you guys think?
>>
>> Regards,
>> Gary
>>
>> PS ~
>>
>> I've been reading the mailing list archives and it seems that folks
>> with the older Xeons with higher clock rates (around 3.4GHz), but
>> fewer cores, were able to handle upwards of 400-500Mbps per worker
>> process. I've also seen it hinted, I think by Vlad G., that he was
>> fitting in 28 workers on boxes with similar core counts to my own,
>> but slightly faster processors. Based on some of those remarks in
>> previous threads I was thinking I should be able to handle a little
>> over 300Mbps per process with these processors, but I've only had the
>> traffic to push about 200Mbps per worker so far.
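
For reference, the per-worker numbers work out as follows. This is a simple sketch that assumes traffic is spread evenly across all 40 workers (which the uneven Cpu column suggests is only approximately true) and takes 8000 Mbps as the normal load from the "4Gbps vs 8Gbps" remark above; the 300 Mbps per worker figure is my hoped-for capacity, not a measured one. The helper names are illustrative.

```python
import math

# Illustrative helpers; an even split of traffic across workers is assumed.

def mbps_per_worker(total_mbps, workers):
    """Average load each worker sees if traffic is balanced evenly."""
    return total_mbps / workers

def workers_needed(total_mbps, capacity_mbps_per_worker):
    """Smallest worker count that keeps each worker at or under capacity."""
    return math.ceil(total_mbps / capacity_mbps_per_worker)

TOTAL_WORKERS = 40  # 20 workers on each of the two hosts

light_load = mbps_per_worker(4592.1, TOTAL_WORKERS)   # capstats total above
normal_load = mbps_per_worker(8000, TOTAL_WORKERS)    # assumed 8Gbps normal load
needed_at_300 = workers_needed(8000, 300)             # hoped-for 300 Mbps/worker

print(f"light load:  {light_load:.1f} Mbps per worker")
print(f"normal load: {normal_load:.1f} Mbps per worker")
print(f"workers needed for 8000 Mbps at 300 Mbps each: {needed_at_300}")
```

If the 300 Mbps per worker figure held up, the normal load could in principle be handled by fewer workers than I run today, which is part of why the one-worker-per-physical-core suggestion is worth testing.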
>>
>> I know some folks also like to put the manager and possibly the
>> proxies on separate boxes from the workers, but I haven't gotten a
>> good sense as to what kind of workload a proxy can handle. As for
>> proxies, I've mostly seen comments such as "I probably have way more
>> proxies than I need" or "Just keep adding proxies until they stop
>> crashing". I don't currently have a spare box for the manager and
>> proxies, but would be curious to know if folks feel it is a necessity.
>> My observations on my own setup are that my Bro workers typically use
>> 99% of a logical core at peak network times, and my manager 150-175%
>> (multi-threaded). My workers seem to use about 1-2G of memory
>> normally.
>>
>>
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> Bro mailing list
>> bro at bro-ids.org
>> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro

