[Bro] Sizing a Bro Cluster [was out of memory after a couple days]
Gary Faulkner
gary at doit.wisc.edu
Fri Dec 6 18:40:33 PST 2013
Bob,
So, probably closer to 10-12 workers per host in my case, since the
proxies and manager are there? It was suggested to me early on to try to
acquire another box so the manager and proxies could be separated from
the workers, but I don't have one quite yet, so I've been trying to make
it work as is. It didn't seem like the worker child processes needed
much CPU time, so I thought I could push the worker count higher. It
also seemed like I got less loss per broctl netstats, but others have
suggested that the broctl netstats command may not be the most reliable
way to judge that. I ended up at 4 proxies mostly because two didn't
seem stable and I like symmetry, so I jumped straight to 4.
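
For reference, the rule of thumb Bob describes in the quoted messages below (one worker per physical core, minus one core each for the manager and every proxy on the same box, optionally keeping a couple of cores free for the system) can be sketched in a few lines of Python. The function name and the `reserved` parameter are made up for illustration; this is plain arithmetic, not anything broctl provides.

```python
def suggested_workers(physical_cores, managers=0, proxies=0, reserved=2):
    """Rough worker count for one host.

    Starts from one worker per physical core, subtracts a core for each
    manager and proxy sharing the box, and keeps `reserved` cores free
    for system tasks. `reserved` is a hypothetical knob, defaulting to
    the "couple of cores" mentioned in the thread.
    """
    budget = physical_cores - managers - proxies - reserved
    # Never suggest fewer than one worker, even on a tiny box.
    return max(budget, 1)


# bro-1 in this thread: 16 physical cores, manager + 2 proxies on the box.
print(suggested_workers(16, managers=1, proxies=2))              # 11
# Same box without holding any cores back for the system.
print(suggested_workers(16, managers=1, proxies=2, reserved=0))  # 13
# bro-2: 16 physical cores, 2 proxies, no manager.
print(suggested_workers(16, proxies=2))                          # 12
```

That lands in the 10-12 range for these hosts when a couple of cores are held back.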
Thanks again,
Gary
On 12/6/2013 7:57 PM, Bob wrote:
> Gary,
>
> Ah, my apologies for my misreading of your specs. Two hosts seeing 2-4Gb/s each should be just fine.
>
> As far as my recommendation, I definitely meant real cores. I find, based on my own experimentation, that taking your number of workers higher than the number of physical cores in the box tends to hurt the performance instead of helping. In the case where a box was also doing manager/proxy duty, I'd subtract a physical core for each of those as well (I've broken those out into a separate box to avoid that issue on mine). I don't recall seeing anything from Vlad suggesting going higher, but I can't rule out the possibility that I missed it.
>
> Also to answer a question that I overlooked earlier, I'm running 3 proxies on that workload. I got to that number by just increasing by one until it stabilized.
>
> Bob
>
> Gary Faulkner <gary at doit.wisc.edu> wrote:
>> Bob,
>>
>> Thank you for the feedback. To clarify, I am currently using two physical
>> hosts clustered together (using broctl), so each box ends up with 64G of
>> RAM and 16 cores/32 threads and Intel IXGBE 10G cards +
>> pf_ring/DNA/libzero for distributing packets on the host. Each physical
>> host then sees between 2-4Gbps and has 20 workers + 2 proxies. I recall
>> reading a blog entry by Martin Holste where he mentioned allocating only
>> half as many workers as you had logical cores/threads, but I also seem to
>> recall others (Vlad G.) pushing nearly as many workers as logical cores,
>> though I could have read too much into it. Are you referring to physical
>> cores or logical cores/threads? If it is the former, I think what you are
>> saying is in line with what Martin suggested, although I had hoped I
>> could push the worker count a bit higher based on what I thought I had
>> read elsewhere.
>>
>> Regards,
>> Gary
>>
>> On 12/6/2013 7:08 PM, Bob wrote:
>>> Gary,
>>>
>>> Realistically, for that load I would recommend looking into a
>>> cluster. My personal sizing criterion is 4-5Gb/s max per box. That's
>>> for a box that's roughly the same as yours and that uses up about
>>> 40-50GB of RAM per box (although for as cheap as it is, I recommend
>>> always guessing high on the RAM). For this size of box, you can probably
>>> improve your performance by reducing the number of workers (one per
>>> real core is a good benchmark). I am conservative, so I like to keep a
>>> couple of cores free for system tasks to ensure reliable performance
>>> (setting the number of workers to something like 12 or 14), but that's
>>> up to you.
>>>
>>> As a caveat, the aforementioned recommendations assume that you're
>>> using a network card that's designed to do this work (like an Intel
>>> ixgbe card with pf_ring or a Myricom). If your card can't bypass the
>>> kernel's interfaces, then you're going to need a lot more hardware to
>>> get the same performance because you're spending CPU time shoving the
>>> packets through the kernel instead of just accessing them directly on
>>> the NIC.
>>>
>>> That's my two cents.
>>>
>>> Bob
>>>
>>> Gary Faulkner <gary at doit.wisc.edu> wrote:
>>>> I've had some proxy crashes in the past and it was suggested that I
>>>> increase my number of proxies -- which I did until my environment
>>>> appeared stable for about a week. After being stable for about a week I
>>>> started to run out of memory, and in subsequent restarts have been
>>>> running out of memory after about 24 hours of operation, typically
>>>> during non-peak times (50% of normal traffic). Naturally I'm wondering
>>>> if I'm just doing it wrong and if my set-up is appropriately sized and
>>>> configured to handle the load I'm asking it to deal with.
>>>>
>>>> I think I've seen folks on the list that were running Bro on similar
>>>> hardware that might be able to tell me if my configuration is anything
>>>> close to what works for them. I'm also curious how other folks determine
>>>> how many proxies they need, how many workers per host, etc.
>>>>
>>>> I'm mostly running Bro 2.2 stock with default scripts, and only minor
>>>> edits to local.bro to test out email notices. I'm only using these
>>>> systems for Bro, although they were originally from another project so
>>>> they weren't necessarily ordered with Bro specs in mind.
>>>>
>>>> Here's how I've got things allocated currently:
>>>>
>>>> Bro-1 Host:
>>>> 2ea Xeon E5-2670 at 2.6GHz (32 combined Logical Cores / 16 Physical)
>>>> 64G RAM
>>>> manager
>>>> 2 proxies
>>>> 20 workers
>>>> 2-4 Gbps of traffic
>>>>
>>>> Bro-2 Host:
>>>> 2ea Xeon E5-2670 at 2.6GHz (32 combined Logical Cores / 16 Physical)
>>>> 64G RAM
>>>> 2 proxies
>>>> 20 workers
>>>> 2-4 Gbps of traffic
>>>>
>>>> The following is a relatively light traffic load (late on a Friday) for
>>>> my install (4Gbps vs 8Gbps):
>>>>
>>>> bro-1 $ ./broctl capstats
>>>>
>>>> Interface                    kpps    mbps    (10s average)
>>>> ----------------------------------------------------------
>>>> 192.168.0.10/dnacluster:21   338.6   2327.4
>>>> 192.168.0.11/dnacluster:22   324.8   2264.7
>>>>
>>>> Total                        663.4   4592.1
>>>>
>>>> bro-1 $ ./broctl top
>>>> Name         Type     Node          Pid    Proc    VSize  Rss   Cpu   Cmd
>>>> manager      manager  192.168.0.10  14816  parent  2G     736M  88%   bro
>>>> manager      manager  192.168.0.10  14817  child   169M   93M   44%   bro
>>>> proxy-1      proxy    192.168.0.10  14863  child   102M   26M   23%   bro
>>>> proxy-1      proxy    192.168.0.10  14860  parent  1G     1G    3%    bro
>>>> proxy-2      proxy    192.168.0.10  14862  child   102M   28M   27%   bro
>>>> proxy-2      proxy    192.168.0.10  14861  parent  1G     1G    3%    bro
>>>> proxy-3      proxy    192.168.0.11  28900  child   102M   46M   20%   bro
>>>> proxy-3      proxy    192.168.0.11  28898  parent  1G     1G    1%    bro
>>>> proxy-4      proxy    192.168.0.11  28899  child   102M   45M   21%   bro
>>>> proxy-4      proxy    192.168.0.11  28897  parent  1G     1G    1%    bro
>>>> worker-1-1   worker   192.168.0.10  15228  parent  2G     2G    65%   bro
>>>> worker-1-1   worker   192.168.0.10  15398  child   514M   11M   10%   bro
>>>> worker-1-10  worker   192.168.0.10  15230  parent  2G     2G    53%   bro
>>>> worker-1-10  worker   192.168.0.10  15407  child   514M   12M   8%    bro
>>>> worker-1-11  worker   192.168.0.10  15234  parent  2G     2G    78%   bro
>>>> worker-1-11  worker   192.168.0.10  15286  child   514M   9M    11%   bro
>>>> worker-1-12  worker   192.168.0.10  15235  parent  2G     2G    67%   bro
>>>> worker-1-12  worker   192.168.0.10  15267  child   514M   8M    12%   bro
>>>> worker-1-13  worker   192.168.0.10  15237  parent  2G     2G    82%   bro
>>>> worker-1-13  worker   192.168.0.10  15392  child   514M   9M    12%   bro
>>>> worker-1-14  worker   192.168.0.10  15238  parent  2G     2G    43%   bro
>>>> worker-1-14  worker   192.168.0.10  15264  child   514M   11M   8%    bro
>>>> worker-1-15  worker   192.168.0.10  15240  parent  2G     2G    76%   bro
>>>> worker-1-15  worker   192.168.0.10  15300  child   514M   7M    9%    bro
>>>> worker-1-16  worker   192.168.0.10  15243  parent  2G     2G    94%   bro
>>>> worker-1-16  worker   192.168.0.10  15404  child   514M   11M   9%    bro
>>>> worker-1-17  worker   192.168.0.10  15244  parent  2G     2G    67%   bro
>>>> worker-1-17  worker   192.168.0.10  15383  child   514M   8M    8%    bro
>>>> worker-1-18  worker   192.168.0.10  15246  parent  2G     2G    80%   bro
>>>> worker-1-18  worker   192.168.0.10  15372  child   514M   12M   11%   bro
>>>> worker-1-19  worker   192.168.0.10  15248  parent  2G     2G    76%   bro
>>>> worker-1-19  worker   192.168.0.10  15376  child   514M   8M    8%    bro
>>>> worker-1-2   worker   192.168.0.10  15251  parent  2G     2G    83%   bro
>>>> worker-1-2   worker   192.168.0.10  15414  child   514M   11M   10%   bro
>>>> worker-1-20  worker   192.168.0.10  15254  parent  2G     2G    86%   bro
>>>> worker-1-20  worker   192.168.0.10  15417  child   514M   12M   11%   bro
>>>> worker-1-3   worker   192.168.0.10  15253  parent  2G     2G    55%   bro
>>>> worker-1-3   worker   192.168.0.10  15375  child   514M   8M    12%   bro
>>>> worker-1-4   worker   192.168.0.10  15256  parent  2G     2G    87%   bro
>>>> worker-1-4   worker   192.168.0.10  15388  child   515M   8M    10%   bro
>>>> worker-1-5   worker   192.168.0.10  15257  parent  2G     2G    58%   bro
>>>> worker-1-5   worker   192.168.0.10  15395  child   515M   11M   10%   bro
>>>> worker-1-6   worker   192.168.0.10  15258  parent  2G     2G    96%   bro
>>>> worker-1-6   worker   192.168.0.10  15394  child   514M   11M   8%    bro
>>>> worker-1-7   worker   192.168.0.10  15259  parent  2G     2G    65%   bro
>>>> worker-1-7   worker   192.168.0.10  15413  child   514M   12M   6%    bro
>>>> worker-1-8   worker   192.168.0.10  15260  parent  2G     2G    99%   bro
>>>> worker-1-8   worker   192.168.0.10  15401  child   514M   11M   8%    bro
>>>> worker-1-9   worker   192.168.0.10  15261  parent  2G     2G    61%   bro
>>>> worker-1-9   worker   192.168.0.10  15408  child   514M   11M   8%    bro
>>>> worker-2-1   worker   192.168.0.11  29961  parent  2G     2G    85%   bro
>>>> worker-2-1   worker   192.168.0.11  29984  child   514M   31M   9%    bro
>>>> worker-2-10  worker   192.168.0.11  29959  parent  2G     2G    52%   bro
>>>> worker-2-10  worker   192.168.0.11  30085  child   515M   31M   8%    bro
>>>> worker-2-11  worker   192.168.0.11  29960  parent  2G     2G    96%   bro
>>>> worker-2-11  worker   192.168.0.11  30112  child   514M   31M   10%   bro
>>>> worker-2-12  worker   192.168.0.11  29973  parent  2G     2G    54%   bro
>>>> worker-2-12  worker   192.168.0.11  30082  child   514M   30M   8%    bro
>>>> worker-2-13  worker   192.168.0.11  29967  parent  2G     2G    93%   bro
>>>> worker-2-13  worker   192.168.0.11  30111  child   514M   31M   10%   bro
>>>> worker-2-14  worker   192.168.0.11  29962  parent  2G     2G    100%  bro
>>>> worker-2-14  worker   192.168.0.11  30076  child   514M   30M   8%    bro
>>>> worker-2-15  worker   192.168.0.11  29975  parent  2G     2G    55%   bro
>>>> worker-2-15  worker   192.168.0.11  30138  child   514M   31M   10%   bro
>>>> worker-2-16  worker   192.168.0.11  29965  parent  2G     2G    85%   bro
>>>> worker-2-16  worker   192.168.0.11  29994  child   514M   31M   8%    bro
>>>> worker-2-17  worker   192.168.0.11  29968  parent  2G     2G    76%   bro
>>>> worker-2-17  worker   192.168.0.11  30097  child   514M   31M   8%    bro
>>>> worker-2-18  worker   192.168.0.11  29972  parent  2G     2G    95%   bro
>>>> worker-2-18  worker   192.168.0.11  30115  child   514M   30M   10%   bro
>>>> worker-2-19  worker   192.168.0.11  29964  parent  2G     2G    68%   bro
>>>> worker-2-19  worker   192.168.0.11  30092  child   514M   31M   7%    bro
>>>> worker-2-2   worker   192.168.0.11  29974  parent  2G     2G    51%   bro
>>>> worker-2-2   worker   192.168.0.11  30133  child   514M   31M   7%    bro
>>>> worker-2-20  worker   192.168.0.11  29966  parent  2G     2G    59%   bro
>>>> worker-2-20  worker   192.168.0.11  29981  child   514M   30M   10%   bro
>>>> worker-2-3   worker   192.168.0.11  29969  parent  2G     2G    95%   bro
>>>> worker-2-3   worker   192.168.0.11  30095  child   514M   31M   8%    bro
>>>> worker-2-4   worker   192.168.0.11  29970  parent  2G     2G    95%   bro
>>>> worker-2-4   worker   192.168.0.11  30137  child   514M   30M   8%    bro
>>>> worker-2-5   worker   192.168.0.11  29977  parent  2G     2G    84%   bro
>>>> worker-2-5   worker   192.168.0.11  30100  child   514M   31M   10%   bro
>>>> worker-2-6   worker   192.168.0.11  29978  parent  2G     2G    73%   bro
>>>> worker-2-6   worker   192.168.0.11  29990  child   514M   30M   8%    bro
>>>> worker-2-7   worker   192.168.0.11  29976  parent  2G     2G    76%   bro
>>>> worker-2-7   worker   192.168.0.11  30081  child   514M   31M   10%   bro
>>>> worker-2-8   worker   192.168.0.11  29963  parent  2G     2G    57%   bro
>>>> worker-2-8   worker   192.168.0.11  29987  child   514M   30M   8%    bro
>>>> worker-2-9   worker   192.168.0.11  29971  parent  2G     2G    52%   bro
>>>> worker-2-9   worker   192.168.0.11  30096  child   514M   31M   10%   bro
>>>>
>>>> bro-1 $ free -g
>>>>                     total   used   free   shared   buffers   cached
>>>> Mem:                   62     62      0        0         0       17
>>>> -/+ buffers/cache:            44     17
>>>> Swap:                   0      0      0
>>>>
>>>> bro-2 $ free -g
>>>>                     total   used   free   shared   buffers   cached
>>>> Mem:                   62     45     17        0         0        1
>>>> -/+ buffers/cache:            44     18
>>>> Swap:                   0      0      0
>>>>
>>>> What do you guys think?
>>>>
>>>> Regards,
>>>> Gary
>>>>
>>>> PS ~
>>>>
>>>> I've been reading the mailing list archives and it seems that folks with
>>>> the older Xeons with higher clock rates (around 3.4GHz) but fewer cores
>>>> were able to handle upwards of 400-500Mbps per worker process. I've also
>>>> seen it hinted, I think by Vlad G., that he was fitting in 28 workers on
>>>> boxes with similar core counts to my own, but slightly faster
>>>> processors. Based on some of those remarks in previous threads I was
>>>> thinking I should be able to handle a little over 300Mbps per process
>>>> with these processors, but I've only had the traffic to push about
>>>> 200Mbps per worker so far.
>>>>
>>>> I know some folks also like to put the manager and possibly the proxies
>>>> on separate boxes from the workers, but I haven't gotten a good sense as
>>>> to what kind of workload a proxy can handle. As far as proxies, I've
>>>> mostly seen comments such as "I probably have way more proxies than I
>>>> need" or "Just keep adding proxies until they stop crashing". I don't
>>>> currently have a spare box for the manager and proxy, but would be
>>>> curious to know if folks feel it is a necessity. My observations on my
>>>> own setup are that my Bro workers typically are using 99% of a logical
>>>> core at peak network times, and my manager 150-175% (multi-threaded). My
>>>> workers seem to use about 1-2G of memory normally.
>>>>
>>>> ------------------------------------------------------------------------
>>>> _______________________________________________
>>>> Bro mailing list
>>>> bro at bro-ids.org
>>>> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro
>
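
As a rough cross-check of the numbers quoted above, the per-process RSS values from the broctl top listing can be added up and compared with what free -g reports, and the capstats total can be divided across the workers. The sketch below does that for bro-1 in Python. The helper names and the grouped process list are hypothetical conveniences; the child-process sizes are approximations of the listing, and broctl rounds sizes to whole units ("2G"), so treat the result as a ballpark only.

```python
def parse_size_gb(text):
    """Convert a broctl-top style size such as '2G' or '736M' to gigabytes."""
    units = {"K": 1 / (1024 * 1024), "M": 1 / 1024, "G": 1.0}
    return float(text[:-1]) * units[text[-1]]


def host_rss_gb(process_groups):
    """Sum resident memory over (count, rss) pairs for one host."""
    return sum(count * parse_size_gb(rss) for count, rss in process_groups)


def per_worker_mbps(total_mbps, workers):
    """Average traffic per worker, assuming an even split across workers."""
    return total_mbps / workers


# bro-1 processes, grouped by kind (values taken from the listing above).
BRO1_PROCESSES = [
    (1, "736M"),  # manager parent
    (1, "93M"),   # manager child
    (2, "1G"),    # proxy parents
    (2, "27M"),   # proxy children (26M and 28M in the listing)
    (20, "2G"),   # worker parents
    (20, "10M"),  # worker children (7M-12M in the listing)
]

print(round(host_rss_gb(BRO1_PROCESSES), 1))      # 43.1
# capstats total across both hosts, split over 40 workers.
print(round(per_worker_mbps(4592.1, 40), 1))      # 114.8
# The same split at the 8Gbps figure mentioned in the thread.
print(per_worker_mbps(8000, 40))                  # 200.0
```

The tally of about 43G is close to the 44G that free -g shows as used once buffers and cache are excluded, and 8Gbps over 40 workers matches the roughly 200Mbps per worker Gary mentions.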