[Bro] building a new bro server

Gary Faulkner gfaulkner.nsm at gmail.com
Tue Dec 9 13:25:04 PST 2014


For perspective, I currently have a bro cluster made up of 3 physical 
hosts. The first host runs the manager and proxies, and has storage to 
handle lots of bro logs and keep them for several months; the other two 
are dedicated to workers with relatively little storage. We have a 
hardware load-balancer to distribute traffic as evenly as possible 
between the worker nodes, and some effort has been made to limit having 
to process really large uninteresting flows before they reach the 
cluster. I looked at one of our typically busier blocks of time today 
(10:00-14:00) and during that time the cluster was seeing an average of 
10Gbps of traffic with peaks as high as 15Gbps. Our traffic graphs and 
capstats showed each host typically seeing around 50% of 
that load, or around 5Gbps on average. During this time we saw an 
average capture loss of around 0.47%, with a max loss of 22.53%. During 
that same time-frame I had 18 snapshots where individual workers 
reported loss over 5%, and 2 over 10% out of 748. So, I'd say each host 
is probably seeing about the same amount of traffic as you have 
described, but loaded scripts etc. may differ from your configuration. We 
have 22 workers per host for a total of 44 workers, and I believe the 
capture loss script is sampling traffic over 15 minute intervals by 
default, so there are roughly 17 time slices for each worker. Here are 
some details of how those nodes are configured in terms of hardware and bro.
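
As a rough check on how the loss figures above fit together, here is a small Python sketch. It uses only the numbers quoted in this message; the 15 minute interval is the believed default mentioned above, not a verified setting, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check using only the figures quoted in this message.
workers = 44          # 22 workers on each of the 2 worker hosts
snapshots = 748       # capture loss reports seen in the 10:00-14:00 window
over_5_pct = 18       # reports where a worker showed more than 5% loss
over_10_pct = 2       # reports where a worker showed more than 10% loss

# Reports per worker implied by the totals above.
slices_per_worker = snapshots / workers

# Share of all reports that crossed each threshold.
share_over_5 = 100.0 * over_5_pct / snapshots
share_over_10 = 100.0 * over_10_pct / snapshots

# Assumption: a 15 minute sampling interval over the 4 hour window.
expected_slices = (4 * 60) / 15

print(f"reports per worker:      {slices_per_worker:.1f}")
print(f"expected from interval:  {expected_slices:.0f}")
print(f"reports over 5% loss:    {share_over_5:.2f}%")
print(f"reports over 10% loss:   {share_over_10:.2f}%")
```

The small gap between the two slice counts is what you would expect if a report lands on each edge of the window, which is why "roughly 17" is the right way to read it.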

2 worker hosts each with:
2xE5-2697v2 (12 Cores / 24 HT) 2.7GHz/3.5GHz Turbo
256GB RAM (probably overkill, but I used to have the manager and proxies 
running on one of the hosts and it skewed my memory use quite a bit)
Intel X520-DA2 NIC
Bro 2.3-7 (git master at the time I last updated)
22 workers
PF_RING 5.6.2 using DNA IXGBE drivers, and pfdnacluster_master script
CPUs pinned (I used the OS to check which logical core maps to which 
physical core, to avoid pinning 2 workers to the same physical 
core, and didn't use the 1st core on each CPU)
HT is not disabled on these hosts and I'm still using the OS malloc.

Worker configs like this:
[worker-1]
type=worker
host=10.10.10.10
interface=dnacluster:21
lb_procs=22
lb_method=pf_ring
pin_cpus=2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
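
One way to derive a pin_cpus list like the one above is to read the core topology from Linux sysfs and keep one logical CPU per physical core. The sketch below is not the method used for this cluster, just an illustration of the idea. It assumes a Linux host that exposes /sys/devices/system/cpu/cpuN/topology, and the function names read_topology and pick_pin_cpus are made up for this example. Logical CPU numbering differs between machines, so the output will not necessarily match the list in the config above.

```python
import glob
import os
import re


def read_topology(sysfs_root="/sys/devices/system/cpu"):
    """Return {logical_cpu: (socket_id, core_id)} as reported by Linux sysfs."""
    topology = {}
    for cpu_dir in glob.glob(os.path.join(sysfs_root, "cpu[0-9]*")):
        match = re.fullmatch(r"cpu(\d+)", os.path.basename(cpu_dir))
        topo_dir = os.path.join(cpu_dir, "topology")
        # Offline CPUs may have no topology directory; skip them.
        if not match or not os.path.isdir(topo_dir):
            continue
        with open(os.path.join(topo_dir, "physical_package_id")) as f:
            socket_id = int(f.read())
        with open(os.path.join(topo_dir, "core_id")) as f:
            core_id = int(f.read())
        topology[int(match.group(1))] = (socket_id, core_id)
    return topology


def pick_pin_cpus(topology, skip_first_core=True):
    """Pick one logical CPU per physical core.

    With skip_first_core=True, the lowest-numbered core on each socket
    is left free for the OS and interrupt handling.
    """
    # Keep the lowest-numbered logical CPU for each (socket, core) pair,
    # which drops the hyper-threading sibling of every core.
    first_cpu_for_core = {}
    for cpu in sorted(topology):
        first_cpu_for_core.setdefault(topology[cpu], cpu)

    by_socket = {}
    for (socket_id, core_id), cpu in first_cpu_for_core.items():
        by_socket.setdefault(socket_id, []).append((core_id, cpu))

    chosen = []
    for socket_id in sorted(by_socket):
        cores = sorted(by_socket[socket_id])
        if skip_first_core:
            cores = cores[1:]
        chosen.extend(cpu for _, cpu in cores)
    return sorted(chosen)


if __name__ == "__main__":
    cpus = pick_pin_cpus(read_topology())
    print("lb_procs=%d" % len(cpus))
    print("pin_cpus=" + ",".join(str(c) for c in cpus))
```

On a 2 socket, 12 core per socket host with HT enabled this yields 22 CPUs, which lines up with lb_procs=22. It is still worth comparing the result against lscpu or /proc/cpuinfo before trusting it.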

I suspect faster CPUs will handle bursty flows better, such as when a 
large volume of traffic load balances to a single worker, while more 
cores will probably help when you can distribute the workload 
more evenly. This led me to try to pick something that balanced the 2 
options (more cores vs higher clock speed). Naturally YMMV, and your 
traffic may not look like mine.

Hope this helps.

Regards,
Gary

On 12/9/2014 12:00 PM, Seth Hall wrote:
>> On Dec 8, 2014, at 10:57 PM, Allen, Brian <BrianAllen at wustl.edu> wrote:
>>
>> We saw a huge improvement when we went from 16Gig RAM to 128Gig RAM. (That one was pretty obvious so we did that first).  We also saw improvement when we pinned the processes to the cores.
> I think I had also suggested that you move to tcmalloc.  Have you tried that yet?  It’s not going to fix your issue with 30% packet loss, but I expect it would cut it down a bit further.
>   
>    .Seth
>
> --
> Seth Hall
> International Computer Science Institute
> (Bro) because everyone has a network
> http://www.bro.org/
>



