[Bro] Bro Packet Loss / 10gb ixgbe / pf_ring

Nash, Paul Paul.Nash at tufts.edu
Thu Jan 7 16:22:08 PST 2016


Thanks Mike -
  I'm using 16 workers because the ixgbe 10gb NIC supports hardware receive-side scaling (RSS), and 16 is the maximum number of queues it supports.  While monitoring traffic this afternoon, I was seeing ~700-800 Mb/s based on pfcount stats.

If I disable the hardware RSS, I'd have to switch over to standard pf_ring or DNA/ZC.  I have a license for ZC, but I've been unable to figure out how to get Bro to monitor all of the zc:eth3 queues.  The current Bro load-balancing documentation only covers pf_ring+DNA, not the newer/supported zero-copy functionality, and I can't find the right "interface=" setting for node.cfg.

"interface=zc:eth3" only monitors one of the queues.
interface="zc:eth3 at 0,zc:eth3 at 1,etc.." causes the workers to crash
interface="zc:eth3 at 0 -i zc:eth3 at 1 -i .." didn't work either.

The PF_RING ZC documentation implies using zbalance_ipc to start up a set of queues under a cluster ID, and then pointing at zc:## where ## is the cluster ID.  I also ran into issues with that.
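
Roughly what I was attempting, for reference (the cluster ID 99, hash mode, and core choice here are placeholders, not a working config):

# fan zc:eth3 out into 16 ZC queues on cluster 99, hashing on IP, distributor pinned to core 0
zbalance_ipc -i zc:eth3 -c 99 -n 16 -m 1 -g 0

with node.cfg pointing the workers at the cluster rather than the NIC queues:

[worker-1]
type=worker
host=10.99.99.15
interface=zc:99
lb_method=pf_ring
lb_procs=16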

For tonight, I'll disable the hardware RSS and switch over to running straight pf_ring with 24 workers.  I'll pin the first 8 so that they are on the same NUMA node as the NIC.  I'm not sure what to do with the other 16 workers, though - does anyone have insight into whether it is better to pin them to the same socket?  I'm on AMD, which isn't as well documented as the Intel world.
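
The node.cfg I'm planning for tonight looks roughly like this (the CPU list beyond 0-7 is a guess on my part, which is exactly the part I'm unsure about):

[worker-1]
type=worker
host=10.99.99.15
interface=eth3
lb_method=pf_ring
lb_procs=24
pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23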


Thanks,
 -Paul

________________________________
From: reevesmk at gmail.com [reevesmk at gmail.com] on behalf of Mike Reeves [luke at geekempire.com]
Sent: Thursday, January 07, 2016 5:15 PM
To: Nash, Paul
Cc: bro at bro.org
Subject: Re: [Bro] Bro Packet Loss / 10gb ixgbe / pf_ring

Change your min_num_slots to 65535.  I would also add an additional proxy and another 8 workers.
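
Roughly like this, keeping your existing paths and pinning (the proxy-2 name and lb_procs=24 are just illustrating the extra proxy and the extra 8 workers):

insmod /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/net/pf_ring/pf_ring.ko enable_tx_capture=0 min_num_slots=65535 quick_mode=1

[proxy-2]
type=proxy
host=10.99.99.15

[worker-1]
type=worker
host=10.99.99.15
interface=eth3
lb_method=pf_ring
lb_procs=24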



On Thu, Jan 7, 2016 at 4:37 PM, Nash, Paul <Paul.Nash at tufts.edu> wrote:

I'm trying to debug some packet drops that I'm experiencing and am turning to the list for help.  The recorded packet loss is ~50-70% at times, and it shows up both in broctl's netstats and in the notice.log file.

Running netstats at startup – I’m dropping more than I’m receiving from the very start.


[BroControl] > netstats

 worker-1-1: 1452200459.635155 recvd=734100 dropped=1689718 link=2424079
worker-1-10: 1452200451.830143 recvd=718461 dropped=1414234 link=718461
worker-1-11: 1452200460.036766 recvd=481010 dropped=2019289 link=2500560
worker-1-12: 1452200460.239585 recvd=720895 dropped=1805574 link=2526730
worker-1-13: 1452200460.440611 recvd=753365 dropped=1800827 link=2554453
worker-1-14: 1452200460.647368 recvd=784145 dropped=1800831 link=2585237
worker-1-15: 1452200460.844842 recvd=750921 dropped=1868186 link=2619368
worker-1-16: 1452200461.049237 recvd=742718 dropped=1908528 link=2651507
…
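
Taking worker-1-1 as an example: dropped / link = 1689718 / 2424079 ≈ 0.70, so roughly 70% of the packets seen on the link never make it to that worker.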

System information:
- 64-core AMD Opteron system
- 128 GB of RAM
- Intel 10 Gb ixgbe NIC (dual 10 Gb interfaces, eth3 is the sniffer)
- Licensed copy of PF_Ring ZC

I'm running Bro 2.4.1 and PF_RING 6.2.0 on CentOS 6 with the 2.6.32-431 kernel.

I have the proxy, manager, and 16 workers running on the same system, with 16 CPUs pinned (0-15).

Startup scripts to load the various kernel modules (from PF_RING 6.2.0 src)


# pf_ring: no TX capture, a minimum of 32768 ring slots, quick mode (one application per ring)
insmod /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/net/pf_ring/pf_ring.ko enable_tx_capture=0 min_num_slots=32768 quick_mode=1

# PF_RING-aware ixgbe: per-port NUMA affinity, multi-queue, and RSS settings
insmod /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko numa_cpu_affinity=0,0 MQ=0,1 RSS=0,0
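
(To sanity-check that the parameters took, something like this should show the module and its ring settings:)

lsmod | grep pf_ring
cat /proc/net/pf_ring/info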

I checked the numa_node entries under /sys/bus/pci/devices to confirm that the interface lives on NUMA node 0.  'lscpu' shows that CPUs 0-7 are on node 0, socket 0, and CPUs 8-15 are on node 1, socket 0.  I figured keeping the 16 RSS queues on the same socket is probably better than having them bounce around.
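
(Roughly, the sort of checks I'm relying on:)

cat /sys/class/net/eth3/device/numa_node   # NUMA node the NIC hangs off
lscpu                                      # CPU-to-node/socket layout
numactl --hardware                         # per-node CPU and memory map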

I’ve disabled a bunch of the ixgbe offloading stuff:


ethtool -K eth3 rx off
ethtool -K eth3 tx off
ethtool -K eth3 sg off
ethtool -K eth3 tso off
ethtool -K eth3 gso off
ethtool -K eth3 gro off
ethtool -K eth3 lro off
ethtool -K eth3 rxvlan off
ethtool -K eth3 txvlan off
ethtool -K eth3 ntuple off
ethtool -K eth3 rxhash off
ethtool -G eth3 rx 32768
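
(For verification, something along the lines of:)

ethtool -k eth3   # show current offload settings
ethtool -g eth3   # show current ring sizes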

I’ve also tuned the stack, per recommendations from SANS:


net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_rmem = 10000000 10000000 10000000
net.ipv4.tcp_wmem = 10000000 10000000 10000000
net.ipv4.tcp_mem = 10000000 10000000 10000000
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.core.netdev_max_backlog = 250000
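
(Assuming these sit in /etc/sysctl.conf, they get reloaded with something like:)

sysctl -p /etc/sysctl.conf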


The node.cfg looks like this:


[manager]
type=manager
host=10.99.99.15

[proxy-1]
type=proxy
host=10.99.99.15

[worker-1]
type=worker
host=10.99.99.15
interface=eth3
lb_method=pf_ring
lb_procs=16
pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
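
(After changes to node.cfg, I push them out with the usual broctl cycle:)

broctl install
broctl restart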


I have a license for ZC, and if I change the interface from eth3 to zc:eth3, it will spawn 16 workers, but only one of them receives any traffic.  I'm assuming it is looking at zc:eth3@0 only; netstats bears that out.  If I run pfcount -i zc:eth3, it shows ~1 Gb/s of traffic arriving on the interface with nothing being dropped.
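
(To see whether traffic is actually being spread across the RSS queues, the individual queues can be checked the same way, e.g.:)

pfcount -i zc:eth3@0
pfcount -i zc:eth3@1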

Am I missing something obvious?  I saw many threads about disabling hyper-threading, but that seems specific to Intel processors; I'm running AMD Opterons with their own HyperTransport, which doesn't create virtual CPUs.

Thanks,
 -Paul

_______________________________________________
Bro mailing list
bro at bro-ids.org
http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro


