[Bro] Bro Packet Loss / 10gb ixgbe / pf_ring

Nash, Paul Paul.Nash at tufts.edu
Thu Jan 7 13:37:18 PST 2016


I’m trying to debug some packet drops I’m experiencing and am turning to the list for help.  The loss is ~50-70% at times, and it shows up both in broctl’s netstats and in the notice.log file.

Running netstats right after startup shows that I’m dropping more than I’m receiving from the very start:


[BroControl] > netstats

 worker-1-1: 1452200459.635155 recvd=734100 dropped=1689718 link=2424079

worker-1-10: 1452200451.830143 recvd=718461 dropped=1414234 link=718461

worker-1-11: 1452200460.036766 recvd=481010 dropped=2019289 link=2500560

worker-1-12: 1452200460.239585 recvd=720895 dropped=1805574 link=2526730

worker-1-13: 1452200460.440611 recvd=753365 dropped=1800827 link=2554453

worker-1-14: 1452200460.647368 recvd=784145 dropped=1800831 link=2585237

worker-1-15: 1452200460.844842 recvd=750921 dropped=1868186 link=2619368

worker-1-16: 1452200461.049237 recvd=742718 dropped=1908528 link=2651507

…

System information:
- 64-core AMD Opteron system
- 128 GB of RAM
- Intel 10 GbE ixgbe interface (dual 10 Gb ports; eth3 is the sniffing interface)
- Licensed copy of PF_RING ZC

I’m running Bro 2.4.1 and PF_RING 6.2.0 on CentOS with a 2.6.32-411 kernel.

The manager, proxy, and 16 workers all run on the same system, with the 16 worker processes pinned to CPUs 0-15.

My startup scripts load the kernel modules (built from the PF_RING 6.2.0 source):


insmod /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/net/pf_ring/pf_ring.ko enable_tx_capture=0 min_num_slots=32768 quick_mode=1

insmod  /lib/modules/2.6.32-431.11.2.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko numa_cpu_affinity=0,0 MQ=0,1 RSS=0,0
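For reference, here is roughly how I verify that the modules actually loaded with the parameters I intended (standard proc/sysfs entries; the exact paths are from memory, so treat them as an assumption):

# PF_RING version, slot count, and number of rings currently in use
cat /proc/net/pf_ring/info

# module parameters as the kernel sees them
cat /sys/module/pf_ring/parameters/min_num_slots
cat /sys/module/pf_ring/parameters/quick_mode

# confirm the PF_RING-aware ixgbe build is the driver bound to eth3
ethtool -i eth3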

I checked the numa_node entry under /sys/bus/pci/devices to confirm that the interface sits on NUMA node 0.  ‘lscpu’ shows that CPUs 0-7 are on node 0, socket 0, and CPUs 8-15 are on node 1, socket 0.  I figured having the 16 RSS queues on the same socket is probably better than having them bounce around.
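The checks themselves are just the usual sysfs/lscpu queries (a sketch; this assumes eth3’s PCI device exposes a numa_node entry, which it should on this kernel):

# NUMA node the NIC hangs off of (-1 would mean unknown)
cat /sys/class/net/eth3/device/numa_node

# CPU-to-node layout
lscpu | grep -i numa
numactl --hardware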

I’ve disabled a bunch of the ixgbe offloading stuff:


ethtool -K eth3 rx off

ethtool -K eth3 tx off

ethtool -K eth3 sg off

ethtool -K eth3 tso off

ethtool -K eth3 gso off

ethtool -K eth3 gro off

ethtool -K eth3 lro off

ethtool -K eth3 rxvlan off

ethtool -K eth3 txvlan off

ethtool -K eth3 ntuple off

ethtool -K eth3 rxhash off

ethtool -G eth3 rx 32768
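To confirm those settings stuck, the queries look roughly like this (plain ethtool; counter names in the last one vary by driver, so that grep is a guess):

# offload settings: everything above should now read "off"
ethtool -k eth3

# current vs. maximum RX ring size
ethtool -g eth3

# driver-level drop/miss counters, to compare against Bro's numbers
ethtool -S eth3 | grep -i -E 'drop|miss|err'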

I’ve also tuned the stack, per recommendations from SANS:


net.ipv4.tcp_timestamps = 0

net.ipv4.tcp_sack = 0

net.ipv4.tcp_rmem = 10000000 10000000 10000000

net.ipv4.tcp_wmem = 10000000 10000000 10000000

net.ipv4.tcp_mem = 10000000 10000000 10000000

net.core.rmem_max = 134217728

net.core.wmem_max = 134217728

net.core.netdev_max_backlog = 250000
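Applying and spot-checking those is just the standard sysctl routine (a sketch, assuming the values are persisted in /etc/sysctl.conf):

# load everything from /etc/sysctl.conf
sysctl -p

# or set a single value on the fly and read it back
sysctl -w net.core.netdev_max_backlog=250000
sysctl net.core.netdev_max_backlog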


The node.cfg looks like this:


[manager]

type=manager

host=10.99.99.15


#

[proxy-1]

type=proxy

host=10.99.99.15


#

[worker-1]

type=worker

host=10.99.99.15

interface=eth3

lb_method=pf_ring

lb_procs=16

pin_cpus=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15


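A quick way to confirm the workers came up and the CPU pinning took effect (a sketch; taskset shows the affinity broctl applied, and the PID placeholder is mine):

# cluster processes up and running?
broctl status

# affinity of one worker process (substitute a real worker PID)
taskset -cp <worker-pid>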
I have a license for ZC, and if I change the interface from eth3 to zc:eth3, broctl will spin up 16 workers, but only one of them receives any traffic.  I’m assuming that everything is looking at zc:eth3@0 only; netstats bears that out.  If I run pfcount -i zc:eth3, it shows ~1 Gbit/s of traffic arriving on the interface with nothing dropped.
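For what it’s worth, the per-queue view can be checked directly (the zc:eth3@N syntax is my reading of the PF_RING docs, so treat it as an assumption):

# which device/queue each Bro worker actually bound to
ls /proc/net/pf_ring/

# sanity-check individual ZC queues
pfcount -i zc:eth3@0
pfcount -i zc:eth3@1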

Am I missing something obvious?  I saw many threads about disabling Hyper-Threading, but that seems specific to Intel processors; I’m running AMD Opterons, which have their own HyperTransport technology that doesn’t create virtual CPUs.

Thanks,
 -Paul