[Bro] High-CPU on just a single worker in the cluster
Azoff, Justin S
jazoff at illinois.edu
Thu Apr 14 06:25:37 PDT 2016
> On Apr 14, 2016, at 8:26 AM, Dave Crawford <bro at pingtrip.com> wrote:
> I'm already capturing the node in conn.log but haven't been able to spot anything out of the ordinary compared to the other nodes.
> Below is a fresh 'netstats' from this morning (WIN_INT-9 is obviously the culprit).
> MID_INT-1: 1460636199.609909 recvd=327807625 dropped=1 link=327807625
> MID_INT-10: 1460636199.813720 recvd=339804389 dropped=4 link=339804389
> MID_INT-2: 1460636200.010047 recvd=313901304 dropped=0 link=313901304
> MID_INT-3: 1460636200.210033 recvd=323507786 dropped=1 link=323507786
> MID_INT-4: 1460636200.413951 recvd=322338069 dropped=0 link=322338069
> MID_INT-5: 1460636200.613996 recvd=314681107 dropped=1 link=314681107
> MID_INT-6: 1460636200.814761 recvd=325488973 dropped=1 link=325488973
> MID_INT-7: 1460636201.017945 recvd=328830658 dropped=3 link=328830658
> MID_INT-8: 1460636201.218113 recvd=338250015 dropped=0 link=338250015
> MID_INT-9: 1460636201.417949 recvd=387979776 dropped=0 link=387979776
> WIN_INT-1: 1460636288.903341 recvd=142474122 dropped=1 link=142474122
> WIN_INT-10: 1460636289.103648 recvd=232076131 dropped=1 link=232076131
> WIN_INT-2: 1460636289.303290 recvd=145451659 dropped=2 link=145451659
> WIN_INT-3: 1460636289.507242 recvd=182345947 dropped=0 link=182345947
> WIN_INT-4: 1460636289.707591 recvd=140378820 dropped=1 link=140378820
> WIN_INT-5: 1460636289.911410 recvd=140342198 dropped=0 link=140342198
> WIN_INT-6: 1460636290.111178 recvd=138961706 dropped=0 link=138961706
> WIN_INT-7: 1460636290.315433 recvd=198792251 dropped=0 link=198792251
> WIN_INT-8: 1460636290.515158 recvd=170824302 dropped=3 link=170824302
> WIN_INT-9: 1460636287.108095 recvd=2414368833 dropped=438939600 link=2414368833
Ahh, well, that last worker is seeing more packets than all of the other workers on that host combined (roughly 1.6 times as many), so that explains the CPU usage.
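To put numbers on that comparison, here is a minimal Python sketch that parses netstats lines in the format quoted above and reports, per worker, its share of the host's received packets and a simple drop ratio. The helper names (parse_netstats, summarize) are made up for illustration, grouping workers into hosts by the name prefix before the last "-" is an assumption based on the node names in this thread, and dropped/link is just a convenient ratio, not an official Bro metric.

```python
import re
from collections import defaultdict

# Matches one netstats line, with or without a leading "> " quote marker, e.g.
# "WIN_INT-9: 1460636287.108095 recvd=2414368833 dropped=438939600 link=2414368833"
LINE_RE = re.compile(
    r"(?P<node>[\w-]+):\s+[\d.]+\s+"
    r"recvd=(?P<recvd>\d+)\s+dropped=(?P<dropped>\d+)\s+link=(?P<link>\d+)"
)


def parse_netstats(text):
    """Return {node: (recvd, dropped, link)} for every line that looks like netstats output."""
    stats = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            stats[m.group("node")] = (
                int(m.group("recvd")),
                int(m.group("dropped")),
                int(m.group("link")),
            )
    return stats


def summarize(stats):
    """Return {node: (share_of_host_recvd, drop_ratio)}.

    Assumption: the host is the node name up to the last '-', so
    WIN_INT-1 ... WIN_INT-10 are all grouped under WIN_INT.
    """
    host_total = defaultdict(int)
    for node, (recvd, _dropped, _link) in stats.items():
        host_total[node.rsplit("-", 1)[0]] += recvd

    summary = {}
    for node, (recvd, dropped, link) in stats.items():
        host = node.rsplit("-", 1)[0]
        share = recvd / host_total[host] if host_total[host] else 0.0
        drop_ratio = dropped / link if link else 0.0
        summary[node] = (share, drop_ratio)
    return summary
```

Fed the netstats output quoted above, this should show WIN_INT-9 taking roughly 62% of the WIN host's received packets and dropping about 18% of them, while every other worker drops next to nothing.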
What does a frequency distribution of the node column from your conn.log from around that time show?
zcat conn.....gz | bro-cut node | sort | uniq -c
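If bro-cut is not handy, the same count can be sketched in Python. This assumes the usual tab-separated Bro log layout with a "#fields" header line, and that the extra column is literally called node as in the command above; count_column and count_column_in_file are hypothetical helper names, not part of Bro.

```python
import gzip
from collections import Counter


def count_column(lines, column="node"):
    """Count the values of one column in a Bro TSV log.

    The column position is looked up from the '#fields' header line;
    all other '#' lines (separator, types, open/close) are skipped.
    """
    index = None
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            # Header looks like "#fields<TAB>ts<TAB>uid<TAB>..."
            index = line.split("\t")[1:].index(column)
            continue
        if not line or line.startswith("#") or index is None:
            continue
        counts[line.split("\t")[index]] += 1
    return counts


def count_column_in_file(path, column="node"):
    """Same as count_column, but reads a gzip-compressed log file."""
    with gzip.open(path, "rt") as fh:
        return count_column(fh, column)
```

Calling count_column_in_file on the rotated conn log and then most_common() on the result gives the same per-node tally as the sort | uniq -c pipeline, ordered by count.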
Adding this to local.bro, if you don't have it already, will give you a stats.log, which will contain some helpful information too.
There's probably something going on traffic-wise. If I had to guess, the WIN box is seeing a ton of non-IP traffic that all gets load balanced to the same worker.
Can you run capstats on the two boxes and compare the output?
# capstats -i p1p1 -I 2
1460639895.723665 pkts=6642 kpps=2.8 kbytes=2100 mbps=7.2 nic_pkts=6642 nic_drops=0 u=0 t=0 i=0 o=0 nonip=6642
I just noticed that capstats doesn't properly handle VLAN-encapsulated packets, so all of our traffic shows up as nonip. I'll look into fixing that, but if you are not using VLANs, looking at the breakdown of udp, tcp, icmp, other, nonip (u, t, i, o, nonip) would help.
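For comparing the two boxes, a small Python sketch like the following can turn a capstats line into per-category fractions. It only assumes the key=value layout shown in the sample output above; parse_capstats and protocol_shares are hypothetical names, and the shares are taken relative to the sum of the u, t, i, o and nonip counters rather than pkts.

```python
# Counters that make up the per-protocol breakdown in a capstats line.
BREAKDOWN_KEYS = ("u", "t", "i", "o", "nonip")


def parse_capstats(line):
    """Split one capstats line into (timestamp, {key: number})."""
    parts = line.split()
    timestamp = float(parts[0])
    fields = {}
    for part in parts[1:]:
        key, _sep, value = part.partition("=")
        fields[key] = float(value)
    return timestamp, fields


def protocol_shares(fields):
    """Return each breakdown counter as a fraction of their combined total."""
    total = sum(fields.get(key, 0.0) for key in BREAKDOWN_KEYS)
    if total == 0:
        return {key: 0.0 for key in BREAKDOWN_KEYS}
    return {key: fields.get(key, 0.0) / total for key in BREAKDOWN_KEYS}
```

Running both functions over the output from each box makes it easy to see whether one of them has a much larger nonip fraction than the other.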
- Justin Azoff