[Bro] High-CPU on just a single worker in the cluster

Dave Crawford bro at pingtrip.com
Thu Apr 14 06:55:09 PDT 2016


Here is the frequency distribution (I'm logging in JSON, so the command is different, but the result is the same):

$ zcat 2016-04-14/conn.08\:00\:00-09\:00\:00.log.gz | jq -r '.node' | sort | uniq -c
 
 192470 MID_GLR-1
 192325 MID_GLR-2
 193491 MID_GLR-3
 192444 MID_GLR-4
 192288 MID_GLR-5
 654908 MID_INT-1
 655252 MID_INT-10
 652749 MID_INT-2
 657361 MID_INT-3
 655236 MID_INT-4
 656477 MID_INT-5
 654518 MID_INT-6
 656199 MID_INT-7
 656069 MID_INT-8
 655798 MID_INT-9
 770362 WIN_INT-1
 772679 WIN_INT-10
 773520 WIN_INT-2
 772197 WIN_INT-3
 768066 WIN_INT-4
 771089 WIN_INT-5
 771734 WIN_INT-6
 772599 WIN_INT-7
 772359 WIN_INT-8
 721526 WIN_INT-9
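
For anyone without jq handy, the same per-node tally can be sketched in Python. This is only a rough equivalent of the pipeline above: it assumes the JSON conn.log has one object per line and that the worker name is stored in a field called `node`, as in my setup. The function names `count_by_node` and `print_counts` are made up for this example.

```python
import gzip
import json
from collections import Counter


def count_by_node(path, field="node"):
    """Tally log entries per worker from a gzipped JSON log (one object per line)."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as log:
        for line in log:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Entries without the field are grouped under "(none)" rather than dropped.
            counts[record.get(field, "(none)")] += 1
    return counts


def print_counts(counts):
    """Print counts in a 'count name' layout similar to `sort | uniq -c`."""
    for node in sorted(counts):
        print(f"{counts[node]:>7} {node}")
```

Usage would be something like `print_counts(count_by_node("2016-04-14/conn.08:00:00-09:00:00.log.gz"))`.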


You may be on to something with the non-IP traffic. There is a drastic difference between the two datacenters:

WIN
1460641772.239436 pkts=10414545 kpps=208.2 kbytes=5732528 mbps=938.6 nic_pkts=10414545 nic_drops=0 u=104675 t=3627503 i=307 o=405 nonip=6681655

MID
1460641723.573448 pkts=9553569 kpps=178.9 kbytes=6561123 mbps=1006.6 nic_pkts=9553569 nic_drops=0 u=174140 t=9373195 i=267 o=934 nonip=5033
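
To put a number on that difference, here is a small Python sketch that parses the two capstats lines above and works out what share of the captured packets was counted as non-IP. It assumes only the `key=value` layout visible in the output; the helper names `parse_capstats` and `nonip_share` are invented for this example.

```python
def parse_capstats(line):
    """Split a capstats line into its timestamp and a dict of numeric counters."""
    timestamp, *pairs = line.split()
    stats = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        stats[key] = float(value)
    return float(timestamp), stats


def nonip_share(line):
    """Fraction of captured packets that capstats counted as non-IP."""
    _, stats = parse_capstats(line)
    return stats["nonip"] / stats["pkts"]


# The two lines are copied verbatim from the capstats output above.
win = ("1460641772.239436 pkts=10414545 kpps=208.2 kbytes=5732528 mbps=938.6 "
       "nic_pkts=10414545 nic_drops=0 u=104675 t=3627503 i=307 o=405 nonip=6681655")
mid = ("1460641723.573448 pkts=9553569 kpps=178.9 kbytes=6561123 mbps=1006.6 "
       "nic_pkts=9553569 nic_drops=0 u=174140 t=9373195 i=267 o=934 nonip=5033")

for name, line in (("WIN", win), ("MID", mid)):
    print(f"{name}: {nonip_share(line):.2%} non-IP")
```

By this measure roughly 64% of the packets on the WIN box are non-IP, against about 0.05% on the MID box. In both lines the u, t, i, o and nonip counters add up exactly to pkts, so the breakdown accounts for every captured packet.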


-Dave


> On Apr 14, 2016, at 9:25 AM, Azoff, Justin S <jazoff at illinois.edu> wrote:
> 
>> On Apr 14, 2016, at 8:26 AM, Dave Crawford <bro at pingtrip.com> wrote:
>> 
>> I'm already capturing the node in conn.log but haven't been able to spot anything out of the ordinary compared to the other nodes.
>> 
>> Below is a fresh 'netstats' from this morning (WIN_INT-9 is obviously the culprit).
>> 
>> MID_INT-1: 1460636199.609909 recvd=327807625 dropped=1 link=327807625
>> MID_INT-10: 1460636199.813720 recvd=339804389 dropped=4 link=339804389
>> MID_INT-2: 1460636200.010047 recvd=313901304 dropped=0 link=313901304
>> MID_INT-3: 1460636200.210033 recvd=323507786 dropped=1 link=323507786
>> MID_INT-4: 1460636200.413951 recvd=322338069 dropped=0 link=322338069
>> MID_INT-5: 1460636200.613996 recvd=314681107 dropped=1 link=314681107
>> MID_INT-6: 1460636200.814761 recvd=325488973 dropped=1 link=325488973
>> MID_INT-7: 1460636201.017945 recvd=328830658 dropped=3 link=328830658
>> MID_INT-8: 1460636201.218113 recvd=338250015 dropped=0 link=338250015
>> MID_INT-9: 1460636201.417949 recvd=387979776 dropped=0 link=387979776
>> WIN_INT-1: 1460636288.903341 recvd=142474122 dropped=1 link=142474122
>> WIN_INT-10: 1460636289.103648 recvd=232076131 dropped=1 link=232076131
>> WIN_INT-2: 1460636289.303290 recvd=145451659 dropped=2 link=145451659
>> WIN_INT-3: 1460636289.507242 recvd=182345947 dropped=0 link=182345947
>> WIN_INT-4: 1460636289.707591 recvd=140378820 dropped=1 link=140378820
>> WIN_INT-5: 1460636289.911410 recvd=140342198 dropped=0 link=140342198
>> WIN_INT-6: 1460636290.111178 recvd=138961706 dropped=0 link=138961706
>> WIN_INT-7: 1460636290.315433 recvd=198792251 dropped=0 link=198792251
>> WIN_INT-8: 1460636290.515158 recvd=170824302 dropped=3 link=170824302
>> WIN_INT-9: 1460636287.108095 recvd=2414368833 dropped=438939600 link=2414368833
>> 
> 
> Ahh, well, that last worker is seeing about 1.6 times as many packets as all of the other workers on that host combined, so that explains the CPU usage.
> 
> What does a frequency distribution of the node column from your conn.log from around that time show?
> 
>    zcat conn.....gz | bro-cut node | sort | uniq -c
> 
> Adding this to local.bro
> 
>    @load misc/stats
> 
> if you don't have it already, will give you a stats.log that contains some helpful information too.
> 
> There's probably something going on traffic-wise. If I had to guess, the WIN box is seeing a ton of non-IP traffic that all gets load balanced to the same worker.
> 
> Can you run capstats on the two boxes and compare the output?
> 
>    # capstats -i p1p1 -I 2
>    1460639895.723665 pkts=6642 kpps=2.8 kbytes=2100 mbps=7.2 nic_pkts=6642 nic_drops=0 u=0 t=0 i=0 o=0 nonip=6642
> 
> I just noticed that capstats doesn't properly handle VLAN-encapsulated packets, so all of our traffic shows up as nonip. I'll look into fixing that, but if you are not using VLANs, looking at the breakdown of udp,tcp,ip,other,nonip (u,t,i,o,nonip) would help.
> 
> -- 
> - Justin Azoff
