[Bro] High-CPU on just a single worker in the cluster

Thu Apr 14 05:26:43 PDT 2016

I'm already capturing the node in conn.log but haven't been able to spot anything out of the ordinary compared to the other nodes.

Below is a fresh 'netstats' from this morning (WIN_INT-9 is obviously the culprit).

  MID_INT-1: 1460636199.609909 recvd=327807625 dropped=1 link=327807625
 MID_INT-10: 1460636199.813720 recvd=339804389 dropped=4 link=339804389
  MID_INT-2: 1460636200.010047 recvd=313901304 dropped=0 link=313901304
  MID_INT-3: 1460636200.210033 recvd=323507786 dropped=1 link=323507786
  MID_INT-4: 1460636200.413951 recvd=322338069 dropped=0 link=322338069
  MID_INT-5: 1460636200.613996 recvd=314681107 dropped=1 link=314681107
  MID_INT-6: 1460636200.814761 recvd=325488973 dropped=1 link=325488973
  MID_INT-7: 1460636201.017945 recvd=328830658 dropped=3 link=328830658
  MID_INT-8: 1460636201.218113 recvd=338250015 dropped=0 link=338250015
  MID_INT-9: 1460636201.417949 recvd=387979776 dropped=0 link=387979776
  WIN_INT-1: 1460636288.903341 recvd=142474122 dropped=1 link=142474122
 WIN_INT-10: 1460636289.103648 recvd=232076131 dropped=1 link=232076131
  WIN_INT-2: 1460636289.303290 recvd=145451659 dropped=2 link=145451659
  WIN_INT-3: 1460636289.507242 recvd=182345947 dropped=0 link=182345947
  WIN_INT-4: 1460636289.707591 recvd=140378820 dropped=1 link=140378820
  WIN_INT-5: 1460636289.911410 recvd=140342198 dropped=0 link=140342198
  WIN_INT-6: 1460636290.111178 recvd=138961706 dropped=0 link=138961706
  WIN_INT-7: 1460636290.315433 recvd=198792251 dropped=0 link=198792251
  WIN_INT-8: 1460636290.515158 recvd=170824302 dropped=3 link=170824302
  WIN_INT-9: 1460636287.108095 recvd=2414368833 dropped=438939600 link=2414368833
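
For anyone who wants to compare the counters above without eyeballing them, here is a minimal Python sketch that parses lines in this format and flags workers with an unusually high dropped/recvd ratio. The function names, the 1% threshold, and the use of dropped/recvd as the loss metric are all assumptions made for illustration; broctl itself may compute loss differently.

```python
import re

# One line of netstats output as pasted above, e.g.
#   WIN_INT-9: 1460636287.108095 recvd=2414368833 dropped=438939600 link=2414368833
LINE_RE = re.compile(
    r"^\s*(?P<worker>\S+):\s+(?P<ts>\d+\.\d+)\s+"
    r"recvd=(?P<recvd>\d+)\s+dropped=(?P<dropped>\d+)\s+link=(?P<link>\d+)\s*$"
)


def parse_netstats(text):
    """Return {worker: (recvd, dropped, link)} for every line that matches."""
    stats = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            stats[m.group("worker")] = (
                int(m.group("recvd")),
                int(m.group("dropped")),
                int(m.group("link")),
            )
    return stats


def drop_ratio(recvd, dropped):
    """dropped / recvd: an assumed metric, not necessarily what broctl reports."""
    return dropped / recvd if recvd else 0.0


def flag_outliers(stats, threshold=0.01):
    """Workers whose dropped/recvd ratio exceeds the (arbitrary) threshold."""
    return sorted(w for w, (r, d, _) in stats.items() if drop_ratio(r, d) > threshold)


# Three lines copied from the output above.
SAMPLE = """\
  MID_INT-9: 1460636201.417949 recvd=387979776 dropped=0 link=387979776
 WIN_INT-10: 1460636289.103648 recvd=232076131 dropped=1 link=232076131
  WIN_INT-9: 1460636287.108095 recvd=2414368833 dropped=438939600 link=2414368833
"""

if __name__ == "__main__":
    stats = parse_netstats(SAMPLE)
    for worker in flag_outliers(stats):
        recvd, dropped, _ = stats[worker]
        ratio = drop_ratio(recvd, dropped)
        print(f"{worker}: recvd={recvd} dropped={dropped} ratio={ratio:.2%}")
```

Run against the full listing, this should single out WIN_INT-9 only. Note that its recvd counter is also more than ten times that of any other WIN_INT worker, which may be as telling as the drops.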

> On Apr 13, 2016, at 9:43 PM, Azoff, Justin S <jazoff at illinois.edu> wrote:
> 
> Can you load this script that will add a node column to the conn.log that says which node handled that connection:
> 
> https://github.com/broala/bro-snippets/blob/master/add-node-to-conn.bro
> 
> Also, the output of 'broctl netstats' would be useful to see.
> 
> 
> -- 
> - Justin Azoff
> 
>> On Apr 13, 2016, at 7:03 PM, Dave Crawford <bro at pingtrip.com> wrote:
>> 
>> I'm in the process of trying to debug an odd high-cpu issue and looking for guidance.
>> 
>> The deployment is as follows:
>> - Cluster has two nodes, each with 10 workers, and the workers are pinned to specific CPU cores.
>> - x520 with PF_RING
>> - Traffic to each node is load balanced equally
>> 
>> The issue is that one worker on one of the nodes is always at 100% CPU while all other workers are around 50%. If I restart Bro a different worker will pin to 100%, but always on the same node.
>> 
>> I ran 'strace' on both a "bad" and "good" worker and one anomaly I spotted was that the "bad" worker never called 'nanosleep', whereas the "good" worker had about 84,000 'nanosleep' calls in the same amount of time.
>> 
>> I'm wondering if it's possible for a queue to go bad on the x520, which might explain why it's a random worker on the same node after restarting.
>> 
>> Is there a way to determine which x520 queue a specific worker is reading from? 
>> 
>> Thanks,
>> -Dave
>> 
>> 
>