[Bro] Bro cluster requirements and manager logging backlog bug

Hovsep Levi hovsep.sanjay.levi at gmail.com
Mon Dec 19 13:26:17 PST 2016


Hello all,


We are still having a problem with our Bro cluster and logging.  During
peak times the manager will slowly consume all available memory while the
logs sent to disk are delayed by an hour or more.

Does anyone know the official bug ID for this within
bro-tracker.atlassian.net ?

I've tracked this problem for a while now and tried all variations of the
proposed fixes: the flare patch, the no-flare patch, segmented cluster with
one manager per box, and an architecture change from Linux+PF_RING to
FreeBSD+Myricom.  Currently we are using a standard build of bro-2.5-beta
in a cluster configuration with one dedicated manager and three dedicated
sensors, each using both ports of a Myricom card with 22 workers attached
to each port.  In total: 1 manager, 1 logger, 12 proxies, and 6 worker nodes
(22 procs each, 132 total).
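
For reference, the worker totals quoted above follow from the sensor layout. This is only a tally sketch; the variable names are illustrative and not taken from any Bro configuration file.

```python
# Layout as described in the post: three sensors, two Myricom ports per
# sensor, 22 worker processes attached to each port.
SENSORS = 3
PORTS_PER_SENSOR = 2
WORKERS_PER_PORT = 22

# Each port is counted as one "worker node" in the summary.
worker_nodes = SENSORS * PORTS_PER_SENSOR
worker_procs = worker_nodes * WORKERS_PER_PORT

print(worker_nodes, worker_procs)  # 6 132
```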

Restarting the cluster on a regular basis is much easier without PF_RING,
but that only partially treats the symptom.  The last proposed solution,
faster CPUs to reduce the worker count, is also the most expensive.  But
will that really solve the problem?  I'm more interested in defining what
the problem actually is.


FWIW, here are some notes to illustrate.  The dates are somewhat old, but
it's still a representative example.

21:05 UTC
- Manager node is nearly out of memory, 2800 MB left
- Workers have moderate CPU usage, 60%
- Logs on manager node are 25 minutes behind (21:05 vs 20:40)
- Initiated cluster restart at 21:06, completed at 21:11.

21:26 UTC
- Workers have moderate CPU usage.
- Logs are 16 minutes behind



Earlier the logs were roughly two hours behind.

[bro at mgr /opt/bro]$ date -r 1471373408  (most recent conn.log timestamp)
Tue Aug 16 18:50:08 UTC 2016

[bro at mgr /opt/bro]$ date
Tue Aug 16 20:43:45 UTC 2016
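
The lag implied by the two `date` outputs above can be computed directly. This is a minimal sketch, assuming the epoch value really is the newest conn.log timestamp; the function name `log_lag` is hypothetical and not part of Bro or BroControl.

```python
from datetime import datetime, timedelta, timezone


def log_lag(last_log_epoch: int, now: datetime) -> timedelta:
    """Return how far the newest log entry trails the given wall-clock time."""
    last_log = datetime.fromtimestamp(last_log_epoch, tz=timezone.utc)
    return now - last_log


# Values copied from the two `date` invocations above.
now = datetime(2016, 8, 16, 20, 43, 45, tzinfo=timezone.utc)
lag = log_lag(1471373408, now)
print(lag)  # 1:53:37
```

Passing `datetime.now(timezone.utc)` as `now` would give the live lag, which could be polled to catch the backlog before the manager starts swapping.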



Bro manager process is using 70G of memory and the system is swapping:

last pid: 96557;  load averages: 46.37, 53.09, 54.88    up 0+18:06:24  21:25:17
55 processes:  8 running, 47 sleeping
CPU:  7.7% user,  2.1% nice, 68.1% system,  0.2% interrupt, 21.9% idle
Mem: 103G Active, 2412M Inact, 19G Wired, 549M Cache, 331M Free
ARC: 15G Total, 89M MFU, 15G MRU, 29M Anon, 68M Header, 211M Other
Swap: 12G Total, 12G Used, 85M Free, 99% Inuse, 9248K In

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 7305 bro          34  20    0 40121M 39498M uwait  10  31.7H 280.27% bro
 7337 bro           1  96    5 70653M 61577M CPU36  36 868:45  59.96% bro



Currently, in this state, the logs are over two hours behind the current time.

bro at mgr:~ % date -r 1471374952  (most recent conn.log timestamp)
Tue Aug 16 19:15:52 UTC 2016

bro at mgr:~ % date
Tue Aug 16 21:27:04 UTC 2016



Memory usage over the past week:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: memory-week.png
Type: image/png
Size: 28315 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20161219/ff9a0e58/attachment-0001.bin 

