[Bro] Bro cluster requirements and manager logging backlog bug
Hovsep Levi
hovsep.sanjay.levi at gmail.com
Mon Dec 19 13:26:17 PST 2016
Hello all,
We are still having a problem with our Bro cluster and logging. During
peak times the manager will slowly consume all available memory while the
logs sent to disk are delayed by an hour or more.
Does anyone know the official bug ID for this in
bro-tracker.atlassian.net?
I've tracked this problem for a while now and tried all variations of the
proposed fixes: the flare patch, the no-flare patch, a segmented cluster with
one manager per box, and an architecture change from Linux+PF_RING to
FreeBSD+Myricom. Currently we are using a standard build of bro-2.5-beta
in a cluster configuration with one dedicated manager and three dedicated
sensors, each using both ports of a Myricom card with 22 workers attached
to each port (1 manager, 1 logger, 12 proxies, and 6 worker nodes with 22
procs each, 132 total).
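For reference, the worker counts quoted above follow directly from the hardware layout. A minimal sketch of the arithmetic; the variable names are illustrative only and are not Bro or broctl configuration keys:

```python
# Topology figures taken from the description above.
sensors = 3              # dedicated sensor boxes
ports_per_sensor = 2     # both ports of each Myricom card are used
workers_per_port = 22    # Bro worker processes attached to each port

# Each port is treated as one worker node.
worker_nodes = sensors * ports_per_sensor
worker_procs = worker_nodes * workers_per_port

print(worker_nodes, worker_procs)
```

All 132 worker processes feed the single manager and logger, so fewer, faster workers would reduce the number of peers but not necessarily the total log volume.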
Restarting the cluster on a regular basis is much easier without PF_RING,
but that only partially treats the symptom. The last proposed solution,
using faster CPUs so that fewer workers are needed, is also the most
expensive. But will that really solve the problem? I'm more interested in
defining what the problem actually is.
FWIW there's some text below to illustrate. The dates are somewhat old but
it's still a representative example.
21:05 UTC
- Manager node is nearly out of memory, 2800 MB left
- Workers have moderate CPU usage, 60%
- Logs on manager node are 25 minutes behind (21:05 vs 20:40)
- Initiated cluster restart at 21:06, completed at 21:11.
21:26 UTC
- Workers have moderate CPU usage.
- Logs are 16 minutes behind
Earlier the logs were roughly two hours behind.
[bro at mgr /opt/bro]$ date -r 1471373408 (most recent conn.log timestamp)
Tue Aug 16 18:50:08 UTC 2016
[bro at mgr /opt/bro]$ date
Tue Aug 16 20:43:45 UTC 2016
Bro manager process is using 70G of memory and the system is swapping:
last pid: 96557;  load averages: 46.37, 53.09, 54.88    up 0+18:06:24  21:25:17
55 processes: 8 running, 47 sleeping
CPU: 7.7% user, 2.1% nice, 68.1% system, 0.2% interrupt, 21.9% idle
Mem: 103G Active, 2412M Inact, 19G Wired, 549M Cache, 331M Free
ARC: 15G Total, 89M MFU, 15G MRU, 29M Anon, 68M Header, 211M Other
Swap: 12G Total, 12G Used, 85M Free, 99% Inuse, 9248K In
  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 7305 bro        34  20    0 40121M 39498M uwait  10  31.7H 280.27% bro
 7337 bro         1  96    5 70653M 61577M CPU36  36 868:45  59.96% bro
Currently, in this state, the logs are over two hours behind the current time.
bro at mgr:~ % date -r 1471374952 (most recent conn.log timestamp)
Tue Aug 16 19:15:52 UTC 2016
bro at mgr:~ % date
Tue Aug 16 21:27:04 UTC 2016
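The lag figures above are simply the wall-clock time minus the newest conn.log timestamp. A minimal sketch of that calculation, assuming the default tab-separated ASCII log format in which `ts` is the first column and metadata lines begin with `#`. The function names are hypothetical helpers, not part of Bro or broctl:

```python
from datetime import datetime, timezone

def newest_ts(conn_log_lines):
    """Return the largest epoch timestamp among data lines, or None if empty."""
    newest = None
    for line in conn_log_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and '#' metadata lines
        ts = float(line.split("\t", 1)[0])  # assumes ts is the first column
        if newest is None or ts > newest:
            newest = ts
    return newest

def lag_seconds(newest_epoch, now):
    """Seconds between 'now' (an aware datetime) and the newest log entry."""
    return int(now.timestamp() - newest_epoch)

def fmt(seconds):
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"

# The two measurements quoted in this message.
first = lag_seconds(1471373408,
                    datetime(2016, 8, 16, 20, 43, 45, tzinfo=timezone.utc))
second = lag_seconds(1471374952,
                     datetime(2016, 8, 16, 21, 27, 4, tzinfo=timezone.utc))
print(fmt(first))   # 1h53m37s
print(fmt(second))  # 2h11m12s
```

The second sample was taken after the 21:06 restart, yet the lag is larger than in the first, which matches the observation that restarting only partially relieves the symptom.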
Memory usage over the past week:
A non-text attachment was scrubbed...
Name: memory-week.png
Type: image/png
Size: 28315 bytes
Desc: not available
Url : http://mailman.ICSI.Berkeley.EDU/pipermail/bro/attachments/20161219/ff9a0e58/attachment-0001.bin