[Bro] Bro's limitations with high worker count and memory exhaustion

Jan Grashofer jan.grashofer at cern.ch
Wed Jul 1 12:27:47 PDT 2015


You are not trying to run 140 workers on a single machine with 64GB memory, right?



________________________________
From: Baxter Milliwew [baxter.milliwew at gmail.com]
Sent: Wednesday, July 01, 2015 20:39
To: Siwek, Jon
Cc: Jan Grashofer; bro at bro.org
Subject: Re: [Bro] Bro's limitations with high worker count and memory exhaustion

Do you think a high worker count with the current implementation of select() would cause high memory usage?

I'm trying to figure out why the manager always exhausts all memory:


top - 18:36:13 up 1 day, 14:42,  1 user,  load average: 12.67, 10.83, 10.95
Tasks: 606 total,   5 running, 601 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.3 us,  6.4 sy,  1.3 ni, 76.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem:  65939412 total, 65251768 used,   687644 free,    43248 buffers
KiB Swap: 67076092 total, 54857880 used, 12218212 free.  4297048 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
35046 logstash  20   0 10.320g 511600   3784 S 782.1  0.8   5386:34 java
 9925 bro       25   5 97.504g 0.045t   1508 R  99.7 73.8 814:58.88 bro
 9906 bro       20   0 22.140g 3.388g   3784 S  73.2  5.4   1899:18 bro
 2509 root      20   0  308440  44064    784 R  48.5  0.1   1029:56 redis-server
 2688 bro       30  10    4604   1440   1144 R  44.8  0.0   0:00.49 gzip
  180 root      20   0       0      0      0 S   8.2  0.0   4:26.54 ksoftirqd/8
 2419 debug     20   0   25376   3564   2600 R   7.3  0.0   0:00.76 top
 2689 logstash  20   0       8      4      0 R   5.5  0.0   0:00.06 bro


On Tue, Jun 30, 2015 at 11:37 AM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
Thanks.  Some limited reading suggests it's not possible to increase FD_SETSIZE on Linux, so it's time to migrate to poll().
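
For reference, a poll()-based wait has no FD_SETSIZE ceiling because the caller passes an array of struct pollfd sized to however many descriptors it has. The following is only a minimal sketch of that idea, not Bro code and not a proposed patch: wait_readable() and demo_pipe_ready() are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper, not Bro code: wait until any of the n descriptors
 * in fds is readable or has hung up. Ready descriptors are copied to
 * ready_out (capacity n). Returns the number copied, 0 on timeout (or if
 * only error events fired), -1 on failure. */
int wait_readable(const int *fds, size_t n, int timeout_ms, int *ready_out)
{
    assert(fds != NULL && ready_out != NULL);

    /* One pollfd per descriptor; the array grows with n, unlike fd_set. */
    struct pollfd *pfds = calloc(n ? n : 1, sizeof *pfds);
    if (pfds == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
    }

    int rc = poll(pfds, (nfds_t)n, timeout_ms);
    int count = 0;
    if (rc > 0) {
        for (size_t i = 0; i < n; i++)
            if (pfds[i].revents & (POLLIN | POLLHUP))
                ready_out[count++] = pfds[i].fd;
    }
    free(pfds);
    return rc < 0 ? -1 : count;
}

/* Tiny self-check: a pipe with one byte written must be reported ready.
 * Returns 1 on success, -1 otherwise. */
int demo_pipe_ready(void)
{
    int p[2], ready[1];
    if (pipe(p) != 0)
        return -1;
    if (write(p[1], "x", 1) != 1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    int n = wait_readable(&p[0], 1, 1000, ready);
    int ok = (n == 1 && ready[0] == p[0]);
    close(p[0]);
    close(p[1]);
    return ok ? 1 : -1;
}
```

The descriptor values themselves can be arbitrarily large here; only the process's nofile limit bounds them.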



On Tue, Jun 30, 2015 at 7:44 AM, Siwek, Jon <jsiwek at illinois.edu> wrote:
A guess is that you’re bumping into an FD_SETSIZE limit: the way remote I/O is currently structured has at least 5 file descriptors per remote connection from what I can see at a glance (a pair of pipes, 2 fds each, for signaling read/write readiness related to ChunkedIO and one fd for the actual socket).  Typically, FD_SETSIZE is 1024, so with ~150-200 remote connections and 5 fds per connection plus whatever other descriptors Bro may need to have open (e.g. for file I/O), it seems reasonable to guess that’s the problem.  But you could easily verify with some code modifications to check whether the FD_SET call is using an fd >= FD_SETSIZE.
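
The verification described above could look roughly like the following. This is a minimal sketch under assumptions, not the actual code in FD_Set.h: checked_fd_set() and estimated_fds() are hypothetical names, and the real call site would need its own error reporting.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical wrapper, not part of Bro: add fd to set only if select()
 * can represent it. Returns 0 on success, -1 if fd is out of range. */
int checked_fd_set(int fd, fd_set *set)
{
    assert(set != NULL);
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;  /* FD_SET here would touch memory outside the fd_set */
    FD_SET(fd, set);
    return 0;
}

/* Rough descriptor budget: peers * fds_per_peer + other open descriptors. */
int estimated_fds(int peers, int fds_per_peer, int other_fds)
{
    return peers * fds_per_peer + other_fds;
}
```

Using the numbers from this thread, estimated_fds(170, 5, 0) gives 850 for 15 proxies plus 155 workers, which leaves 174 descriptors below 1024 for log files and everything else. Since 5 per connection is a lower bound, the real total may well be higher.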

Other than making involved code changes to Bro (e.g. to move away from select() for I/O event handling), the only suggestions I have are 1) reducing the number of remote connections, or 2) seeing if you can increase FD_SETSIZE via preprocessor definitions or CFLAGS/CXXFLAGS when running ./configure (I’ve never done this myself to know if it works, but I’ve googled around before and think the implication was that it may work on Linux).
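
Before experimenting with build flags, it may be worth checking what the platform actually provides. The sketch below is an assumption-laden illustration, not Bro code: it reports how many descriptors an fd_set can hold and whether the process's nofile soft limit allows descriptors beyond FD_SETSIZE. All three function names are hypothetical. On glibc, as far as the select(2) manual page indicates, fd_set is a fixed-size type, so redefining FD_SETSIZE at build time is unlikely to enlarge it there.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/resource.h>
#include <sys/select.h>

/* Number of descriptors an fd_set can actually hold, in bits. */
size_t fd_set_capacity_bits(void)
{
    return sizeof(fd_set) * CHAR_BIT;
}

/* Store the soft RLIMIT_NOFILE limit in *out.
 * Returns 0 on success, -1 on error. */
int nofile_soft_limit(rlim_t *out)
{
    struct rlimit rl;
    assert(out != NULL);
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}

/* Returns 1 if the process may open descriptors numbered at or above
 * FD_SETSIZE (which select() cannot track), 0 if not, -1 on error. */
int select_may_overflow(void)
{
    rlim_t lim;
    if (nofile_soft_limit(&lim) != 0)
        return -1;
    return lim > (rlim_t)FD_SETSIZE;
}
```

If fd_set_capacity_bits() still equals the default FD_SETSIZE after rebuilding with a larger value, the flag had no effect on the type.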

- Jon

> On Jun 29, 2015, at 6:22 PM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
>
> The manager still crashes.  Interesting note about a buffer overflow.
>
>
> [manager]
>
> Bro 2.4
> Linux 3.16.0-38-generic
>
> core
> [New LWP 18834]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by `/usr/local/3rd-party/bro/bin/bro -U .status -p broctl -p broctl-live -p local -'.
> Program terminated with signal SIGABRT, Aborted.
> #0  0x00007f163bb46cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>
> Thread 1 (Thread 0x............ (LWP 18834)):
> #0  0x00007f163bb46cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> #1  0x00007f163bb4a0d8 in __GI_abort () at abort.c:89
> #2  0x00007f163bb83394 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x............ "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
> #3  0x00007f163bc1ac9c in __GI___fortify_fail (msg=<optimized out>, msg@entry=0x............ "buffer overflow detected") at fortify_fail.c:37
> #4  0x00007f163bc19b60 in __GI___chk_fail () at chk_fail.c:28
> #5  0x00007f163bc1abe7 in __fdelt_chk (d=<optimized out>) at fdelt_chk.c:25
> #6  0x00000000005e962a in Set (set=0x............, this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/iosource/FD_Set.h:59
> #7  SocketComm::Run (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:3406
> #8  0x00000000005e9c31 in RemoteSerializer::Fork (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:687
> #9  0x00000000005e9d4f in RemoteSerializer::Enable (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:575
> #10 0x00000000005b6943 in BifFunc::bro_enable_communication (frame=<optimized out>, BiF_ARGS=<optimized out>) at bro.bif:4480
> #11 0x00000000005b431d in BuiltinFunc::Call (this=0x............, args=0x............, parent=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Func.cc:586
> #12 0x0000000000599066 in CallExpr::Eval (this=0x............, f=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Expr.cc:4544
> #13 0x000000000060ceb4 in ExprStmt::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:352
> #14 0x000000000060b174 in IfStmt::DoExec (this=0x............, f=0x............, v=<optimized out>, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:456
> #15 0x000000000060ced1 in ExprStmt::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:356
> #16 0x000000000060b211 in StmtList::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696
> #17 0x000000000060b211 in StmtList::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696
> #18 0x00000000005c042e in BroFunc::Call (this=0x............, args=<optimized out>, parent=0x0) at /home/bro/Bro-IDS/bro-2.4/src/Func.cc:403
> #19 0x000000000057ee2a in EventHandler::Call (this=0x............, vl=0x............, no_remote=no_remote@entry=false) at /home/bro/Bro-IDS/bro-2.4/src/EventHandler.cc:130
> #20 0x000000000057e035 in Dispatch (no_remote=false, this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Event.h:50
> #21 EventMgr::Dispatch (this=this@entry=0x...... <mgr>) at /home/bro/Bro-IDS/bro-2.4/src/Event.cc:111
> #22 0x000000000057e1d0 in EventMgr::Drain (this=0xbbd720 <mgr>) at /home/bro/Bro-IDS/bro-2.4/src/Event.cc:128
> #23 0x00000000005300ed in main (argc=<optimized out>, argv=<optimized out>) at /home/bro/Bro-IDS/bro-2.4/src/main.cc:1147
>
>
>
> On Mon, Jun 29, 2015 at 4:09 PM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
> Never mind... new box, default nofile limits.  Thanks for the malloc tip.
>
>
> On Mon, Jun 29, 2015 at 4:03 PM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
> Switching to jemalloc fixed the stability issue but not the worker count limitation.
>
> On Sun, Jun 28, 2015 at 7:18 PM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
> Looks like malloc from glibc, default on Ubuntu.  I will try jemalloc and others.
>
>
>
> On Sun, Jun 28, 2015 at 1:03 AM, Jan Grashofer <jan.grashofer at cern.ch> wrote:
> I experienced similar problems (memory gets eaten up quickly and workers crash with a segfault) using tcmalloc. Which malloc do you use?
>
>
> Regards,
>
> Jan
>
>
> From: bro-bounces at bro.org [bro-bounces at bro.org] on behalf of Baxter Milliwew [baxter.milliwew at gmail.com]
> Sent: Friday, June 26, 2015 23:03
> To: bro at bro.org
> Subject: [Bro] Bro's limitations with high worker count and memory exhaustion
>
> There's some sort of association between memory exhaustion and a high number of workers.  The poor man's fix would be to purchase new servers with higher CPU speeds, as that would reduce the worker count.  Issues with high worker count and/or memory exhaustion appear to be a well-known problem based on the mailing list archives.
>
> With the current version, bro-2.4, my previous configuration immediately causes the manager to crash: 15 proxies, 155 workers.  To resolve this I've lowered the count to 10 proxies and 140 workers.  However, even with this configuration the manager process will exhaust all memory and crash within about 2 hours.
>
> The manager is threaded; I think this is an issue with the threading behavior between manager, proxies, and workers.  Debugging threading problems is complex and I'm a complete novice; my current tutorial is a Stack Overflow thread:
>
> http://stackoverflow.com/questions/981011/c-programming-debugging-with-pthreads
>
> Does anyone else have this problem?  What have you tried and what do you suggest?
>
> Thanks
>
>
>
>
> 1435347409.458185       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer sent class "control"
> 1435347409.458185       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] phase: handshake
> 1435347409.661085       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] request for unknown event save_results
> 1435347409.661085       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] registered for event Control::peer_status_response
> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer does not support 64bit PIDs; using compatibility mode
> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer is a Broccoli
> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] phase: running
>
>
>
>
>
> _______________________________________________
> Bro mailing list
> bro at bro-ids.org
> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro


