[Bro] Bro's limitations with high worker count and memory exhaustion

Gary Faulkner gfaulkner.nsm at gmail.com
Wed Jul 1 15:00:29 PDT 2015


Ah, fair enough. I was potentially one of those people who struggled 
with memory exhaustion in the past, but the issue ended up largely being 
due to memory leaks in various Bro builds that existed between Bro 2.2 
and 2.3. I haven't seen similar leaks since 2.3 unless I made a mistake 
in a script such as not validating that Bro populated a value. The only 
other issue I've seen really tended to crash the proxies (or have them 
oom-killed), not the manager. The latter was also due to a form of 
communication overload, but involved software asset tracking, high IP 
turnover, and lots of table updates getting shared out to the cluster.

On 7/1/15 4:24 PM, Baxter Milliwew wrote:
> Ok, but the problem isn't with a worker, it's with the secondary manager
> process that collects the logs from workers.
>
> I think the memory exhaustion is related to worker count only in the sense
> that more threads (workers) cause more frequent leaks.  I've read
> accounts of others on this list restarting the manager roughly once per
> month; I suspect it's the same bug (a memory leak), just with a lower
> worker count.
>
>
>
>
> On Wed, Jul 1, 2015 at 2:12 PM, Gary Faulkner <gfaulkner.nsm at gmail.com>
> wrote:
>
>>   Each worker should also have its own prof.log under
>> <path-before-bro>/bro/spool/<worker-name> on the actual hosts where the
>> workers are running.
>>
>> On 7/1/15 3:50 PM, Baxter Milliwew wrote:
>>
>> No.  The cluster is six 48-core/64GB servers, with the manager on one and 3
>> proxies and 28 workers on the others.  I enabled profiling but didn't see
>> anything wrong.  The bro process that is consuming all memory is not the
>> same process detailed by prof.log.
>>
>>
>>
>> On Wed, Jul 1, 2015 at 12:27 PM, Jan Grashofer <jan.grashofer at cern.ch>
>> wrote:
>>
>>
>>    You are not trying to run 140 workers on a single machine with 64GB
>> memory, right?
>>
>>
>> ------------------------------
>> From: Baxter Milliwew [baxter.milliwew at gmail.com]
>> Sent: Wednesday, July 01, 2015 20:39
>> To: Siwek, Jon
>> Cc: Jan Grashofer; bro at bro.org
>> Subject: Re: [Bro] Bro's limitations with high worker count and memory exhaustion
>>
>>   Do you think a high worker count with the current implementation of
>> select() would cause high memory usage?
>>
>>   I'm trying to figure out why the manager always exhausts all memory:
>>
>> top - 18:36:13 up 1 day, 14:42,  1 user,  load average: 12.67, 10.83, 10.95
>> Tasks: 606 total,   5 running, 601 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 15.3 us,  6.4 sy,  1.3 ni, 76.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
>> KiB Mem:  65939412 total, 65251768 used,   687644 free,    43248 buffers
>> KiB Swap: 67076092 total, 54857880 used, 12218212 free.  4297048 cached Mem
>>
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> 35046 logstash  20   0 10.320g 511600   3784 S 782.1  0.8   5386:34 java
>>  9925 bro       25   5 97.504g 0.045t   1508 R  99.7 73.8 814:58.88 bro
>>  9906 bro       20   0 22.140g 3.388g   3784 S  73.2  5.4   1899:18 bro
>>  2509 root      20   0  308440  44064    784 R  48.5  0.1   1029:56 redis-server
>>  2688 bro       30  10    4604   1440   1144 R  44.8  0.0   0:00.49 gzip
>>   180 root      20   0       0      0      0 S   8.2  0.0   4:26.54 ksoftirqd/8
>>  2419 debug     20   0   25376   3564   2600 R   7.3  0.0   0:00.76 top
>>  2689 logstash  20   0       8      4      0 R   5.5  0.0   0:00.06 bro
>>
>> On Tue, Jun 30, 2015 at 11:37 AM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
>>
>>
>>   Thanks.  Some limited reading suggests it's not possible to increase
>> FD_SETSIZE on Linux and it's time to migrate to poll().
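>>
>>   For illustration only (this is a minimal sketch of my own, not code from
>> Bro or from this thread): the core difference is that select() requires
>> every descriptor to fit into a fixed-size fd_set bitmask, so any fd at or
>> above FD_SETSIZE cannot be represented at all, while poll() takes a
>> caller-sized array of pollfd and is bounded only by the process's open-file
>> limit.  With roughly 5 descriptors per peer, a couple hundred cluster
>> connections can already be enough to push individual descriptor numbers
>> past 1023.
>>
>>   #include <poll.h>
>>   #include <sys/select.h>
>>   #include <vector>
>>
>>   // select(): descriptors live in a fixed-size bitmask, so anything
>>   // >= FD_SETSIZE (1024 on glibc) cannot be waited on at all.
>>   bool wait_readable_select(const std::vector<int>& fds)
>>       {
>>       fd_set readfds;
>>       FD_ZERO(&readfds);
>>       int maxfd = -1;
>>       for ( int fd : fds )
>>           {
>>           if ( fd >= FD_SETSIZE )
>>               return false;   // unrepresentable descriptor
>>           FD_SET(fd, &readfds);
>>           if ( fd > maxfd )
>>               maxfd = fd;
>>           }
>>       return select(maxfd + 1, &readfds, nullptr, nullptr, nullptr) > 0;
>>       }
>>
>>   // poll(): descriptors live in a caller-sized array, so the only limit
>>   // is the process's open-file limit (ulimit -n), not FD_SETSIZE.
>>   bool wait_readable_poll(const std::vector<int>& fds)
>>       {
>>       std::vector<pollfd> pfds;
>>       for ( int fd : fds )
>>           pfds.push_back({fd, POLLIN, 0});
>>       return poll(pfds.data(), pfds.size(), -1) > 0;
>>       }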
>>
>>
>>
>> On Tue, Jun 30, 2015 at 7:44 AM, Siwek, Jon <jsiwek at illinois.edu> wrote:
>>
>>
>>   A guess is that you’re bumping into an FD_SETSIZE limit — the way remote
>> I/O is currently structured has at least 5 file descriptors per remote
>> connection from what I can see at a glance (a pair of pipes, 2 fds each,
>> for signaling read/write readiness related to ChunkedIO and one fd for the
>> actual socket).  Typically, FD_SETSIZE is 1024, so with ~150-200 remote
>> connections and 5 fds per connection plus whatever other descriptors Bro
>> may need to have open (e.g. for file I/O), it seems reasonable to guess
>> that’s the problem.  But you could easily verify w/ some code modifications
>> to check whether the FD_SET call is using a fd >= FD_SETSIZE.
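>>
>> The kind of check described above could look roughly like the following.
>> This is only a sketch of the idea, not a patch against Bro's actual
>> FD_Set.h, and the wrapper name is made up for illustration:
>>
>>   #include <sys/select.h>
>>   #include <cstdio>
>>
>>   // Hypothetical guard around FD_SET: complain instead of silently
>>   // overflowing the fd_set when a descriptor is >= FD_SETSIZE.  With
>>   // _FORTIFY_SOURCE, glibc's __fdelt_chk otherwise turns that overflow
>>   // into the "buffer overflow detected" abort shown in the backtrace
>>   // quoted below.
>>   static bool safe_fd_set(int fd, fd_set* set)
>>       {
>>       if ( fd < 0 || fd >= FD_SETSIZE )
>>           {
>>           fprintf(stderr, "fd %d does not fit in fd_set (FD_SETSIZE=%d);"
>>                   " too many remote connections?\n", fd, FD_SETSIZE);
>>           return false;
>>           }
>>
>>       FD_SET(fd, set);
>>       return true;
>>       }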
>>
>> Other than making involved code changes to Bro (e.g. to move away from
>> select() for I/O event handling), the only suggestions I have are 1)
>> reducing the number of remote connections, or 2) seeing if you can increase
>> FD_SETSIZE via preprocessor defines or CFLAGS/CXXFLAGS when ./configure’ing
>> (I’ve never done this myself to know if it works, but I’ve googled around
>> before and think the implication was that it may work on Linux).
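>>
>> On that second suggestion, one way to find out whether a -DFD_SETSIZE
>> override actually takes effect (before rebuilding all of Bro) would be a
>> tiny standalone test compiled with the same flags; this is just an
>> illustration and the 4096 value is an arbitrary example:
>>
>>   // Compile with the flags you would pass through CXXFLAGS, e.g.
>>   //   g++ -std=c++11 -DFD_SETSIZE=4096 fdsetsize_check.cc -o fdsetsize_check
>>   // If the assertion fails or the printed value is still 1024, the
>>   // override was ignored by the C library headers.
>>   #include <sys/select.h>
>>   #include <cstdio>
>>
>>   static_assert(FD_SETSIZE >= 4096, "FD_SETSIZE override was ignored");
>>
>>   int main()
>>       {
>>       printf("FD_SETSIZE = %d, sizeof(fd_set) = %zu bytes\n",
>>              FD_SETSIZE, sizeof(fd_set));
>>       return 0;
>>       }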
>>
>> - Jon
>>
>>
>>   On Jun 29, 2015, at 6:22 PM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
>>
>>   The manager still crashes.  Interesting note about a buffer overflow.
>>
>>
>> [manager]
>>
>> Bro 2.4
>> Linux 3.16.0-38-generic
>>
>> core
>> [New LWP 18834]
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>> Core was generated by `/usr/local/3rd-party/bro/bin/bro -U .status -p broctl -p broctl-live -p local -'.
>> Program terminated with signal SIGABRT, Aborted.
>> #0  0x00007f163bb46cc9 in __GI_raise (sig=sig at entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>>
>> Thread 1 (Thread 0x............ (LWP 18834)):
>> #0  0x00007f163bb46cc9 in __GI_raise (sig=sig at entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> #1  0x00007f163bb4a0d8 in __GI_abort () at abort.c:89
>> #2  0x00007f163bb83394 in __libc_message (do_abort=do_abort at entry=2, fmt=fmt at entry=0x............ "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
>> #3  0x00007f163bc1ac9c in __GI___fortify_fail (msg=<optimized out>, msg at entry=0x............ "buffer overflow detected") at fortify_fail.c:37
>> #4  0x00007f163bc19b60 in __GI___chk_fail () at chk_fail.c:28
>> #5  0x00007f163bc1abe7 in __fdelt_chk (d=<optimized out>) at fdelt_chk.c:25
>> #6  0x00000000005e962a in Set (set=0x............, this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/iosource/FD_Set.h:59
>> #7  SocketComm::Run (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:3406
>> #8  0x00000000005e9c31 in RemoteSerializer::Fork (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:687
>> #9  0x00000000005e9d4f in RemoteSerializer::Enable (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:575
>> #10 0x00000000005b6943 in BifFunc::bro_enable_communication (frame=<optimized out>, BiF_ARGS=<optimized out>) at bro.bif:4480
>> #11 0x00000000005b431d in BuiltinFunc::Call (this=0x............, args=0x............, parent=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Func.cc:586
>> #12 0x0000000000599066 in CallExpr::Eval (this=0x............, f=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Expr.cc:4544
>> #13 0x000000000060ceb4 in ExprStmt::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:352
>> #14 0x000000000060b174 in IfStmt::DoExec (this=0x............, f=0x............, v=<optimized out>, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:456
>> #15 0x000000000060ced1 in ExprStmt::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:356
>> #16 0x000000000060b211 in StmtList::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696
>> #17 0x000000000060b211 in StmtList::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696
>> #18 0x00000000005c042e in BroFunc::Call (this=0x............, args=<optimized out>, parent=0x0) at /home/bro/Bro-IDS/bro-2.4/src/Func.cc:403
>> #19 0x000000000057ee2a in EventHandler::Call (this=0x............, vl=0x............, no_remote=no_remote at entry=false) at /home/bro/Bro-IDS/bro-2.4/src/EventHandler.cc:130
>> #20 0x000000000057e035 in Dispatch (no_remote=false, this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Event.h:50
>> #21 EventMgr::Dispatch (this=this at entry=0x...... <mgr>) at /home/bro/Bro-IDS/bro-2.4/src/Event.cc:111
>> #22 0x000000000057e1d0 in EventMgr::Drain (this=0xbbd720 <mgr>) at /home/bro/Bro-IDS/bro-2.4/src/Event.cc:128
>> #23 0x00000000005300ed in main (argc=<optimized out>, argv=<optimized out>) at /home/bro/Bro-IDS/bro-2.4/src/main.cc:1147
>>
>>   On Mon, Jun 29, 2015 at 4:09 PM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
>>
>>   Nevermind... new box, default nofile limits.  Thanks for the malloc tip.
>>
>>   On Mon, Jun 29, 2015 at 4:03 PM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
>>
>>   Switching to jemalloc fixed the stability issue but not the worker count limitation.
>>
>>   On Sun, Jun 28, 2015 at 7:18 PM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
>>
>>   Looks like malloc from glibc, default on Ubuntu.  I will try jemalloc and others.
>>
>>   On Sun, Jun 28, 2015 at 1:03 AM, Jan Grashofer <jan.grashofer at cern.ch> wrote:
>>
>>   I experienced similar problems (memory gets eaten up quickly and workers crash with segfault) using tcmalloc. Which malloc do you use?
>>
>>   Regards,
>>
>> Jan
>>
>>
>>
>> From: bro-bounces at bro.org [bro-bounces at bro.org] on behalf of Baxter Milliwew [baxter.milliwew at gmail.com]
>> Sent: Friday, June 26, 2015 23:03
>> To: bro at bro.org
>> Subject: [Bro] Bro's limitations with high worker count and memory exhaustion
>>
>>   There's some sort of association between memory exhaustion and a high
>> number of workers.  The poor man's fix would be to purchase new servers
>> with higher CPU speeds, as that would reduce the worker count.  Issues with
>> high worker count and/or memory exhaustion appear to be a well-known
>> problem based on the mailing list archives.
>>
>>   In the current version of bro-2.4 my previous configuration (15 proxies,
>> 155 workers) immediately causes the manager to crash.  To resolve this I've
>> lowered the count to 10 proxies and 140 workers.  However, even with this
>> configuration the manager process will exhaust all memory and crash within
>> about 2 hours.
>>
>>   The manager is threaded; I think this is an issue with the threading
>> behavior between the manager, proxies, and workers.  Debugging threading
>> problems is complex and I'm a complete novice; my current tutorial is the
>> information in this Stack Overflow thread:
>>
>>    http://stackoverflow.com/questions/981011/c-programming-debugging-with-pthreads
>>
>>   Does anyone else have this problem?  What have you tried and what do
>> you suggest?
>>
>>   Thanks
>>
>>
>>
>>
>> 1435347409.458185       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer sent class "control"
>> 1435347409.458185       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] phase: handshake
>> 1435347409.661085       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] request for unknown event save_results
>> 1435347409.661085       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] registered for event Control::peer_status_response
>> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer does not support 64bit PIDs; using compatibility mode
>> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer is a Broccoli
>> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] phase: running
>>
>>


