[Bro] Bro's limitations with high worker count and memory exhaustion

Gary Faulkner gfaulkner.nsm at gmail.com
Wed Jul 1 14:12:51 PDT 2015


Each worker should also have its own prof.log under 
<path-before-bro>/bro/spool/<worker-name> on the actual hosts 
where the workers are running.

On 7/1/15 3:50 PM, Baxter Milliwew wrote:
> No.  The cluster is six 48-core/64GB servers, with the manager on one and
> 3 proxies and 28 workers on the others.  I enabled profiling but didn't
> see anything wrong.  The bro process that is consuming all the memory is
> not the same process detailed in prof.log.
>
>
>
> On Wed, Jul 1, 2015 at 12:27 PM, Jan Grashofer <jan.grashofer at cern.ch>
> wrote:
>
>>   You are not trying to run 140 workers on a single machine with 64GB
>> memory, right?
>>
>>
>>   ------------------------------
>> *From:* Baxter Milliwew [baxter.milliwew at gmail.com]
>> *Sent:* Wednesday, July 01, 2015 20:39
>> *To:* Siwek, Jon
>> *Cc:* Jan Grashofer; bro at bro.org
>> *Subject:* Re: [Bro] Bro's limitations with high worker count and memory
>> exhaustion
>>
>>    Do you think a high worker count with the current select()-based
>> implementation would cause high memory usage?
>>
>>   I'm trying to figure out why the manager always exhausts all memory:
>>
>> top - 18:36:13 up 1 day, 14:42,  1 user,  load average: 12.67, 10.83, 10.95
>> Tasks: 606 total,   5 running, 601 sleeping,   0 stopped,   0 zombie
>> %Cpu(s): 15.3 us,  6.4 sy,  1.3 ni, 76.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
>> KiB Mem:  65939412 total, 65251768 used,   687644 free,    43248 buffers
>> KiB Swap: 67076092 total, 54857880 used, 12218212 free.  4297048 cached Mem
>>
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>> 35046 logstash  20   0 10.320g 511600   3784 S 782.1  0.8   5386:34 java
>>  9925 bro       25   5 97.504g 0.045t   1508 R  99.7 73.8 814:58.88 bro
>>  9906 bro       20   0 22.140g 3.388g   3784 S  73.2  5.4   1899:18 bro
>>  2509 root      20   0  308440  44064    784 R  48.5  0.1   1029:56 redis-server
>>  2688 bro       30  10    4604   1440   1144 R  44.8  0.0   0:00.49 gzip
>>   180 root      20   0       0      0      0 S   8.2  0.0   4:26.54 ksoftirqd/8
>>  2419 debug     20   0   25376   3564   2600 R   7.3  0.0   0:00.76 top
>>  2689 logstash  20   0       8      4      0 R   5.5  0.0   0:00.06 bro
>>
>> On Tue, Jun 30, 2015 at 11:37 AM, Baxter Milliwew <
>> baxter.milliwew at gmail.com> wrote:
>>
>>> Thanks.  Some limited reading suggests it's not possible to increase
>>> FD_SETSIZE on Linux, so it's time to migrate to poll().
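>>>
>>> For reference, a minimal sketch of the select()-to-poll() change (my own
>>> illustration, not Bro's actual I/O code): poll() takes a caller-sized
>>> array of pollfd structs instead of a fixed-size bitmask, so descriptor
>>> values aren't capped at FD_SETSIZE.
>>>
>>>     // Hypothetical poll()-based readiness check; not Bro code.
>>>     #include <poll.h>
>>>     #include <vector>
>>>
>>>     // Returns true if any of the given descriptors becomes readable
>>>     // within timeout_ms milliseconds.
>>>     bool AnyReadable(const std::vector<int>& fds, int timeout_ms)
>>>         {
>>>         std::vector<pollfd> pfds;
>>>
>>>         for ( int fd : fds )
>>>             pfds.push_back(pollfd{fd, POLLIN, 0});
>>>
>>>         // Unlike FD_SET/select(), there is no ceiling on the fd
>>>         // values being watched, only on the array size we choose.
>>>         return poll(pfds.data(), static_cast<nfds_t>(pfds.size()),
>>>                     timeout_ms) > 0;
>>>         }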
>>>
>>>
>>>
>>> On Tue, Jun 30, 2015 at 7:44 AM, Siwek, Jon <jsiwek at illinois.edu> wrote:
>>>
>>>> A guess is that you're bumping into an FD_SETSIZE limit — the way remote
>>>> I/O is currently structured uses at least 5 file descriptors per remote
>>>> connection from what I can see at a glance (a pair of pipes, 2 fds each,
>>>> for signaling read/write readiness related to ChunkedIO, plus one fd for
>>>> the actual socket).  Typically FD_SETSIZE is 1024, so with ~150-200 remote
>>>> connections at 5 fds per connection, plus whatever other descriptors Bro
>>>> may need to have open (e.g. for file I/O), it seems reasonable to guess
>>>> that's the problem.  But you could easily verify with some code
>>>> modifications to check whether the FD_SET call is using an fd >= FD_SETSIZE.
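>>>>
>>>> For example, a guard along these lines (hypothetical, not the actual
>>>> FD_Set.h code) would report the out-of-range fd instead of letting a
>>>> fortified build abort:
>>>>
>>>>     // Hypothetical bounds check around FD_SET; not the Bro source.
>>>>     #include <sys/select.h>
>>>>     #include <cstdio>
>>>>
>>>>     void CheckedSet(fd_set* set, int fd)
>>>>         {
>>>>         if ( fd < 0 || fd >= FD_SETSIZE )
>>>>             {
>>>>             // Writing past the fd_set bitmask is exactly what a
>>>>             // _FORTIFY_SOURCE build reports as "buffer overflow
>>>>             // detected" (see the __fdelt_chk frame in the backtrace
>>>>             // quoted below).
>>>>             fprintf(stderr, "fd %d out of range for FD_SETSIZE %d\n",
>>>>                     fd, FD_SETSIZE);
>>>>             return;
>>>>             }
>>>>
>>>>         FD_SET(fd, set);
>>>>         }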
>>>>
>>>> Other than making involved code changes to Bro (e.g. to move away from
>>>> select() for I/O event handling), the only suggestions I have are to 1)
>>>> reduce the number of remote connections, or 2) see if you can increase
>>>> FD_SETSIZE via preprocessor defines or CFLAGS/CXXFLAGS when ./configure'ing
>>>> (I've never done this myself to know if it works, but I've googled around
>>>> before and think the implication was that it may work on Linux).
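>>>>
>>>> For 2), the attempt would look something like the following (again
>>>> hypothetical and untested):
>>>>
>>>>     // Request a larger fd_set by defining FD_SETSIZE before any
>>>>     // system header is pulled in, e.g. by configuring with
>>>>     // CXXFLAGS=-DFD_SETSIZE=4096.  glibc may simply ignore this,
>>>>     // since its fd_set size is hard-wired at 1024 bits — verify
>>>>     // with sizeof(fd_set) before relying on it.
>>>>     #define FD_SETSIZE 4096
>>>>     #include <sys/select.h>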
>>>>
>>>> - Jon
>>>>
>>>>> On Jun 29, 2015, at 6:22 PM, Baxter Milliwew <
>>>> baxter.milliwew at gmail.com> wrote:
>>>>> The manager still crashes.  Interesting note about a buffer overflow.
>>>>>
>>>>>
>>>>> [manager]
>>>>>
>>>>> Bro 2.4
>>>>> Linux 3.16.0-38-generic
>>>>>
>>>>> core
>>>>> [New LWP 18834]
>>>>> [Thread debugging using libthread_db enabled]
>>>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>>>> Core was generated by `/usr/local/3rd-party/bro/bin/bro -U .status -p broctl -p broctl-live -p local -'.
>>>>> Program terminated with signal SIGABRT, Aborted.
>>>>> #0  0x00007f163bb46cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>>>>> Thread 1 (Thread 0x............ (LWP 18834)):
>>>>> #0  0x00007f163bb46cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>>>>> #1  0x00007f163bb4a0d8 in __GI_abort () at abort.c:89
>>>>> #2  0x00007f163bb83394 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x............ "*** %s ***: %s terminated\n") at ../sysdeps/posix/libc_fatal.c:175
>>>>> #3  0x00007f163bc1ac9c in __GI___fortify_fail (msg=<optimized out>, msg@entry=0x............ "buffer overflow detected") at fortify_fail.c:37
>>>>> #4  0x00007f163bc19b60 in __GI___chk_fail () at chk_fail.c:28
>>>>> #5  0x00007f163bc1abe7 in __fdelt_chk (d=<optimized out>) at fdelt_chk.c:25
>>>>> #6  0x00000000005e962a in Set (set=0x............, this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/iosource/FD_Set.h:59
>>>>> #7  SocketComm::Run (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:3406
>>>>> #8  0x00000000005e9c31 in RemoteSerializer::Fork (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:687
>>>>> #9  0x00000000005e9d4f in RemoteSerializer::Enable (this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:575
>>>>> #10 0x00000000005b6943 in BifFunc::bro_enable_communication (frame=<optimized out>, BiF_ARGS=<optimized out>) at bro.bif:4480
>>>>> #11 0x00000000005b431d in BuiltinFunc::Call (this=0x............, args=0x............, parent=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Func.cc:586
>>>>> #12 0x0000000000599066 in CallExpr::Eval (this=0x............, f=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Expr.cc:4544
>>>>> #13 0x000000000060ceb4 in ExprStmt::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:352
>>>>> #14 0x000000000060b174 in IfStmt::DoExec (this=0x............, f=0x............, v=<optimized out>, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:456
>>>>> #15 0x000000000060ced1 in ExprStmt::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:356
>>>>> #16 0x000000000060b211 in StmtList::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696
>>>>> #17 0x000000000060b211 in StmtList::Exec (this=0x............, f=0x............, flow=@0x............: FLOW_NEXT) at /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696
>>>>> #18 0x00000000005c042e in BroFunc::Call (this=0x............, args=<optimized out>, parent=0x0) at /home/bro/Bro-IDS/bro-2.4/src/Func.cc:403
>>>>> #19 0x000000000057ee2a in EventHandler::Call (this=0x............, vl=0x............, no_remote=no_remote@entry=false) at /home/bro/Bro-IDS/bro-2.4/src/EventHandler.cc:130
>>>>> #20 0x000000000057e035 in Dispatch (no_remote=false, this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Event.h:50
>>>>> #21 EventMgr::Dispatch (this=this@entry=0x...... <mgr>) at /home/bro/Bro-IDS/bro-2.4/src/Event.cc:111
>>>>> #22 0x000000000057e1d0 in EventMgr::Drain (this=0xbbd720 <mgr>) at /home/bro/Bro-IDS/bro-2.4/src/Event.cc:128
>>>>> #23 0x00000000005300ed in main (argc=<optimized out>, argv=<optimized out>) at /home/bro/Bro-IDS/bro-2.4/src/main.cc:1147
>>>>>
>>>>>
>>>>> On Mon, Jun 29, 2015 at 4:09 PM, Baxter Milliwew <
>>>> baxter.milliwew at gmail.com> wrote:
>>>>> Nevermind... new box, default nofile limits.  Thanks for the malloc
>>>> tip.
>>>>>
>>>>> On Mon, Jun 29, 2015 at 4:03 PM, Baxter Milliwew <
>>>> baxter.milliwew at gmail.com> wrote:
>>>>> Switching to jemalloc fixed the stability issue but not the worker
>>>> count limitation.
>>>>> On Sun, Jun 28, 2015 at 7:18 PM, Baxter Milliwew <
>>>> baxter.milliwew at gmail.com> wrote:
>>>>> Looks like malloc from glibc, default on Ubuntu.  I will try jemalloc
>>>> and others.
>>>>>
>>>>>
>>>>> On Sun, Jun 28, 2015 at 1:03 AM, Jan Grashofer <jan.grashofer at cern.ch>
>>>> wrote:
>>>>> I experienced similar problems (memory gets eaten up quickly and
>>>> workers crash with segfault) using tcmalloc. Which malloc do you use?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Jan
>>>>>
>>>>>
>>>>> From: bro-bounces at bro.org [bro-bounces at bro.org] on behalf of Baxter Milliwew [baxter.milliwew at gmail.com]
>>>>> Sent: Friday, June 26, 2015 23:03
>>>>> To: bro at bro.org
>>>>> Subject: [Bro] Bro's limitations with high worker count and memory exhaustion
>>>>>
>>>>> There's some sort of association between memory exhaustion and a high
>>>>> number of workers.  The poor man's fix would be to purchase new servers
>>>>> with higher CPU speeds, as that would reduce the worker count.  Issues
>>>>> with high worker count and/or memory exhaustion appear to be a
>>>>> well-known problem based on the mailing list archives.
>>>>>
>>>>> In the current version of bro-2.4 my previous configuration immediately
>>>>> causes the manager to crash: 15 proxies, 155 workers.  To resolve this
>>>>> I've lowered the count to 10 proxies and 140 workers.  However, even
>>>>> with this configuration the manager process will exhaust all memory and
>>>>> crash within about 2 hours.
>>>>>
>>>>> The manager is threaded; I think this is an issue with the threading
>>>>> behavior between the manager, proxies, and workers.  Debugging threading
>>>>> problems is complex and I'm a complete novice; I'm currently working
>>>>> from a Stack Overflow thread:
>>>>>
>>>>> http://stackoverflow.com/questions/981011/c-programming-debugging-with-pthreads
>>>>>
>>>>> Does anyone else have this problem?  What have you tried and what do
>>>>> you suggest?
>>>>>
>>>>> Thanks
>>>>>
>>>>> 1435347409.458185       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer sent class "control"
>>>>> 1435347409.458185       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] phase: handshake
>>>>> 1435347409.661085       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] request for unknown event save_results
>>>>> 1435347409.661085       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] registered for event Control::peer_status_response
>>>>> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer does not support 64bit PIDs; using compatibility mode
>>>>> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] peer is a Broccoli
>>>>> 1435347409.694858       worker-2-18     parent  -       -       -       info    [#10000/10.1.1.1:36994] phase: running
>>>>>
>
>
> _______________________________________________
> Bro mailing list
> bro at bro-ids.org
> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro
