[Bro] Bro's limitations with high worker count and memory exhaustion
Baxter Milliwew
baxter.milliwew at gmail.com
Wed Jul 1 11:39:09 PDT 2015
Do you think a high worker count with the current select()-based
implementation would cause high memory usage?
I'm trying to figure out why the manager always exhausts all memory:
top - 18:36:13 up 1 day, 14:42,  1 user,  load average: 12.67, 10.83, 10.95
Tasks: 606 total,   5 running, 601 sleeping,   0 stopped,   0 zombie
%Cpu(s): 15.3 us,  6.4 sy,  1.3 ni, 76.3 id,  0.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem:  65939412 total, 65251768 used,   687644 free,    43248 buffers
KiB Swap: 67076092 total, 54857880 used, 12218212 free.  4297048 cached Mem

  PID USER      PR  NI    VIRT     RES   SHR S  %CPU %MEM     TIME+ COMMAND
35046 logstash  20   0 10.320g  511600  3784 S 782.1  0.8   5386:34 java
 9925 bro       25   5 97.504g  0.045t  1508 R  99.7 73.8 814:58.88 bro
 9906 bro       20   0 22.140g  3.388g  3784 S  73.2  5.4   1899:18 bro
 2509 root      20   0  308440   44064   784 R  48.5  0.1   1029:56 redis-server
 2688 bro       30  10    4604    1440  1144 R  44.8  0.0   0:00.49 gzip
  180 root      20   0       0       0     0 S   8.2  0.0   4:26.54 ksoftirqd/8
 2419 debug     20   0   25376    3564  2600 R   7.3  0.0   0:00.76 top
 2689 logstash  20   0       8       4     0 R   5.5  0.0   0:00.06 bro
On Tue, Jun 30, 2015 at 11:37 AM, Baxter Milliwew <baxter.milliwew at gmail.com> wrote:
> Thanks. Some limited reading says it's not possible to increase
> FD_SETSIZE on Linux, so it's time to migrate to poll().
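The appeal of the migration mentioned above can be sketched in a few lines. This hypothetical helper is not Bro's actual event loop; it just shows why poll() sidesteps FD_SETSIZE: the kernel walks a caller-allocated pollfd array, so descriptor *values* above 1023 are legal and only the array length matters.

```c
#include <poll.h>
#include <stdlib.h>

/* Wait until any of nfds descriptors is readable, or timeout_ms elapses.
   Returns poll()'s result: >0 ready, 0 timeout, -1 error. */
int wait_readable(const int* fds, int nfds, int timeout_ms)
	{
	struct pollfd* pfds = calloc(nfds, sizeof(struct pollfd));
	if ( ! pfds )
		return -1;

	for ( int i = 0; i < nfds; ++i )
		{
		/* No FD_SETSIZE ceiling here: fds[i] may be any valid descriptor. */
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
		}

	int n = poll(pfds, nfds, timeout_ms);
	free(pfds);
	return n;
	}
```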
>
>
>
> On Tue, Jun 30, 2015 at 7:44 AM, Siwek, Jon <jsiwek at illinois.edu> wrote:
>
>> A guess is that you’re bumping into an FD_SETSIZE limit — the way remote
>> I/O is currently structured has at least 5 file descriptors per remote
>> connection from what I can see at a glance (a pair of pipes, 2 fds each,
>> for signaling read/write readiness related to ChunkedIO and one fd for the
>> actual socket). Typically, FD_SETSIZE is 1024, so with ~150-200 remote
>> connections and 5 fds per connection plus whatever other descriptors Bro
>> may need to have open (e.g. for file I/O), it seems reasonable to guess
>> that’s the problem. But you could easily verify w/ some code modifications
>> to check whether the FD_SET call is using a fd >= FD_SETSIZE.
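The verification Jon suggests can be sketched as a guard wrapper. `checked_fd_set` is a hypothetical helper, not code from Bro's FD_Set.h; it makes the range check explicit that glibc's fortification otherwise turns into an abort ("buffer overflow detected" via __fdelt_chk, as in the backtrace further down).

```c
#include <stdio.h>
#include <sys/select.h>

/* Set fd in *set only if it fits in an fd_set. Returns 1 on success,
   0 if the descriptor is out of range and would corrupt memory. */
static int checked_fd_set(int fd, fd_set* set)
	{
	if ( fd < 0 || fd >= FD_SETSIZE )
		{
		fprintf(stderr, "fd %d out of range for FD_SETSIZE %d\n",
		        fd, FD_SETSIZE);
		return 0;
		}

	FD_SET(fd, set);
	return 1;
	}
```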
>>
>> Other than making involved code changes to Bro (e.g. to move away from
>> select() for I/O event handling), the only suggestions I have are 1)
>> reduce the number of remote connections, or 2) see if you can increase
>> FD_SETSIZE via preprocessor stuff or CFLAGS/CXXFLAGS upon ./configure’ing
>> (I’ve never done this myself to know if it works, but I’ve googled around
>> before and think the implication was that it may work on Linux).
>>
>> - Jon
>>
>> > On Jun 29, 2015, at 6:22 PM, Baxter Milliwew <baxter.milliwew at gmail.com>
>> wrote:
>> >
>> > The manager still crashes. Interesting note about a buffer overflow.
>> >
>> >
>> > [manager]
>> >
>> > Bro 2.4
>> > Linux 3.16.0-38-generic
>> >
>> > core
>> > [New LWP 18834]
>> > [Thread debugging using libthread_db enabled]
>> > Using host libthread_db library
>> "/lib/x86_64-linux-gnu/libthread_db.so.1".
>> > Core was generated by `/usr/local/3rd-party/bro/bin/bro -U .status -p
>> broctl -p broctl-live -p local -'.
>> > Program terminated with signal SIGABRT, Aborted.
>> > #0 0x00007f163bb46cc9 in __GI_raise (sig=sig at entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> >
>> > Thread 1 (Thread 0x............ (LWP 18834)):
>> > #0 0x00007f163bb46cc9 in __GI_raise (sig=sig at entry=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:56
>> > #1 0x00007f163bb4a0d8 in __GI_abort () at abort.c:89
>> > #2 0x00007f163bb83394 in __libc_message (do_abort=do_abort at entry=2,
>> fmt=fmt at entry=0x............ "*** %s ***: %s terminated\n") at
>> ../sysdeps/posix/libc_fatal.c:175
>> > #3 0x00007f163bc1ac9c in __GI___fortify_fail (msg=<optimized out>,
>> msg at entry=0x............ "buffer overflow detected") at fortify_fail.c:37
>> > #4 0x00007f163bc19b60 in __GI___chk_fail () at chk_fail.c:28
>> > #5 0x00007f163bc1abe7 in __fdelt_chk (d=<optimized out>) at
>> fdelt_chk.c:25
>> > #6 0x00000000005e962a in Set (set=0x............, this=0x............)
>> at /home/bro/Bro-IDS/bro-2.4/src/iosource/FD_Set.h:59
>> > #7 SocketComm::Run (this=0x............) at
>> /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:3406
>> > #8 0x00000000005e9c31 in RemoteSerializer::Fork (this=0x............)
>> at /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:687
>> > #9 0x00000000005e9d4f in RemoteSerializer::Enable
>> (this=0x............) at
>> /home/bro/Bro-IDS/bro-2.4/src/RemoteSerializer.cc:575
>> > #10 0x00000000005b6943 in BifFunc::bro_enable_communication
>> (frame=<optimized out>, BiF_ARGS=<optimized out>) at bro.bif:4480
>> > #11 0x00000000005b431d in BuiltinFunc::Call (this=0x............,
>> args=0x............, parent=0x............) at
>> /home/bro/Bro-IDS/bro-2.4/src/Func.cc:586
>> > #12 0x0000000000599066 in CallExpr::Eval (this=0x............,
>> f=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Expr.cc:4544
>> > #13 0x000000000060ceb4 in ExprStmt::Exec (this=0x............,
>> f=0x............, flow=@0x............: FLOW_NEXT) at
>> /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:352
>> > #14 0x000000000060b174 in IfStmt::DoExec (this=0x............,
>> f=0x............, v=<optimized out>, flow=@0x............: FLOW_NEXT) at
>> /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:456
>> > #15 0x000000000060ced1 in ExprStmt::Exec (this=0x............,
>> f=0x............, flow=@0x............: FLOW_NEXT) at
>> /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:356
>> > #16 0x000000000060b211 in StmtList::Exec (this=0x............,
>> f=0x............, flow=@0x............: FLOW_NEXT) at
>> /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696
>> > #17 0x000000000060b211 in StmtList::Exec (this=0x............,
>> f=0x............, flow=@0x............: FLOW_NEXT) at
>> /home/bro/Bro-IDS/bro-2.4/src/Stmt.cc:1696
>> > #18 0x00000000005c042e in BroFunc::Call (this=0x............,
>> args=<optimized out>, parent=0x0) at
>> /home/bro/Bro-IDS/bro-2.4/src/Func.cc:403
>> > #19 0x000000000057ee2a in EventHandler::Call (this=0x............,
>> vl=0x............, no_remote=no_remote at entry=false) at
>> /home/bro/Bro-IDS/bro-2.4/src/EventHandler.cc:130
>> > #20 0x000000000057e035 in Dispatch (no_remote=false,
>> this=0x............) at /home/bro/Bro-IDS/bro-2.4/src/Event.h:50
>> > #21 EventMgr::Dispatch (this=this at entry=0x...... <mgr>) at
>> /home/bro/Bro-IDS/bro-2.4/src/Event.cc:111
>> > #22 0x000000000057e1d0 in EventMgr::Drain (this=0xbbd720 <mgr>) at
>> /home/bro/Bro-IDS/bro-2.4/src/Event.cc:128
>> > #23 0x00000000005300ed in main (argc=<optimized out>, argv=<optimized
>> out>) at /home/bro/Bro-IDS/bro-2.4/src/main.cc:1147
>> >
>> >
>> >
>> > On Mon, Jun 29, 2015 at 4:09 PM, Baxter Milliwew <
>> baxter.milliwew at gmail.com> wrote:
>> > Nevermind... new box, default nofile limits. Thanks for the malloc tip.
>> >
>> >
>> > On Mon, Jun 29, 2015 at 4:03 PM, Baxter Milliwew <
>> baxter.milliwew at gmail.com> wrote:
>> > Switching to jemalloc fixed the stability issue but not the worker
>> count limitation.
>> >
>> > On Sun, Jun 28, 2015 at 7:18 PM, Baxter Milliwew <
>> baxter.milliwew at gmail.com> wrote:
>> > Looks like malloc from glibc, default on Ubuntu. I will try jemalloc
>> and others.
>> >
>> >
>> >
>> > On Sun, Jun 28, 2015 at 1:03 AM, Jan Grashofer <jan.grashofer at cern.ch>
>> wrote:
>> > I experienced similar problems (memory gets eaten up quickly and
>> workers crash with segfault) using tcmalloc. Which malloc do you use?
>> >
>> >
>> > Regards,
>> >
>> > Jan
>> >
>> >
>> > From: bro-bounces at bro.org [bro-bounces at bro.org] on behalf of Baxter
>> Milliwew [baxter.milliwew at gmail.com]
>> > Sent: Friday, June 26, 2015 23:03
>> > To: bro at bro.org
>> > Subject: [Bro] Bro's limitations with high worker count and memory
>> exhaustion
>> >
>> > There's some sort of association between memory exhaustion and a high
>> > number of workers. The poor man's fix would be to purchase new servers
>> > with higher CPU speeds, as that would reduce the worker count. Issues
>> > with high worker count and/or memory exhaustion appear to be a
>> > well-known problem based on the mailing list archives.
>> >
>> > In the current version of bro-2.4 my previous configuration (15 proxies,
>> > 155 workers) immediately causes the manager to crash. To resolve this I've
>> > lowered the count to 10 proxies and 140 workers. However, even with this
>> > configuration the manager process will exhaust all memory and crash within
>> > about 2 hours.
>> >
>> > The manager is threaded; I think this is an issue with the threading
>> > behavior between manager, proxies, and workers. Debugging threading
>> > problems is complex and I'm a complete novice; I'm currently working
>> > from a Stack Overflow thread:
>> >
>> > http://stackoverflow.com/questions/981011/c-programming-debugging-with-pthreads
>> >
>> > Does anyone else have this problem? What have you tried, and what do
>> > you suggest?
>> >
>> > Thanks
>> >
>> >
>> >
>> >
>> > 1435347409.458185 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] peer sent class "control"
>> > 1435347409.458185 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] phase: handshake
>> > 1435347409.661085 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] request for unknown event save_results
>> > 1435347409.661085 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] registered for event Control::peer_status_response
>> > 1435347409.694858 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] peer does not support 64bit PIDs; using compatibility mode
>> > 1435347409.694858 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] peer is a Broccoli
>> > 1435347409.694858 worker-2-18 parent - - - info [#10000/10.1.1.1:36994] phase: running
>> >
>> >
>> >
>> >
>> >
>> > _______________________________________________
>> > Bro mailing list
>> > bro at bro-ids.org
>> > http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro
>>
>>
>