[Bro] Bro's limitations with high worker count and memory exhaustion

Baxter Milliwew baxter.milliwew at gmail.com
Thu Jul 2 15:09:54 PDT 2015


On Thu, Jul 2, 2015 at 6:55 AM, Siwek, Jon <jsiwek at illinois.edu> wrote:

>
> I’d guess the slowness of disk I/O is more of a contributing factor. A
> general problem is that if the manager can’t consume logs (i.e. write them
> to disk) at a rate greater than or equal to the rate at which workers
> produce them, then it buffers them up until it runs out of memory.  If you
> enable profiling, prof.log would contain details indicating whether this
> is the situation.
>


This is something I checked before and it seemed to be a non-issue.  I'm
using a RAM-backed filesystem and rotating logs every 5 minutes.  In prof.log
the pending counter is always 0/0.  Is there something else I should be
looking for?


        /**
         * Statistics about inter-thread communication.
         */
        struct Stats
                {
                uint64_t sent_in;       //! Number of messages sent to the child thread.
                uint64_t sent_out;      //! Number of messages sent from the child thread to the main thread.
                uint64_t pending_in;    //! Number of messages sent to the child but not yet processed.
                uint64_t pending_out;   //! Number of messages sent from the child but not yet processed by the main thread.

                /// Statistics from our queues.
                Queue<BasicInputMessage *>::Stats  queue_in_stats;
                Queue<BasicOutputMessage *>::Stats queue_out_stats;
                };

1435868941.791125 TCP-States:Fin.
1435868941.791125 TCP-States:Rst.
1435868941.791125 Connections expired due to inactivity: 0
1435868941.791125 Total reassembler data: 0K
1435868941.791125 Timers: current=69 max=71 mem=5K lag=505.67s
1435868941.791125 DNS_Mgr: requests=0 succesful=0 failed=0 pending=0 cached_hosts=0 cached_addrs=0
1435868941.791125 Triggers: total=0 pending=0
1435868941.791125         RotateTimer = 34
1435868941.791125         ScheduleTimer = 10
1435868941.791125         TableValTimer = 25
1435868941.791125 Threads: current=34
1435868941.791125   loaded_scripts/Log::WRITER_ASCII in=114 out=112 pending=0/0 (#queue r/w: in=114/114 out=112/112)
1435868941.791125   communication/Log::WRITER_ASCII in=147 out=112 pending=0/0 (#queue r/w: in=147/147 out=112/112)
1435868941.791125   files/Log::WRITER_ASCII   in=10522 out=54 pending=0/0 (#queue r/w: in=10522/10522 out=54/54)
1435868941.791125   reporter/Log::WRITER_ASCII in=82 out=54 pending=0/0 (#queue r/w: in=82/82 out=54/54)
1435868941.791125   weird/Log::WRITER_ASCII   in=1083 out=54 pending=0/0 (#queue r/w: in=1083/1083 out=54/54)
1435868941.791125   packet_filter/Log::WRITER_ASCII in=55 out=54 pending=0/0 (#queue r/w: in=55/55 out=54/54)
1435868941.791125   dpd/Log::WRITER_ASCII     in=180 out=54 pending=0/0 (#queue r/w: in=180/180 out=54/54)
1435868941.791125   software/Log::WRITER_ASCII in=404 out=54 pending=0/0 (#queue r/w: in=404/404 out=54/54)
1435868941.791125   tunnel/Log::WRITER_ASCII  in=109 out=54 pending=0/0 (#queue r/w: in=109/109 out=54/54)
1435868941.791125   conn/Log::WRITER_ASCII    in=20583 out=54 pending=0/0 (#queue r/w: in=20583/20583 out=54/54)
1435868941.791125   dns/Log::WRITER_ASCII     in=4369 out=54 pending=0/0 (#queue r/w: in=4369/4369 out=54/54)
1435868941.791125   ftp/Log::WRITER_ASCII     in=94 out=54 pending=0/0 (#queue r/w: in=94/94 out=54/54)
1435868941.791125   ssl/Log::WRITER_ASCII     in=2566 out=54 pending=0/0 (#queue r/w: in=2566/2566 out=54/54)
1435868941.791125   x509/Log::WRITER_ASCII    in=1625 out=54 pending=0/0 (#queue r/w: in=1625/1625 out=54/54)
1435868941.791125   http/Log::WRITER_ASCII    in=7894 out=54 pending=0/0 (#queue r/w: in=7894/7894 out=54/54)
1435868941.791125   irc/Log::WRITER_ASCII     in=57 out=54 pending=0/0 (#queue r/w: in=57/57 out=54/54)
1435868941.791125   kerberos/Log::WRITER_ASCII in=65 out=54 pending=0/0 (#queue r/w: in=65/65 out=54/54)
1435868941.791125   mysql/Log::WRITER_ASCII   in=78 out=54 pending=0/0 (#queue r/w: in=78/78 out=54/54)
1435868941.791125   rdp/Log::WRITER_ASCII     in=140 out=54 pending=0/0 (#queue r/w: in=140/140 out=54/54)
1435868941.791125   sip/Log::WRITER_ASCII     in=108 out=54 pending=0/0 (#queue r/w: in=108/108 out=54/54)
1435868941.791125   snmp/Log::WRITER_ASCII    in=104 out=54 pending=0/0 (#queue r/w: in=104/104 out=54/54)
1435868941.791125   smtp/Log::WRITER_ASCII    in=232 out=54 pending=0/0 (#queue r/w: in=232/232 out=54/54)
1435868941.791125   socks/Log::WRITER_ASCII   in=69 out=54 pending=0/0 (#queue r/w: in=69/69 out=54/54)
1435868941.791125   ssh/Log::WRITER_ASCII     in=167 out=54 pending=0/0 (#queue r/w: in=167/167 out=54/54)
1435868941.791125   pe/Log::WRITER_ASCII      in=78 out=54 pending=0/0 (#queue r/w: in=78/78 out=54/54)
1435868941.791125   capture_loss/Log::WRITER_ASCII in=56 out=54 pending=0/0 (#queue r/w: in=56/56 out=54/54)
1435868941.791125   known_hosts/Log::WRITER_ASCII in=137 out=54 pending=0/0 (#queue r/w: in=137/137 out=54/54)
1435868941.791125   known_services/Log::WRITER_ASCII in=108 out=54 pending=0/0 (#queue r/w: in=108/108 out=54/54)
1435868941.791125   known_certs/Log::WRITER_ASCII in=77 out=54 pending=0/0 (#queue r/w: in=77/77 out=54/54)
1435868941.791125   radius/Log::WRITER_ASCII  in=78 out=53 pending=0/0 (#queue r/w: in=78/78 out=53/53)
1435868941.791125   dhcp/Log::WRITER_ASCII    in=53 out=52 pending=0/0 (#queue r/w: in=53/53 out=52/52)
1435868941.791125   notice/Log::WRITER_ASCII  in=96 out=50 pending=0/0 (#queue r/w: in=96/96 out=50/50)
1435868941.791125   modbus/Log::WRITER_ASCII  in=50 out=49 pending=0/0 (#queue r/w: in=50/50 out=49/49)
1435868941.791125   traceroute/Log::WRITER_ASCII in=51 out=47 pending=0/0 (#queue r/w: in=51/51 out=47/47)
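
For what it's worth, here is a minimal sketch of how one might scan prof.log
for writer threads whose pending counters are nonzero (a hypothetical helper,
not anything shipped with Bro, assuming the thread-line format shown in the
excerpt above):

#!/usr/bin/env python
# check_pending.py -- hypothetical helper, not part of Bro.
# Scan prof.log thread lines such as:
#   1435868941.791125   conn/Log::WRITER_ASCII  in=20583 out=54 pending=0/0 (...)
# and print any sample where pending_in/pending_out is nonzero, i.e. messages
# queued to or from a writer thread that have not been processed yet.
import re
import sys

THREAD_LINE = re.compile(r'^(?P<ts>\d+\.\d+)\s+(?P<thread>\S+)\s+'
                         r'in=\d+\s+out=\d+\s+'
                         r'pending=(?P<pin>\d+)/(?P<pout>\d+)')

def scan(path):
    with open(path) as f:
        for line in f:
            m = THREAD_LINE.match(line)
            if not m:
                continue
            pin, pout = int(m.group('pin')), int(m.group('pout'))
            if pin or pout:
                print('%s %s pending=%d/%d'
                      % (m.group('ts'), m.group('thread'), pin, pout))

if __name__ == '__main__':
    scan(sys.argv[1] if len(sys.argv) > 1 else 'prof.log')

If it prints nothing, every sampled writer thread reported pending=0/0, which
matches what I'm seeing.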

> Any slowness of select() due to a large number of fds probably puts more
> pressure back on workers and the remote connection, as it’s involved in
> actually pulling data off the wire and also in processing some of the
> chunks pulled off the wire, and those intermediate buffers have a limit
> before connections with peers that talk too much get shut down.  This
> could cause thrashing of the actual connection between workers and manager
> (communication.log may reference stuff “filling up” or something like
> that), but it’s probably still possible to get into a situation where the
> manager’s logging thread queues still can’t dequeue entries (i.e. write to
> disk) fast enough and it eventually hits OOM.
>


There are a few messages in communication.log about filling up, but I'm not
sure they're frequent enough to account for the memory exhaustion:

#fields ts      peer    src_name        connected_peer_desc     connected_peer_addr     connected_peer_port     level   message
#types  time    string  string  string  addr    port    string  string
1435870417.629212       manager child   -       -       -       info    [#10005/10.1.1.40:53113] connection closed
1435870417.629212       manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
1435870417.629212       manager script  -       -       -       info    connection closed
1435870417.629212       manager parent  -       -       -       info    [#10094/10.1.1.37:49822] peer disconnected
1435870417.629212       manager child   -       -       -       info    [#10094/10.1.1.37:49822] connection closed
1435870417.629212       manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
1435870417.629212       manager script  -       -       -       info    connection closed
1435870417.629212       manager parent  -       -       -       info    [#10090/10.1.1.41:33718] peer disconnected
1435870423.145758       manager child   -       -       -       info    [#10090/10.1.1.41:33718] connection closed
1435870423.145758       manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
1435870423.145758       manager script  -       -       -       info    connection closed
1435870423.145758       manager parent  -       -       -       info    [#10038/10.1.1.38:34331] peer disconnected
1435870423.145758       manager child   -       -       -       info    [#10038/10.1.1.38:34331] connection closed
1435870423.145758       manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
1435870423.145758       manager script  -       -       -       info    connection closed
1435870423.145758       manager parent  -       -       -       info    [#10128/10.1.1.39:37465] peer disconnected
1435870423.519705       manager child   -       -       -       info    [#10128/10.1.1.39:37465] connection closed
1435870423.519705       manager script  -       -       -       info    connection closed
1435870423.519705       manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
1435870423.519705       manager parent  -       -       -       info    [#10147/10.1.1.38:34353] peer disconnected
1435870423.873331       manager child   -       -       -       info    [#10147/10.1.1.38:34353] connection closed
1435870423.873331       manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
1435870423.873331       manager script  -       -       -       info    connection closed
1435870429.806242       manager parent  -       -       -       info    [#10116/10.1.1.37:49813] peer disconnected
1435870429.806242       manager child   -       -       -       info    [#10116/10.1.1.37:49813] connection closed
1435869706.106072       proxy-4 parent  -       -       -       info    parent statistics: pending=0 bytes=515841K/3942969K chunks=2770217/19597747 io=1104183/1199656 bytes/io=0.47K/3.29K events=0/0 operations=1383192/9797004
1435869706.106072       proxy-4 parent  -       -       -       info    child statistics: [0] pending=0 bytes=0K/94K chunks=8/101 io=4/51 bytes/io=0.05K/1.85K
1435869706.106072       proxy-4 parent  -       -       -       info    child statistics: [1] pending=0 bytes=58392K/436287K chunks=306097/2178102 io=137164/1089052 bytes/io=0.43K/0.40K
1435869706.106072       proxy-4 parent  -       -       -       info    child statistics: [2] pending=0 bytes=57694K/436873K chunks=302777/2179288 io=135313/1089645 bytes/io=0.43K/0.40K
1435869706.106072       proxy-4 parent  -       -       -       info    child statistics: [3] pending=0 bytes=54266K/440600K chunks=305363/2180192 io=136884/1090097 bytes/io=0.40K/0.40K


For yesterday:

bro at manager:/usr/local/bro$ zgrep -i fill logs/2015-07-01/communication.*

logs/2015-07-01/communication.00:15:40-00:20:54.log.gz:1435709777.104131     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.00:15:40-00:20:54.log.gz:1435709777.104131     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.07:26:19-07:31:10.log.gz:1435735579.660321     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.07:26:19-07:31:10.log.gz:1435735799.681583     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745466.524085     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745689.339628     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745689.339628     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.081825     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:11:06-10:15:02.log.gz:1435745690.888470     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745702.430527     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745734.889231     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745734.889231     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745736.777859     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745737.839721     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745737.839721     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745738.539531     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.10:15:02-10:21:46.log.gz:1435745738.821476     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.16:05:06-16:10:40.log.gz:1435766942.335851     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:39:56-19:45:50.log.gz:1435779596.437060     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:39:56-19:45:50.log.gz:1435779596.437060     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779950.225082     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779950.225082     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435779990.764378     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
logs/2015-07-01/communication.19:45:50-19:53:35.log.gz:1435780017.662903     manager child   -       -       -       error   queue to parent filling up; shutting down heaviest connection
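
To get a better sense of whether these shedding events are frequent enough to
matter, something like the following rough sketch (a hypothetical helper, not
part of Bro) could tally the "filling up" errors per rotated communication.log:

#!/usr/bin/env python
# count_fillups.py -- hypothetical helper, not part of Bro.
# Tally "queue to parent filling up" errors per rotated communication.log so
# the shed connections can be correlated with particular rotation windows.
import collections
import glob
import gzip
import sys

pattern = sys.argv[1] if len(sys.argv) > 1 else 'logs/2015-07-01/communication.*.log.gz'
counts = collections.Counter()

for path in glob.glob(pattern):
    with gzip.open(path, 'rb') as f:
        for line in f:
            if b'filling up' in line:
                counts[path] += 1

for path in sorted(counts):
    print('%5d  %s' % (counts[path], path))

From the zgrep output above it looks like the errors cluster around a handful
of five-minute windows rather than occurring continuously.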








> And if you haven’t already, take a look at reporter.log for any scripting
> errors as certain types may cause memory leaks.


There's a regular stream of:

1435796386.319295       Reporter::ERROR incomplete base64 group, padding with 12 bits of 0      (empty)
1435796386.701326       Reporter::ERROR character 45 ignored by Base64 decoding (empty)
1435796386.701326       Reporter::ERROR incomplete base64 group, padding with 18 bits of 0      (empty)
1435796388.426591       Reporter::ERROR extra base64 groups after '=' padding are ignored       (empty)
1435796388.426591       Reporter::ERROR incomplete base64 group, padding with 18 bits of 0      (empty)
1435796397.701170       Reporter::ERROR character 45 ignored by Base64 decoding (empty)
1435796397.701170       Reporter::ERROR incomplete base64 group, padding with 18 bits of 0      (empty)
1435796400.416380       Reporter::ERROR extra base64 groups after '=' padding are ignored       (empty)
1435796400.416380       Reporter::ERROR incomplete base64 group, padding with 6 bits of 0       (empty)


And occasionally messages for SumStat (port-scan, addr-scan,
detect-sql-attackers, traceroute-detection):

0.000000        Reporter::WARNING       SumStat key request for the KoufzswOJrf SumStat uid took longer than 1 minute and was automatically cancelled.  /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218
0.000000        Reporter::WARNING       SumStat key request for the tVVpSQcL7Ac SumStat uid took longer than 1 minute and was automatically cancelled.  /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218
0.000000        Reporter::WARNING       SumStat key request for the fvaD58lPrkh SumStat uid took longer than 1 minute and was automatically cancelled.  /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218
0.000000        Reporter::WARNING       SumStat key request for the mHN4TlfikJ6 SumStat uid took longer than 1 minute and was automatically cancelled.  /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 218
0.000000        Reporter::WARNING       Manager received a uid for an unknown request.  SumStat: addr-scan, Key: [str=55554/tcp, host=x.x.x.x]  /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 379
0.000000        Reporter::WARNING       Manager received a uid for an unknown request.  SumStat: addr-scan, Key: [str=445/tcp, host=x.x.x.x]  /usr/local/bro/share/bro/base/frameworks/sumstats/./cluster.bro, line 379
0.000000        Reporter::WARNING       Manager received a uid for an unknown request.  SumStat: addr-scan,
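
In case it helps to see which of these dominate, here is a quick sketch
(hypothetical, assuming reporter.log's usual tab-separated
ts/level/message/location layout) for tallying the messages by type:

#!/usr/bin/env python
# tally_reporter.py -- hypothetical helper, not part of Bro.
# Read reporter.log entries on stdin and count them by message text, with the
# variable parts (numbers, SumStat uids, keys) collapsed, to show which
# error/warning types recur most often.
import collections
import re
import sys

counts = collections.Counter()

for line in sys.stdin:
    if line.startswith('#'):
        continue
    fields = line.rstrip('\n').split('\t')
    if len(fields) < 3:
        continue
    msg = fields[2]
    msg = re.sub(r'\d+', 'N', msg)                               # collapse counts
    msg = re.sub(r'for the \S+ SumStat uid', 'for the <uid> SumStat uid', msg)
    msg = re.sub(r'Key: \[[^]]*\]', 'Key: [...]', msg)
    counts[msg] += 1

for msg, n in counts.most_common(20):
    print('%7d  %s' % (n, msg))

Run as something like: zcat logs/2015-07-01/reporter.* | python tally_reporter.py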