[Bro] Memory Consumption

Gilbert Clark gc355804 at ohio.edu
Fri Jun 27 13:39:12 PDT 2014


Hi Jason:

It's only when *pending* begins to grow that there's a large queue of 
messages waiting to be written (as far as I know).  In this case, 
pending stays at 0/0 for each of those log updates, so I don't think 
that's an issue (for this log, at least :)

One possibility here is that bro is actually leaking memory somewhere, 
and you're lucky enough to have found a bug :) The usual sign of that is 
a steady rise in memory usage over time under a relatively constant 
traffic volume / mix.  One thing to try (if it's something you feel 
comfortable with) might be to pull down the current copy of bro from 
master, build that, and push it out to the nodes to see what kind of an 
effect that has (if anything).
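
Roughly, something like the following (just a sketch -- the git URL and 
prefix are assumptions, so adjust them to match your install, and the 
usual caveats about running master apply):

  git clone --recursive git://git.bro.org/bro
  cd bro && ./configure --prefix=/opt/bro && make && make install
  broctl install     # push the new build / scripts out to the nodes
  broctl restart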

Another thing to try might be tweaking which scripts are actually 
loaded in local.bro: comment out all the @load statements in local.bro 
(add a '#' to the beginning of each line) and see what kind of an 
effect that has on memory utilization.  Assuming memory usage drops, 
you can then slowly start removing the '#' characters and restarting 
bro to re-load the scripts one at a time.  This is going to be pretty 
tedious, though.
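
For example, a fully commented-out local.bro might start out looking 
something like this (the script names here are just common defaults -- 
yours will differ):

  # @load misc/loaded-scripts
  # @load tuning/defaults
  # @load misc/scan
  # @load protocols/ssl/validate-certs
  # @load protocols/http/detect-sqli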

A third thing to try might be doing a bit of sampling so that bro only 
sees some percentage of incoming connections.  From there, slowly bring 
traffic back to see how bro's memory utilization rises as traffic is 
restored.  You might want to define a few different levels (drop 90% of 
connections, then 75%, 50%, and 25%) and leave the drop rate at each 
step for a while (read: a few hours would probably be a good start) to 
see what happens to memory utilization.  If memory utilization stays 
pretty constant at each level, and the overall memory pattern ends up 
looking like a staircase, it might be time to consider that RAM upgrade :)

*** Note that it's important that traffic be connection-sampled when 
trying the above: packet-sampling will lead to dropped packets in the 
middle of connections, which might not work quite as expected. Seth: how 
does connection sampling work for the packet filter framework?  I 
haven't ever really used it, but I think it can do that, right?  Also, 
any other ideas / things I might be missing here?
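
In the meantime, one rough way to approximate connection sampling by 
hand is a BPF expression keyed on the address pair, so that both 
directions of a connection are kept or dropped together.  This is very 
much an untested sketch (IPv4-only, the interface name is just an 
example, and the bitmask only gives power-of-two steps):

  # Sanity-check the expression with tcpdump first; this keeps roughly
  # 1/4 of address pairs:
  tcpdump -n -i eth0 '((ip[12:4] + ip[16:4]) & 3) = 0'

  # I *think* the easiest place to hang it in bro is the default capture
  # filter, e.g. in local.bro -- but double-check this against the packet
  # filter framework docs before trusting it:
  redef PacketFilter::default_capture_filter = "((ip[12:4] + ip[16:4]) & 3) = 0";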

One other question: did adjusting those timeouts change the number of 
inactivity timers reported?  Probably not relevant to this issue, but 
just wondering if the change had any measurable effect.

-Gilbert

On 6/27/14, 11:29 AM, Jason Batchelor wrote:
> Thanks Gilbert!
> I think I am getting close to at least isolating the issue.
> I redefined some of the inactivity timeout values to something pretty 
> aggressive...
> redef tcp_inactivity_timeout = 15 sec;
> redef udp_inactivity_timeout = 5 sec;
> redef icmp_inactivity_timeout = 5 sec;
> After committing the changes and restarting I am still seeing the same 
> kind of slow memory consumption behavior.
> I checked the IO statistics you gave above and think this is where I 
> am getting backed up. Below is a brief escalation of just the http 
> logs themselves.
> grep 'http/Log::WRITER_ASCII' prof.log | awk 'NR % 10 == 1'
> 1403880551.747191   http/Log::WRITER_ASCII    in=40 out=11 pending=0/0 
> (#queue r/w: in=40/40 out=11/11)
> 1403880701.759282   http/Log::WRITER_ASCII    in=632 out=160 
> pending=0/0 (#queue r/w: in=632/632 out=160/160)
> 1403880851.764553   http/Log::WRITER_ASCII    in=1254 out=310 
> pending=0/0 (#queue r/w: in=1254/1254 out=310/310)
> 1403881001.794827   http/Log::WRITER_ASCII    in=1881 out=459 
> pending=0/0 (#queue r/w: in=1881/1881 out=459/459)
> 1403881151.907771   http/Log::WRITER_ASCII    in=2496 out=607 
> pending=0/0 (#queue r/w: in=2496/2496 out=607/607)
> 1403881302.133110   http/Log::WRITER_ASCII    in=3140 out=754 
> pending=0/0 (#queue r/w: in=3140/3140 out=754/754)
> 1403881452.684259   http/Log::WRITER_ASCII    in=3781 out=900 
> pending=0/0 (#queue r/w: in=3781/3781 out=900/900)
> 1403881611.446692   http/Log::WRITER_ASCII    in=4321 out=1000 
> pending=0/0 (#queue r/w: in=4321/4321 out=1000/1000)
> 1403881783.945970   http/Log::WRITER_ASCII    in=4816 out=1069 
> pending=0/0 (#queue r/w: in=4816/4816 out=1069/1069)
> 1403881991.154812   http/Log::WRITER_ASCII    in=5435 out=1105 
> pending=0/0 (#queue r/w: in=5435/5435 out=1105/1105)
> 1403882156.814938   http/Log::WRITER_ASCII    in=6066 out=1190 
> pending=0/0 (#queue r/w: in=6066/6066 out=1190/1190)
> 1403882336.215055   http/Log::WRITER_ASCII    in=6690 out=1267 
> pending=0/0 (#queue r/w: in=6690/6690 out=1267/1267)
> 1403882494.089058   http/Log::WRITER_ASCII    in=7350 out=1377 
> pending=0/0 (#queue r/w: in=7350/7350 out=1377/1377)
> If I am interpreting this correctly, I am far exceeding my ability to 
> write out logs as time goes on, presumably resulting in a backup of 
> that data in memory. The same kind of behavior is seen in other log 
> types as well.
> Am I interpreting this correctly? If so, the solution seems to be that 
> I need faster drives and/or more memory :)
>
>
> On Thu, Jun 26, 2014 at 5:35 PM, Gilbert Clark <gc355804 at ohio.edu> wrote:
>
>     Hi:
>
>     I believe this particular timer is a general timer used to track
>     inactivity for all protocols (but someone can correct me if I'm
>     wrong :).  Maybe try tuning the following:
>
>     const tcp_inactivity_timeout = 5 min &redef;
>     const udp_inactivity_timeout = 1 min &redef;
>     const icmp_inactivity_timeout = 1 min &redef;
>
>     Reference:
>     http://www.notary.icsi.berkeley.edu/sphinx-git/scripts/base/init-bare.bro.html#id-udp_inactivity_timeout
>
>     Also, I believe it's possible to set timeouts per-connection based
>     on properties of the established connections.  For an example of
>     doing this / how this might be useful, take a look at:
>
>     https://bro.org/sphinx/scripts/base/protocols/conn/inactivity.bro.html
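>
>     For instance, something along these lines (an untested sketch of
>     the idea -- the linked inactivity.bro does it more generally with
>     a table of per-service intervals) would give established SSH
>     connections a longer inactivity timeout than everything else:
>
>     event connection_established(c: connection)
>         {
>         # Port 22 here is just an assumed example.
>         if ( c$id$resp_p == 22/tcp )
>             set_inactivity_timeout(c$id, 1 hr);
>         }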
>
>     Re: interpreting prof.log output -- a few notes from my experience:
>
>     There should be lines that include the number of connections
>     currently active for each major protocol type, e.g:
>
>     Conns: tcp=1/130 udp=1/70 icmp=0/0
>
>     Syntax here is: <proto>=<current active connections>/<maximum
>     number of active connections ever observed>, repeated for each of
>     tcp, udp, and icmp.
>
>     The line following the above includes more detailed connection
>     overhead information:
>
>     Conns: total=6528 current=2/2 ext=0 mem=9312K avg=4656.0 table=24K
>     connvals=6K
>
>     A few notes about fields that might be useful there:
>
>     * total=total number of connections (aggregate: not just at this
>     particular moment)
>     * current=X/Y:  X and Y are two counts that will usually differ to
>     some extent, but both count the number of connections observed
>         - X: the number of active connections in total  (not
>     necessarily all of which are kept in state tables)
>         - Y: the number of connections stored in bro's state tables
>     (tcp + udp + icmp) at this moment in time
>     * avg=average memory use (in bytes) per active connection
>     * table=total amount of memory used by the state tables (tcp + udp
>     + icmp)
>
>     'avg' and 'table' are only recorded occasionally because computing
>     these values can be expensive.  When that "Global_sizes ..."
>     output is included in a log entry, these numbers will be
>     accurate.  Otherwise, they will be 0.
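>
>     To watch how those change over time, something like this works:
>
>     grep 'Conns: total=' prof.log | awk 'NR % 10 == 1'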
>
>     For an idea of the overhead associated with the Timer objects
>     themselves (read: the overhead for the timers isn't included in
>     the overhead computed for the connection state), take a look at
>     the line that looks something like:
>
>     Timers: current=19 max=19 mem=1K lag=0.00s
>
>     * current=number of timers currently active in total
>     * max=maximum number of timers ever active at once
>     * mem=total memory consumed by all of the currently active timers
>     (usually pretty small compared to other things, though)
>
>     Also, one other note: under 'Threads', there's a bunch of lines
>     that look something like:
>
>     http/Log::WRITER_ASCII    in=11318 out=10882 pending=0/0 (#queue
>     r/w: in=11318/11318 out=10882/10882)
>     ssl/Log::WRITER_ASCII     in=10931 out=10878 pending=0/0 (#queue
>     r/w: in=10931/10931 out=10878/10878)
>     files/Log::WRITER_ASCII   in=10989 out=10792 pending=0/0 (#queue
>     r/w: in=10989/10989 out=10792/10792)
>     dhcp/Log::WRITER_ASCII    in=1031 out=1029 pending=0/0 (#queue
>     r/w: in=1031/1031 out=1029/1029)
>
>     Generally, pending X/Y describes (relatively speaking) how much
>     memory is currently being consumed by messages that are waiting to
>     be written to a log file, or that have been read from an input
>     source but not yet processed by bro.
>
>     A pending X/Y that grows steadily over time is an indication that
>     bro could eventually run out of room to store outstanding log /
>     input framework messages, and that these messages could eventually
>     come to consume so much memory that the worker would thrash the
>     machine into sweet digital oblivion.
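>
>     A quick way to spot that in a prof.log is to look for any writer
>     line where pending isn't 0/0 -- e.g.:
>
>     grep 'Log::WRITER' prof.log | grep -v 'pending=0/0'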
>
>     Hope something in there is useful,
>     Gilbert
>
>
>     On 6/26/14, 2:26 PM, Jason Batchelor wrote:
>>     Small follow-up to this as well, since it may be relevant. I
>>     notice the timers for stale connections seem to increase in
>>     parallel with memory...
>>     grep 'ConnectionInactivityTimer' prof.log | awk 'NR % 10 == 1'
>>     1403802069.314888 ConnectionInactivityTimer = 5844
>>     1403802219.315759 ConnectionInactivityTimer = 21747
>>     1403802369.316387 ConnectionInactivityTimer = 32275
>>     1403802519.317613 ConnectionInactivityTimer = 32716
>>     1403802669.318303 ConnectionInactivityTimer = 32597
>>     1403802819.319193 ConnectionInactivityTimer = 34207
>>     1403802969.320204 ConnectionInactivityTimer = 39176
>>     1403803119.321978 ConnectionInactivityTimer = 40394
>>     1403803269.323058 ConnectionInactivityTimer = 38631
>>     1403803419.323688 ConnectionInactivityTimer = 35847
>>     1403803569.324716 ConnectionInactivityTimer = 34432
>>     1403803719.325888 ConnectionInactivityTimer = 34591
>>     1403803869.326713 ConnectionInactivityTimer = 34716
>>     1403804019.327664 ConnectionInactivityTimer = 35361
>>     1403804169.329254 ConnectionInactivityTimer = 35915
>>     1403804319.330507 ConnectionInactivityTimer = 34994
>>     1403804469.331842 ConnectionInactivityTimer = 33212
>>     1403804619.332236 ConnectionInactivityTimer = 32290
>>     1403804769.332993 ConnectionInactivityTimer = 32513
>>     1403804919.333717 ConnectionInactivityTimer = 32592
>>     1403805069.334477 ConnectionInactivityTimer = 32388
>>     1403805219.334875 ConnectionInactivityTimer = 32932
>>     1403805369.335753 ConnectionInactivityTimer = 31771
>>     1403805519.337054 ConnectionInactivityTimer = 28749
>>     1403805669.337563 ConnectionInactivityTimer = 26509
>>     1403805819.339240 ConnectionInactivityTimer = 26654
>>     1403805969.340812 ConnectionInactivityTimer = 26297
>>     1403806119.341841 ConnectionInactivityTimer = 25362
>>     1403806269.344342 ConnectionInactivityTimer = 24435
>>     1403806419.345146 ConnectionInactivityTimer = 24954
>>     1403806569.346057 ConnectionInactivityTimer = 24088
>>     1403806719.347671 ConnectionInactivityTimer = 30207
>>     1403806869.349643 ConnectionInactivityTimer = 34276
>>     Notice the steady increase, then slight decrease, then steady
>>     increase again. Is there a way to control these settings for
>>     performance testing purposes?
>>     I know while I was tuning Suricata, I needed to be mindful of
>>     connection timeouts and due to the volume of flows I am getting I
>>     needed to be pretty aggressive.
>>     Thanks,
>>     Jason
>>
>>
>>     On Thu, Jun 26, 2014 at 12:29 PM, Jason Batchelor
>>     <jxbatchelor at gmail.com> wrote:
>>
>>         Thanks Seth:
>>
>>         I'm not sure I have a license for an experienced bro memory
>>         debugger; however, I will document what I've done here for
>>         folks in hopes it proves useful!
>>
>>         I've enabled profiling by adding the following.
>>
>>         vim /opt/bro/share/bro/site/local.bro
>>         @load misc/profiling
>>
>>         Then enforced the changes...
>>
>>         broctl stop
>>         broctl install
>>         broctl start
>>
>>         At the moment I have 46308184k used and 3067820k free memory.
>>
>>         In /var/opt/bro/spool/worker-1-1, prof.log content is
>>         captured as you mentioned (and likewise for all nodes).
>>
>>         Earlier you wrote:
>>
>>         >  Every so often in there will be an indication of the
>>         largest global variables
>>
>>         Is this what you mean (taken from one worker)....?
>>
>>         1403803224.322453 Global_sizes > 100k: 0K
>>         1403803224.322453                Known::known_services = 469K
>>         (3130/3130 entries)
>>         1403803224.322453  Cluster::manager2worker_events = 137K
>>         1403803224.322453                Weird::weird_ignore = 31492K
>>         (146569/146569 entries)
>>         1403803224.322453                Known::certs = 58K (310/310
>>         entries)
>>         1403803224.322453  SumStats::threshold_tracker = 668K (4/2916
>>         entries)
>>         1403803224.322453                FTP::ftp_data_expected =
>>         181K (46/46 entries)
>>         1403803224.322453                Notice::suppressing = 595K
>>         (2243/2243 entries)
>>         1403803224.322453  Communication::connected_peers = 156K (2/2
>>         entries)
>>         1403803224.322453  SumStats::sending_results = 8028K (3/5545
>>         entries)
>>         1403803224.322453                Software::tracked = 33477K
>>         (12424/31111 entries)
>>         1403803224.322453                FTP::cmd_reply_code = 48K
>>         (325/325 entries)
>>         1403803224.322453                SumStats::result_store =
>>         27962K (5/19978 entries)
>>         1403803224.322453                SSL::cipher_desc = 97K
>>         (356/356 entries)
>>         1403803224.322453                RADIUS::attr_types = 44K
>>         (169/169 entries)
>>         1403803224.322453                Weird::actions = 35K
>>         (163/163 entries)
>>         1403803224.322453                Known::known_hosts = 3221K
>>         (21773/21773 entries)
>>         1403803224.322453                Weird::did_log = 54K
>>         (287/287 entries)
>>         1403803224.322453  SSL::recently_validated_certs = 8667K
>>         (24752/24752 entries)
>>         1403803224.322453                Communication::nodes = 188K
>>         (4/4 entries)
>>         1403803224.322453                SSL::root_certs = 204K
>>         (144/144 entries)
>>         1403803224.322453 Global_sizes total: 116727K
>>         1403803224.322453 Total number of table entries: 213548/260715
>>         1403803239.322685 ------------------------
>>         1403803239.322685 Memory: total=1185296K total_adj=1137108K
>>         malloced: 1144576K
>>
>>         Any other pointers on how to interpret this data?
>>
>>         FWIW, here are some additional statistics from the worker
>>         prof.log...
>>
>>         grep  "Memory: " prof.log | awk 'NR % 10 == 1'
>>         0.000000 Memory: total=48188K total_adj=0K malloced: 47965K
>>         1403802189.315606 Memory: total=614476K total_adj=566288K
>>         malloced: 614022K
>>         1403802339.316381 Memory: total=938380K total_adj=890192K
>>         malloced: 938275K
>>         1403802489.317426 Memory: total=1006168K total_adj=957980K
>>         malloced: 1003385K
>>         1403802639.318199 Memory: total=1041288K total_adj=993100K
>>         malloced: 1035422K
>>         1403802789.319107 Memory: total=1063544K total_adj=1015356K
>>         malloced: 1058229K
>>         1403802939.320170 Memory: total=1140652K total_adj=1092464K
>>         malloced: 1139608K
>>         1403803089.321327 Memory: total=1184540K total_adj=1136352K
>>         malloced: 1179411K
>>         1403803239.322685 Memory: total=1185296K total_adj=1137108K
>>         malloced: 1144576K
>>         1403803389.323680 Memory: total=1185296K total_adj=1137108K
>>         malloced: 1118961K
>>         1403803539.324677 Memory: total=1185296K total_adj=1137108K
>>         malloced: 1092719K
>>         1403803689.325763 Memory: total=1185296K total_adj=1137108K
>>         malloced: 1091447K
>>
>>
>>         On Thu, Jun 26, 2014 at 11:49 AM, Seth Hall
>>         <seth at icir.org> wrote:
>>
>>
>>             On Jun 26, 2014, at 12:43 PM, Jason Batchelor
>>             <jxbatchelor at gmail.com> wrote:
>>
>>             > > Bro typically does consume quite a bit of memory and
>>             you're a bit tight on memory for the number of workers
>>             you're running.
>>             > Curious what would you recommend for just bro itself?
>>             Double, triple this?
>>
>>             It seems like most people just put 128G of memory in Bro
>>             boxes now because the cost just isn't really worth going
>>             any lower if there's a remote possibility you might use it.
>>
>>             > I will definitely take a look, thanks for the info!
>>
>>             Feel free to ask again if you're having trouble.  We
>>             really should write up some debugging documentation for
>>             this process sometime.  Anyone with experience doing this
>>             memory debugging activity up for it?  Doesn't have to be
>>             anything fancy, just the steps and various things to look
>>             at to figure out what exactly is happening.
>>
>>               .Seth
>>
>>             --
>>             Seth Hall
>>             International Computer Science Institute
>>             (Bro) because everyone has a network
>>             http://www.bro.org/
>>
>>
>>
>
>
>     _______________________________________________
>     Bro mailing list
>     bro at bro-ids.org
>     http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro
>
>
