[Bro] Memory Consumption

Jason Batchelor jxbatchelor at gmail.com
Fri Jun 27 08:29:12 PDT 2014


Thanks Gilbert!

I think I am getting close to at least isolating the issue.

I redefined some of the inactivity timeout values to something pretty
aggressive...
redef tcp_inactivity_timeout = 15 sec;
redef udp_inactivity_timeout = 5 sec;
redef icmp_inactivity_timeout = 5 sec;

After committing the changes and restarting, I am still seeing the same
kind of slow, steady memory growth.

I checked the IO statistics you pointed out and think this is where I am
getting backed up. Below is a brief progression of just the http log
writer itself.

grep 'http/Log::WRITER_ASCII' prof.log | awk 'NR % 10 == 1'
1403880551.747191   http/Log::WRITER_ASCII    in=40 out=11 pending=0/0
(#queue r/w: in=40/40 out=11/11)
1403880701.759282   http/Log::WRITER_ASCII    in=632 out=160 pending=0/0
(#queue r/w: in=632/632 out=160/160)
1403880851.764553   http/Log::WRITER_ASCII    in=1254 out=310 pending=0/0
(#queue r/w: in=1254/1254 out=310/310)
1403881001.794827   http/Log::WRITER_ASCII    in=1881 out=459 pending=0/0
(#queue r/w: in=1881/1881 out=459/459)
1403881151.907771   http/Log::WRITER_ASCII    in=2496 out=607 pending=0/0
(#queue r/w: in=2496/2496 out=607/607)
1403881302.133110   http/Log::WRITER_ASCII    in=3140 out=754 pending=0/0
(#queue r/w: in=3140/3140 out=754/754)
1403881452.684259   http/Log::WRITER_ASCII    in=3781 out=900 pending=0/0
(#queue r/w: in=3781/3781 out=900/900)
1403881611.446692   http/Log::WRITER_ASCII    in=4321 out=1000 pending=0/0
(#queue r/w: in=4321/4321 out=1000/1000)
1403881783.945970   http/Log::WRITER_ASCII    in=4816 out=1069 pending=0/0
(#queue r/w: in=4816/4816 out=1069/1069)
1403881991.154812   http/Log::WRITER_ASCII    in=5435 out=1105 pending=0/0
(#queue r/w: in=5435/5435 out=1105/1105)
1403882156.814938   http/Log::WRITER_ASCII    in=6066 out=1190 pending=0/0
(#queue r/w: in=6066/6066 out=1190/1190)
1403882336.215055   http/Log::WRITER_ASCII    in=6690 out=1267 pending=0/0
(#queue r/w: in=6690/6690 out=1267/1267)
1403882494.089058   http/Log::WRITER_ASCII    in=7350 out=1377 pending=0/0
(#queue r/w: in=7350/7350 out=1377/1377)

If I am interpreting this correctly, the volume of incoming log messages
increasingly outpaces what gets written out as time goes on, presumably
causing that data to back up in memory. The same kind of behavior shows
up in the other log types as well.

Am I interpreting this correctly? If so, the solution seems to be faster
drives and/or more memory :)
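
For what it's worth, a rough sketch of a one-liner (untested; it assumes
each prof.log entry sits on a single line, with the wrapping above coming
from my mail client) that should show that gap growing over time by
printing the timestamp, messages in, messages out, and the difference
between the two:

grep 'http/Log::WRITER_ASCII' prof.log | \
    awk '{sub("in=","",$3); sub("out=","",$4); print $1, $3, $4, $3 - $4}'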



On Thu, Jun 26, 2014 at 5:35 PM, Gilbert Clark <gc355804 at ohio.edu> wrote:

>  Hi:
>
> I believe this particular timer is a general timer used to track
> inactivity for all protocols (but someone can correct me if I'm wrong :).
> Maybe try tuning the following:
>
> const tcp_inactivity_timeout = 5 min &redef;
> const udp_inactivity_timeout = 1 min &redef;
> const icmp_inactivity_timeout = 1 min &redef;
>
> Reference:
> http://www.notary.icsi.berkeley.edu/sphinx-git/scripts/base/init-bare.bro.html#id-udp_inactivity_timeout
>
> Also, I believe it's possible to set timeouts per-connection based on
> properties of the established connections.  For an example of doing this /
> how this might be useful, take a look at:
>
> https://bro.org/sphinx/scripts/base/protocols/conn/inactivity.bro.html
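>
> As a rough, untested sketch: the set_inactivity_timeout() BIF lets a
> script override the inactivity timeout for a single connection, e.g. to
> be extra aggressive only once a TCP session is established:
>
> event connection_established(c: connection)
>     {
>     # Hypothetical example: shorten the inactivity timeout to 30
>     # seconds for just this connection; everything else keeps the
>     # global defaults.
>     set_inactivity_timeout(c$id, 30 secs);
>     }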
>
> Re: interpreting prof.log output -- a few notes from my experience:
>
> There should be lines that include the number of connections currently
> active for each major protocol type, e.g:
>
> Conns: tcp=1/130 udp=1/70 icmp=0/0
>
> Syntax here is, for each protocol:
>
> tcp=<current active connections>/<maximum active connections ever observed>
> udp=<current active connections>/<maximum active connections ever observed>
> icmp=<current active connections>/<maximum active connections ever observed>
>
> The line following the above includes more detailed connection overhead
> information:
>
> Conns: total=6528 current=2/2 ext=0 mem=9312K avg=4656.0 table=24K
> connvals=6K
>
> A few notes about fields that might be useful there:
>
> * total=total number of connections (aggregate: not just at this
> particular moment)
> * current=X/Y:  X and Y are two counts that will usually differ to some
> extent, but both count the number of connections observed
>     - X: the number of active connections in total  (not necessarily all
> of which are kept in state tables)
>     - Y: the number of connections stored in bro's state tables (tcp + udp
> + icmp) at this moment in time
> * avg=average memory use (in bytes) per active connection
> * table=total amount of memory used by the state tables (tcp + udp + icmp)
>
> 'avg' and 'table' are only recorded occasionally because computing these
> values can be expensive.  When that "Global_sizes ..." output is included
> in a log entry, these numbers will be accurate.  Otherwise, they will be 0.
>
> For an idea of the overhead associated with the Timer objects themselves
> (read: the overhead for the timers isn't included in the overhead computed
> for the connection state), take a look at the line that looks something
> like:
>
> Timers: current=19 max=19 mem=1K lag=0.00s
>
> * current=number of timers currently active in total
> * max=maximum number of timers ever active at once
> * mem=total memory consumed by all of the currently active timers
> (usually pretty small compared to other things, though)
>
> Also, one other note: under 'Threads', there's a bunch of lines that look
> something like:
>
> http/Log::WRITER_ASCII    in=11318 out=10882 pending=0/0 (#queue r/w:
> in=11318/11318 out=10882/10882)
> ssl/Log::WRITER_ASCII     in=10931 out=10878 pending=0/0 (#queue r/w:
> in=10931/10931 out=10878/10878)
> files/Log::WRITER_ASCII   in=10989 out=10792 pending=0/0 (#queue r/w:
> in=10989/10989 out=10792/10792)
> dhcp/Log::WRITER_ASCII    in=1031 out=1029 pending=0/0 (#queue r/w:
> in=1031/1031 out=1029/1029)
>
> Generally, pending X/Y describes how much memory is currently being
> consumed (relatively speaking) by messages waiting to be written to a
> log file, or by messages read from an input source but not yet
> processed by bro.
>
> A pending X/Y that grows steadily over time is an indication that bro
> could eventually run out of room to store outstanding log / input framework
> messages, and that these messages could eventually come to consume so much
> memory that the worker would thrash the machine into sweet digital oblivion.
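>
> If you want to keep an eye on that, here is a quick sketch (assuming
> the timestamped, one-line-per-entry format prof.log uses) that prints
> the timestamp, writer name, and pending counts for every log writer:
>
> grep 'Log::WRITER' prof.log | awk '{print $1, $2, $5}'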
>
> Hope something in there is useful,
> Gilbert
>
>
> On 6/26/14, 2:26 PM, Jason Batchelor wrote:
>
>  Small follow up to this as well, since it may be relevant. I notice
> the number of timers for stale connections seems to increase in
> parallel with memory...
>
> grep 'ConnectionInactivityTimer' prof.log | awk 'NR % 10 == 1'
> 1403802069.314888         ConnectionInactivityTimer = 5844
> 1403802219.315759         ConnectionInactivityTimer = 21747
> 1403802369.316387         ConnectionInactivityTimer = 32275
> 1403802519.317613         ConnectionInactivityTimer = 32716
> 1403802669.318303         ConnectionInactivityTimer = 32597
> 1403802819.319193         ConnectionInactivityTimer = 34207
> 1403802969.320204         ConnectionInactivityTimer = 39176
> 1403803119.321978         ConnectionInactivityTimer = 40394
> 1403803269.323058         ConnectionInactivityTimer = 38631
> 1403803419.323688         ConnectionInactivityTimer = 35847
> 1403803569.324716         ConnectionInactivityTimer = 34432
> 1403803719.325888         ConnectionInactivityTimer = 34591
> 1403803869.326713         ConnectionInactivityTimer = 34716
> 1403804019.327664         ConnectionInactivityTimer = 35361
> 1403804169.329254         ConnectionInactivityTimer = 35915
> 1403804319.330507         ConnectionInactivityTimer = 34994
> 1403804469.331842         ConnectionInactivityTimer = 33212
> 1403804619.332236         ConnectionInactivityTimer = 32290
> 1403804769.332993         ConnectionInactivityTimer = 32513
> 1403804919.333717         ConnectionInactivityTimer = 32592
> 1403805069.334477         ConnectionInactivityTimer = 32388
> 1403805219.334875         ConnectionInactivityTimer = 32932
> 1403805369.335753         ConnectionInactivityTimer = 31771
> 1403805519.337054         ConnectionInactivityTimer = 28749
> 1403805669.337563         ConnectionInactivityTimer = 26509
> 1403805819.339240         ConnectionInactivityTimer = 26654
> 1403805969.340812         ConnectionInactivityTimer = 26297
> 1403806119.341841         ConnectionInactivityTimer = 25362
> 1403806269.344342         ConnectionInactivityTimer = 24435
> 1403806419.345146         ConnectionInactivityTimer = 24954
> 1403806569.346057         ConnectionInactivityTimer = 24088
> 1403806719.347671         ConnectionInactivityTimer = 30207
> 1403806869.349643         ConnectionInactivityTimer = 34276
>
> Notice the steady increase, then slight decrease, then steady increase
> again. Is there a way to control these settings for performance testing
> purposes?
>
> I know while I was tuning Suricata I needed to be mindful of connection
> timeouts, and due to the volume of flows I am getting I needed to be
> pretty aggressive.
>
> Thanks,
> Jason
>
>
>
> On Thu, Jun 26, 2014 at 12:29 PM, Jason Batchelor <jxbatchelor at gmail.com>
> wrote:
>
>> Thanks Seth:
>>
>>  I'm not sure I have a license for an experienced bro memory debugger,
>> but I will document what I've done here for folks in hopes it proves
>> useful!
>>
>>  I've enabled profiling by adding the following.
>>
>>  vim /opt/bro/share/bro/site/local.bro
>> @load misc/profiling
>>
>>  Then enforced the changes...
>>
>>  broctl stop
>> broctl install
>> broctl start
>>
>>  At the moment I have 46308184k used and 3067820k free memory.
>>
>>  In /var/opt/bro/spool/worker-1-1, prof.log content is captured as you
>> mentioned (and likewise for all nodes).
>>
>>  Earlier you wrote:
>>
>>  >  Every so often in there will be an indication of the largest global
>> variables
>>
>>  Is this what you mean (taken from one worker)...?
>>
>>  1403803224.322453 Global_sizes > 100k: 0K
>> 1403803224.322453                Known::known_services = 469K (3130/3130
>> entries)
>> 1403803224.322453                Cluster::manager2worker_events = 137K
>> 1403803224.322453                Weird::weird_ignore = 31492K
>> (146569/146569 entries)
>> 1403803224.322453                Known::certs = 58K (310/310 entries)
>> 1403803224.322453                SumStats::threshold_tracker = 668K
>> (4/2916 entries)
>> 1403803224.322453                FTP::ftp_data_expected = 181K (46/46
>> entries)
>> 1403803224.322453                Notice::suppressing = 595K (2243/2243
>> entries)
>> 1403803224.322453                Communication::connected_peers = 156K
>> (2/2 entries)
>> 1403803224.322453                SumStats::sending_results = 8028K
>> (3/5545 entries)
>> 1403803224.322453                Software::tracked = 33477K (12424/31111
>> entries)
>> 1403803224.322453                FTP::cmd_reply_code = 48K (325/325
>> entries)
>> 1403803224.322453                SumStats::result_store = 27962K (5/19978
>> entries)
>> 1403803224.322453                SSL::cipher_desc = 97K (356/356 entries)
>> 1403803224.322453                RADIUS::attr_types = 44K (169/169
>> entries)
>> 1403803224.322453                Weird::actions = 35K (163/163 entries)
>> 1403803224.322453                Known::known_hosts = 3221K (21773/21773
>> entries)
>> 1403803224.322453                Weird::did_log = 54K (287/287 entries)
>> 1403803224.322453                SSL::recently_validated_certs = 8667K
>> (24752/24752 entries)
>> 1403803224.322453                Communication::nodes = 188K (4/4 entries)
>> 1403803224.322453                SSL::root_certs = 204K (144/144 entries)
>> 1403803224.322453 Global_sizes total: 116727K
>>  1403803224.322453 Total number of table entries: 213548/260715
>> 1403803239.322685 ------------------------
>> 1403803239.322685 Memory: total=1185296K total_adj=1137108K malloced:
>> 1144576K
>>
>>  Any other pointers on how to interpret this data?
>>
>>  FWIW, here are some additional statistics from the worker prof.log...
>>
>>  grep  "Memory: " prof.log | awk 'NR % 10 == 1'
>> 0.000000 Memory: total=48188K total_adj=0K malloced: 47965K
>> 1403802189.315606 Memory: total=614476K total_adj=566288K malloced:
>> 614022K
>> 1403802339.316381 Memory: total=938380K total_adj=890192K malloced:
>> 938275K
>> 1403802489.317426 Memory: total=1006168K total_adj=957980K malloced:
>> 1003385K
>> 1403802639.318199 Memory: total=1041288K total_adj=993100K malloced:
>> 1035422K
>> 1403802789.319107 Memory: total=1063544K total_adj=1015356K malloced:
>> 1058229K
>> 1403802939.320170 Memory: total=1140652K total_adj=1092464K malloced:
>> 1139608K
>> 1403803089.321327 Memory: total=1184540K total_adj=1136352K malloced:
>> 1179411K
>> 1403803239.322685 Memory: total=1185296K total_adj=1137108K malloced:
>> 1144576K
>> 1403803389.323680 Memory: total=1185296K total_adj=1137108K malloced:
>> 1118961K
>> 1403803539.324677 Memory: total=1185296K total_adj=1137108K malloced:
>> 1092719K
>> 1403803689.325763 Memory: total=1185296K total_adj=1137108K malloced:
>> 1091447K
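>>
>>  As an untested sketch, something like this should turn those absolute
>> numbers into per-interval growth, printing the timestamp and the change
>> in total memory (in KB) since the previous sample:
>>
>>  grep "Memory: " prof.log | awk '{sub("total=","",$3); if (NR > 1) print $1, $3 - prev; prev = $3 + 0}'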
>>
>>
>> On Thu, Jun 26, 2014 at 11:49 AM, Seth Hall <seth at icir.org> wrote:
>>
>>>
>>> On Jun 26, 2014, at 12:43 PM, Jason Batchelor <jxbatchelor at gmail.com>
>>> wrote:
>>>
>>> > > Bro typically does consume quite a bit of memory and you're a bit
>>> tight on memory for the number of workers you're running.
>>> > Curious what would you recommend for just bro itself? Double, triple
>>> this?
>>>
>>>  It seems like most people just put 128G of memory in Bro boxes now
>>> because the cost just isn't really worth going any lower if there's a
>>> remote possibility you might use it.
>>>
>>> > I will definitely take a look, thanks for the info!
>>>
>>>  Feel free to ask again if you're having trouble.  We really should
>>> write up some debugging documentation for this process sometime.  Anyone
>>> with experience doing this memory debugging activity up for it?  Doesn't
>>> have to be anything fancy, just the steps and various things to look at to
>>> figure out what exactly is happening.
>>>
>>>   .Seth
>>>
>>> --
>>> Seth Hall
>>> International Computer Science Institute
>>> (Bro) because everyone has a network
>>> http://www.bro.org/
>>>
>>>
>>
>
>