[Bro] Memory Consumption
Gilbert Clark
gc355804 at ohio.edu
Thu Jun 26 15:35:22 PDT 2014
Hi:
I believe this particular timer is a general timer used to track
inactivity for all protocols (but someone can correct me if I'm wrong
:). Maybe try tuning the following:
const tcp_inactivity_timeout = 5 min &redef;
const udp_inactivity_timeout = 1 min &redef;
const icmp_inactivity_timeout = 1 min &redef;
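These defaults can be lowered via redef. A minimal sketch for local.bro (the values are illustrative, not recommendations -- tune them against your own flow volume):

```bro
# local.bro -- shorten the global inactivity timeouts (illustrative values)
redef tcp_inactivity_timeout = 1 min;
redef udp_inactivity_timeout = 30 sec;
redef icmp_inactivity_timeout = 30 sec;
```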
Reference:
http://www.notary.icsi.berkeley.edu/sphinx-git/scripts/base/init-bare.bro.html#id-udp_inactivity_timeout
Also, I believe it's possible to set timeouts per-connection based on
properties of the established connections. For an example of doing this
/ how this might be useful, take a look at:
https://bro.org/sphinx/scripts/base/protocols/conn/inactivity.bro.html
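That script adjusts timeouts per connection once a protocol is confirmed. A rough sketch of the same idea using the set_inactivity_timeout() built-in (untested; the port and interval here are just placeholders):

```bro
# Sketch: give established connections to port 22/tcp a shorter
# inactivity timeout than the global tcp_inactivity_timeout.
event connection_established(c: connection)
	{
	if ( c$id$resp_p == 22/tcp )
		set_inactivity_timeout(c$id, 30 sec);
	}
```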
Re: interpreting prof.log output -- a few notes from my experience:
There should be lines that include the number of connections currently
active for each major protocol type, e.g.:
Conns: tcp=1/130 udp=1/70 icmp=0/0
The syntax here is <proto>=<current active connections>/<maximum number
of active connections ever observed>, repeated for each of tcp, udp,
and icmp.
The line following the above includes more detailed connection overhead
information:
Conns: total=6528 current=2/2 ext=0 mem=9312K avg=4656.0 table=24K connvals=6K
A few notes about fields that might be useful there:
* total=total number of connections (aggregate: not just at this
particular moment)
* current=X/Y: X and Y are two counts that will usually differ to some
extent, but both count the number of connections observed
    - X: the number of active connections in total (not necessarily
      all of which are kept in state tables)
    - Y: the number of connections stored in bro's state tables (tcp +
      udp + icmp) at this moment in time
* avg=average memory use (in bytes) per active connection
* table=total amount of memory used by the state tables (tcp + udp + icmp)
'avg' and 'table' are only recorded occasionally because computing these
values can be expensive. When that "Global_sizes ..." output is
included in a log entry, these numbers will be accurate. Otherwise,
they will be 0.
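To pull those two lines out of a worker's prof.log, something like the following works (sample lines are inlined here so the pipeline is self-contained; substitute your real prof.log path):

```shell
# Extract the per-protocol connection counts from prof.log.
# The sample file stands in for /var/opt/bro/spool/worker-1-1/prof.log.
cat > /tmp/prof_sample.log <<'EOF'
1403803224.322453 Conns: tcp=1/130 udp=1/70 icmp=0/0
1403803239.322685 Conns: total=6528 current=2/2 ext=0 mem=9312K avg=4656.0 table=24K connvals=6K
EOF
awk '/Conns: tcp=/ { print $3, $4, $5 }' /tmp/prof_sample.log
```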
For an idea of the overhead associated with the Timer objects themselves
(read: the overhead for the timers isn't included in the overhead
computed for the connection state), take a look at the line that looks
something like:
Timers: current=19 max=19 mem=1K lag=0.00s
* current=number of timers currently active in total
* max=maximum number of timers ever active at once
* mem=total memory consumed by all of the currently active timers
(usually pretty small compared to other things, though)
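Those fields can be grepped out the same way as the connection counts (again with a sample line inlined to keep the pipeline self-contained):

```shell
# Pull the timer-overhead fields from prof.log; this memory is in
# addition to the per-connection memory reported on the Conns lines.
cat > /tmp/prof_timers.log <<'EOF'
1403803224.322453 Timers: current=19 max=19 mem=1K lag=0.00s
EOF
grep 'Timers:' /tmp/prof_timers.log | awk '{ print $3, $4, $5 }'
```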
Also, one other note: under 'Threads', there's a bunch of lines that
look something like:
http/Log::WRITER_ASCII in=11318 out=10882 pending=0/0 (#queue r/w: in=11318/11318 out=10882/10882)
ssl/Log::WRITER_ASCII in=10931 out=10878 pending=0/0 (#queue r/w: in=10931/10931 out=10878/10878)
files/Log::WRITER_ASCII in=10989 out=10792 pending=0/0 (#queue r/w: in=10989/10989 out=10792/10792)
dhcp/Log::WRITER_ASCII in=1031 out=1029 pending=0/0 (#queue r/w: in=1031/1031 out=1029/1029)
Generally, pending X/Y describes (relatively speaking) how much memory
is currently consumed by messages waiting to be written to a log file,
or by messages that have been read from an input source but not yet
processed by bro.
A pending X/Y that grows steadily over time is an indication that bro
could eventually run out of room to store outstanding log / input
framework messages, and that these messages could eventually come to
consume so much memory that the worker would thrash the machine into
sweet digital oblivion.
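A quick way to spot that situation is to flag any thread whose pending count is non-zero (sample lines inlined to keep this self-contained; the ssl line's pending count is fabricated to show a match):

```shell
# Flag log/input threads with outstanding (pending) messages.
cat > /tmp/threads_sample.log <<'EOF'
http/Log::WRITER_ASCII in=11318 out=10882 pending=0/0 (#queue r/w: in=11318/11318 out=10882/10882)
ssl/Log::WRITER_ASCII in=10931 out=10878 pending=12/0 (#queue r/w: in=10931/10931 out=10878/10878)
EOF
awk '{ split($4, p, "="); if (p[2] != "0/0") print $1, $4 }' /tmp/threads_sample.log
```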
Hope something in there is useful,
Gilbert
On 6/26/14, 2:26 PM, Jason Batchelor wrote:
> Small follow up to this as well since it may be relevant. I notice the
> timers for stale connections seem to increase in parallel with memory...
> grep 'ConnectionInactivityTimer' prof.log | awk 'NR % 10 == 1'
> 1403802069.314888 ConnectionInactivityTimer = 5844
> 1403802219.315759 ConnectionInactivityTimer = 21747
> 1403802369.316387 ConnectionInactivityTimer = 32275
> 1403802519.317613 ConnectionInactivityTimer = 32716
> 1403802669.318303 ConnectionInactivityTimer = 32597
> 1403802819.319193 ConnectionInactivityTimer = 34207
> 1403802969.320204 ConnectionInactivityTimer = 39176
> 1403803119.321978 ConnectionInactivityTimer = 40394
> 1403803269.323058 ConnectionInactivityTimer = 38631
> 1403803419.323688 ConnectionInactivityTimer = 35847
> 1403803569.324716 ConnectionInactivityTimer = 34432
> 1403803719.325888 ConnectionInactivityTimer = 34591
> 1403803869.326713 ConnectionInactivityTimer = 34716
> 1403804019.327664 ConnectionInactivityTimer = 35361
> 1403804169.329254 ConnectionInactivityTimer = 35915
> 1403804319.330507 ConnectionInactivityTimer = 34994
> 1403804469.331842 ConnectionInactivityTimer = 33212
> 1403804619.332236 ConnectionInactivityTimer = 32290
> 1403804769.332993 ConnectionInactivityTimer = 32513
> 1403804919.333717 ConnectionInactivityTimer = 32592
> 1403805069.334477 ConnectionInactivityTimer = 32388
> 1403805219.334875 ConnectionInactivityTimer = 32932
> 1403805369.335753 ConnectionInactivityTimer = 31771
> 1403805519.337054 ConnectionInactivityTimer = 28749
> 1403805669.337563 ConnectionInactivityTimer = 26509
> 1403805819.339240 ConnectionInactivityTimer = 26654
> 1403805969.340812 ConnectionInactivityTimer = 26297
> 1403806119.341841 ConnectionInactivityTimer = 25362
> 1403806269.344342 ConnectionInactivityTimer = 24435
> 1403806419.345146 ConnectionInactivityTimer = 24954
> 1403806569.346057 ConnectionInactivityTimer = 24088
> 1403806719.347671 ConnectionInactivityTimer = 30207
> 1403806869.349643 ConnectionInactivityTimer = 34276
> Notice the steady increase, then slight decrease, then steady increase
> again. Is there a way to control these settings for performance
> testing purposes?
> I know while I was tuning Suricata, I needed to be mindful of
> connection timeouts, and due to the volume of flows I am getting I
> needed to be pretty aggressive.
> Thanks,
> Jason
>
>
> On Thu, Jun 26, 2014 at 12:29 PM, Jason Batchelor
> <jxbatchelor at gmail.com> wrote:
>
> Thanks Seth:
>
> I'm not sure I have a license for an experienced bro memory
> debugger, however I will document what I've done here for folks in
> hopes it proves useful!
>
> I've enabled profiling by adding the following.
>
> vim /opt/bro/share/bro/site/local.bro
> @load misc/profiling
>
> Then enforced the changes...
>
> broctl stop
> broctl install
> broctl start
>
> At the moment I have 46308184k used, 3067820k free memory.
>
> In /var/opt/bro/spool/worker-1-1, prof.log content is captured as
> you mentioned (and likewise for all nodes).
>
> Earlier you wrote:
>
> > Every so often in there will be an indication of the largest
> > global variables
>
> Is this what you mean (taken from one worker)....?
>
> 1403803224.322453 Global_sizes > 100k: 0K
> 1403803224.322453 Known::known_services = 469K (3130/3130 entries)
> 1403803224.322453 Cluster::manager2worker_events = 137K
> 1403803224.322453 Weird::weird_ignore = 31492K (146569/146569 entries)
> 1403803224.322453 Known::certs = 58K (310/310 entries)
> 1403803224.322453 SumStats::threshold_tracker = 668K (4/2916 entries)
> 1403803224.322453 FTP::ftp_data_expected = 181K (46/46 entries)
> 1403803224.322453 Notice::suppressing = 595K (2243/2243 entries)
> 1403803224.322453 Communication::connected_peers = 156K (2/2 entries)
> 1403803224.322453 SumStats::sending_results = 8028K (3/5545 entries)
> 1403803224.322453 Software::tracked = 33477K (12424/31111 entries)
> 1403803224.322453 FTP::cmd_reply_code = 48K (325/325 entries)
> 1403803224.322453 SumStats::result_store = 27962K (5/19978 entries)
> 1403803224.322453 SSL::cipher_desc = 97K (356/356 entries)
> 1403803224.322453 RADIUS::attr_types = 44K (169/169 entries)
> 1403803224.322453 Weird::actions = 35K (163/163 entries)
> 1403803224.322453 Known::known_hosts = 3221K (21773/21773 entries)
> 1403803224.322453 Weird::did_log = 54K (287/287 entries)
> 1403803224.322453 SSL::recently_validated_certs = 8667K (24752/24752 entries)
> 1403803224.322453 Communication::nodes = 188K (4/4 entries)
> 1403803224.322453 SSL::root_certs = 204K (144/144 entries)
> 1403803224.322453 Global_sizes total: 116727K
> 1403803224.322453 Total number of table entries: 213548/260715
> 1403803239.322685 ------------------------
> 1403803239.322685 Memory: total=1185296K total_adj=1137108K malloced: 1144576K
>
> Any other pointers on how to interpret this data?
>
> FWIW, here are some additional statistics from the worker prof.log...
>
> grep "Memory: " prof.log | awk 'NR % 10 == 1'
> 0.000000 Memory: total=48188K total_adj=0K malloced: 47965K
> 1403802189.315606 Memory: total=614476K total_adj=566288K malloced: 614022K
> 1403802339.316381 Memory: total=938380K total_adj=890192K malloced: 938275K
> 1403802489.317426 Memory: total=1006168K total_adj=957980K malloced: 1003385K
> 1403802639.318199 Memory: total=1041288K total_adj=993100K malloced: 1035422K
> 1403802789.319107 Memory: total=1063544K total_adj=1015356K malloced: 1058229K
> 1403802939.320170 Memory: total=1140652K total_adj=1092464K malloced: 1139608K
> 1403803089.321327 Memory: total=1184540K total_adj=1136352K malloced: 1179411K
> 1403803239.322685 Memory: total=1185296K total_adj=1137108K malloced: 1144576K
> 1403803389.323680 Memory: total=1185296K total_adj=1137108K malloced: 1118961K
> 1403803539.324677 Memory: total=1185296K total_adj=1137108K malloced: 1092719K
> 1403803689.325763 Memory: total=1185296K total_adj=1137108K malloced: 1091447K
>
>
> On Thu, Jun 26, 2014 at 11:49 AM, Seth Hall <seth at icir.org> wrote:
>
>
> On Jun 26, 2014, at 12:43 PM, Jason Batchelor
> <jxbatchelor at gmail.com> wrote:
>
> > Bro typically does consume quite a bit of memory and
> > you're a bit tight on memory for the number of workers you're
> > running.
> > Curious what would you recommend for just bro itself?
> > Double, triple this?
>
> It seems like most people just put 128G of memory in Bro boxes
> now because the cost just isn't really worth going any lower
> if there's a remote possibility you might use it.
>
> > I will definitely take a look, thanks for the info!
>
> Feel free to ask again if you're having trouble. We really
> should write up some debugging documentation for this process
> sometime. Anyone with experience doing this memory debugging
> activity up for it? Doesn't have to be anything fancy, just
> the steps and various things to look at to figure out what
> exactly is happening.
>
> .Seth
>
> --
> Seth Hall
> International Computer Science Institute
> (Bro) because everyone has a network
> http://www.bro.org/
>
>
>