[Bro-Dev] metrics framework

Tue May 10 14:29:58 PDT 2011

On Sun, May 08, 2011 at 22:24 -0400, you wrote:

> I'd appreciate if you guys took a look at the metrics framework and
> let me know what you think about it.

Pretty neat. 

Thoughts:

    - I'd split configuration of the metrics framework from adding
      data. Currently the data producer also configures things via
      create(), but it seems that's something better left to the user
      of the metrics framework. Doing so would also answer your point
      on setting up aggregation without using the create() function.

      Can you just skip the create() function altogether? From the
      producer's perspective, that function isn't really doing
      anything, right?

      You would then instead provide a configure() function that a
      user of the metrics framework calls to define
      aggregation/break_interval/etc., either globally or optionally
      on a per-ID basis.

      In the absence of any call to configure(), just pick some
      default, like aggregation per /24 and 10s intervals, or
      whatever. 

    - I'd move the $increment field out of DataPlug and make it a
      separate argument to add_data(). It has different semantics than
      the other fields, and you could then rename DataPlug to just
      Index.

    - When no subnet aggregation is set but $host is passed in, I
      think it won't work correctly. Your example for
      HTTP_REQUESTS_BY_HOST uses $index for per-host aggregation, but
      that looks like cheating. :-)

    - I'm wondering whether executing log_it() gets expensive when it
      needs to iterate through too many entries. An alternative would
      be to schedule a number of finer-grained timers (one per
      ID, or even one per aggregation unit); but then the log
      intervals would become desynchronized, which may not be
      desirable.  
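
To make the first three points concrete, here is a minimal sketch of
the split between configure() and add_data(). It is written in Python
rather than Bro script purely so that it runs standalone. The names
configure(), add_data(), and Index and the /24 and 10s defaults come
from the points above; the Metrics class, the Config record, the
keyword names aggregation_bits and break_interval, and the use of a
/32 prefix for per-host counting are assumptions made for
illustration, not the framework's actual API.

```python
import ipaddress
from collections import defaultdict
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Config:
    # Defaults used when configure() is never called (values from the text).
    aggregation_bits: int = 24     # aggregate hosts per /24
    break_interval: float = 10.0   # seconds


@dataclass(frozen=True)
class Index:
    """What is being counted; the increment is deliberately not a field."""
    host: Optional[str] = None
    index: Optional[str] = None


class Metrics:
    """Hypothetical stand-in for the metrics framework."""

    def __init__(self):
        self._global = Config()
        self._per_id = {}
        self._counters = defaultdict(int)

    def configure(self, metric_id=None, **settings):
        """Called by the user of the framework, not by the data producer."""
        if metric_id is None:
            self._global = replace(self._global, **settings)
        else:
            # Per-ID settings start from the current global configuration.
            self._per_id[metric_id] = replace(self._global, **settings)

    def config_for(self, metric_id):
        return self._per_id.get(metric_id, self._global)

    def add_data(self, metric_id, idx, increment=1):
        """Called by the data producer; no prior create() is needed."""
        cfg = self.config_for(metric_id)
        unit = idx.index
        if idx.host is not None:
            # Mask the host down to its aggregation unit.
            net = ipaddress.ip_network(
                f"{idx.host}/{cfg.aggregation_bits}", strict=False)
            unit = str(net)
        self._counters[(metric_id, unit)] += increment

    def count(self, metric_id, unit):
        return self._counters.get((metric_id, unit), 0)
```

With no configure() call, two hosts in the same /24 end up in one
counter. A user who wants per-host counts would, in this sketch, call
configure("HTTP_REQUESTS_BY_HOST", aggregation_bits=32) instead of
routing the address through $index.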

> - Missing support for cluster deployment.

Yeah, that's a tough one. Full &synchronize would be overkill, but
sending the data via events, like you suggest, also sounds quite
expensive if there are lots of entities for which something's counted.

Here's an alternative idea: don't do any communication at all, and
just let the workers log their metrics data separately (into the same
log file but including a node id column). Then provide a script that
postprocesses metrics.log by adding up all the workers' counts for the
same unit/time interval. This might cause slight time
desynchronizations, but not sure how much impact that would have if we
set sufficiently large break intervals.

Perhaps the manager could trigger logging by sending the log_it()
events, and only then would all the workers go ahead and do their
output. If the log_it() event comes with a unique interval ID, the
workers can write that out as well and then offline aggregation will be
really easy later (and if they in addition also log their local
timestamps, one can see how well the timing matches).
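
The offline step could look roughly like the following Python sketch.
The column layout (interval ID, node, metric ID, unit, count, local
timestamp, tab-separated) is an assumption made for this example;
the actual format of metrics.log may differ.

```python
import csv
import io
from collections import defaultdict


def aggregate(log_lines):
    """Sum per-worker counts and report the timestamp spread per interval.

    Assumed (hypothetical) tab-separated columns:
    interval_id, node, metric_id, unit, count, local_ts
    """
    totals = defaultdict(int)
    stamps = defaultdict(list)
    for row in csv.reader(log_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and the header line
        interval_id, node, metric_id, unit, count, local_ts = row
        totals[(interval_id, metric_id, unit)] += int(count)
        stamps[interval_id].append(float(local_ts))
    # Spread between the earliest and latest worker timestamp per interval.
    skew = {i: max(ts) - min(ts) for i, ts in stamps.items()}
    return dict(totals), skew


SAMPLE = """\
#interval_id\tnode\tmetric_id\tunit\tcount\tlocal_ts
1\tworker-1\tHTTP_REQUESTS\t10.0.0.0/24\t5\t100.00
1\tworker-2\tHTTP_REQUESTS\t10.0.0.0/24\t7\t100.25
2\tworker-1\tHTTP_REQUESTS\t10.0.0.0/24\t1\t110.00
"""

totals, skew = aggregate(io.StringIO(SAMPLE))
```

Because rows are grouped on the interval ID rather than on the
timestamps, worker clocks never have to line up; the skew value only
shows how far apart they were.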

> - Missing statistical support.

I'd leave that out for the first version. Or just do a very simple
piece: static thresholds relative to the break intervals (i.e.,
provide a function add_threshold(id, value) that alarms if a counter
for ID id exceeds value).
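
As a rough illustration of such a static threshold, here is a small
Python sketch. Only add_threshold(id, value) comes from the
suggestion above; the ThresholdCounters class, end_interval(), and
the alarms list are hypothetical names, and "exceeds" is read as
strictly greater than.

```python
from collections import defaultdict


class ThresholdCounters:
    """Hypothetical counters with one static threshold per metric ID."""

    def __init__(self):
        self._thresholds = {}
        self._counts = defaultdict(int)
        self.alarms = []

    def add_threshold(self, metric_id, value):
        self._thresholds[metric_id] = value

    def add_data(self, metric_id, unit, increment=1):
        self._counts[(metric_id, unit)] += increment

    def end_interval(self, interval_id):
        """Check thresholds once per break interval, then reset the counts."""
        for (metric_id, unit), count in sorted(self._counts.items()):
            limit = self._thresholds.get(metric_id)
            if limit is not None and count > limit:
                self.alarms.append((interval_id, metric_id, unit, count))
        self._counts.clear()
```

Checking only at the end of each break interval keeps the threshold
relative to that interval: a counter that stays below the value in
every interval never alarms, however large its running total.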

> - I need to write a command line tool to convert the log into
> something that Graphviz can understand because I'd like to be able to
> generate time-series graphs from these metrics really easily.

As everybody is mentioning his favorite tools, let me throw in mine. :-)
I also like matplotlib and R, in that order. But anything is fine with
me.

Robin

-- 
Robin Sommer * Phone +1 (510) 722-6541 * robin at icir.org
ICSI/LBNL    * Fax   +1 (510) 666-2956 *   www.icir.org