[Bro-Dev] metrics framework
Robin Sommer
robin at icir.org
Tue May 10 14:29:58 PDT 2011
On Sun, May 08, 2011 at 22:24 -0400, you wrote:
> I'd appreciate if you guys took a look at the metrics framework and
> let me know what you think about it.
Pretty neat.
Thoughts:
- I'd split configuration of the metrics framework from adding
data. Currently the data producer also configures things via
create(), but it seems that's something better left to the user
of the metrics framework. Doing so would also answer your point
on setting up aggregation without using the create() function.
Can you just skip the create() function altogether? From the
producer's perspective, that function isn't really doing
anything, right?
You would then instead provide a configure() function that a
user of the metrics framework calls to define
aggregation/break_interval/etc., either globally or optionally
on a per ID basis.
In the absence of any call to configure(), just pick some
default, like aggregation per /24 and 10s intervals, or
whatever.
- I'd move the $increment field out of DataPlug and make it a
separate argument to add_data(). It has different semantics than
the other fields, and you could then rename DataPlug to just
Index.
- When no subnet aggregation is set but $host is passed in, I
think it won't work correctly. Your example for
HTTP_REQUESTS_BY_HOST uses $index for per-host aggregation, but
that looks like cheating. :-)
- I'm wondering whether executing log_it() gets expensive when it
needs to iterate through too many entries. An alternative would
be to schedule a number of more fine-granular timers (one per
ID, or even one per aggregation unit); but then the log
intervals would become desynchronized, which may not be
desirable.
> - Missing support for cluster deployment.
Yeah, that's a tough one. Full &synchronize would be overkill, but
sending the data via events, like you suggest, also sounds quite
expensive if there are lots of entities for which something's counted.
Here's an alternative idea: don't do any communication at all, and
just let the workers log their metrics data separately (into the same
log file but including a node id column). Then provide a script that
postprocesses metrics.log by adding up all the workers' counts for the
same unit/time interval. This might cause slight time
desynchronizations, but I'm not sure how much impact that would have if we
set sufficiently large break intervals.
Perhaps the manager could trigger logging by sending the log_it()
events, and only then would all the workers go ahead and do their
output. If the log_it() event comes with a unique interval ID, the
worker can write that out as well and then offline aggregation will be
really easy later (and if they in addition also log their local
timestamps, one can see how well the timing matches).
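The offline aggregation really would be trivial then. A minimal sketch
in Python, assuming each worker logs rows with (hypothetical) columns
node, interval, metric, index, and count:

```python
# Sketch of the offline postprocessing script for per-worker metrics
# logs; the column names are hypothetical.
import csv
from collections import defaultdict

def aggregate(rows):
    """Sum the per-worker counts for the same metric/index/interval,
    dropping the node id column in the process."""
    totals = defaultdict(int)
    for r in rows:
        totals[(r["interval"], r["metric"], r["index"])] += int(r["count"])
    return dict(totals)

# Typical use against a tab-separated metrics.log:
# with open("metrics.log") as f:
#     totals = aggregate(csv.DictReader(f, delimiter="\t"))
```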
> - Missing statistical support.
I'd leave that out for the first version. Or just do a very simple
piece: static thresholds relative to the break intervals (i.e.,
provide a function add_threshold(id, value) that alarms if a counter
for ID id exceeds value).
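Semantics-wise that's about as simple as it gets; sketched in Python
(add_threshold, check_thresholds, and the alarm callback are all
hypothetical names):

```python
# Static per-ID thresholds, checked once per break interval.
_thresholds = {}

def add_threshold(id, value):
    """Alarm if any counter for this ID exceeds value within a single
    break interval."""
    _thresholds[id] = value

def check_thresholds(counters, alarm):
    """Would be called from log_it() at the end of each break
    interval; counters maps (id, index) -> count."""
    for (id, index), count in counters.items():
        if id in _thresholds and count > _thresholds[id]:
            alarm(id, index, count)
```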
> - I need to write a command line tool to convert the log into
> something that Graphviz can understand because I'd like to be able to
> generate time-series graphs from these metrics really easily.
As everybody is mentioning their favorite tools, let me throw in mine. :-)
I also like matplotlib and R, in that order. But anything is fine with
me.
Robin
--
Robin Sommer * Phone +1 (510) 722-6541 * robin at icir.org
ICSI/LBNL * Fax +1 (510) 666-2956 * www.icir.org