[Bro-Dev] Broker data layouts

Thu Aug 23 06:32:39 PDT 2018

> Dominik, wasn't the original idea for VAST to provide an event
> description language that would create the link between the values
> coming over the wire and their interpretation? Such a specification
> could be auto-generated from Bro's knowledge about the events it
> generates.

We were actually thinking about auto-generating the schema, but broker::data simply carries no meta information that we could use. Even distinguishing records/tuples from actual lists is impossible, because broker::vector is used for both. Of course we can make a couple of assumptions (the top-level vector is a record, for example), but then VAST users can only ever use type queries. In other words, they can ask for IP addresses, for example, but not specifically for originator IPs.
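
To make the ambiguity concrete, here is a minimal sketch. It does not use Broker itself: `sketch::data`, `sketch::vector` and `sketch::address` are hypothetical, heavily simplified stand-ins for broker::data and broker::vector, and the example event layout is made up. It only illustrates that a record and a list end up as the same alternative, and that a type query finds addresses without knowing their role.

```cpp
#include <cstdint>
#include <string>
#include <variant>
#include <vector>

namespace sketch {

// Hypothetical stand-in for broker::data; the real type has many more
// alternatives. The point is only that there is a single vector type.
struct data;
using vector = std::vector<data>;

struct address {
  std::string text;
};

struct data {
  std::variant<std::uint64_t, std::string, address, vector> value;
};

// True for records and for lists alike: nothing tells them apart.
bool is_vector(const data& x) {
  return std::holds_alternative<vector>(x.value);
}

// A "type query": collect every address anywhere in the value, depth first.
// It cannot say which address is the originator, because no field names exist.
void collect_addresses(const data& x, std::vector<std::string>& out) {
  if (auto addr = std::get_if<address>(&x.value)) {
    out.push_back(addr->text);
  } else if (auto xs = std::get_if<vector>(&x.value)) {
    for (const auto& nested : *xs)
      collect_addresses(nested, out);
  }
}

// Made-up event: a "record" (originator, responder, port) that contains a
// nested "list" of service names. Both are plain vectors on the wire.
data make_example_event() {
  vector services{data{std::string{"http"}}, data{std::string{"ssl"}}};
  vector record{data{address{"10.0.0.1"}},
                data{address{"192.168.1.5"}},
                data{std::uint64_t{443}},
                data{services}};
  return data{record};
}

} // namespace sketch
```

Calling `collect_addresses` on the example event yields both addresses in positional order, while `is_vector` returns true for the outer record as well as for the nested list of services.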

In a sense, Broker’s representation is an inverted JSON. In JSON, we have field names but hardly any type information (an IP address or a timestamp is just a string), whereas in Broker we have (ambiguous) type information but no field names. :)

>> Though the Broker data corresponding to log entry content is also
>> opaque at the moment (I recall that was maybe for performance or
>> message volume optimization),
> 
> Yeah, but generally this is something I could see opening up. The log
> structure is pretty straight-forward and self-describing, it'd be
> mostly a matter of clean up and documentation to make that directly
> accessible to external consumers I think. Events, on the other hands,
> are semantically tied very closely to the scripts generating them, and
> also much more diverse so that self-description doesn't really seem
> feasible/useful. Republishing a relevant subset certainly sounds
> better for that; or, if it's really a bulk feed that's desired, some
> out-of-band mechanism to convey the schema information somehow.

Opening that up would be great.

However, our goal was to have Broker as a source for structured data that we can import in a generic fashion for later analysis. Of course that relies on a standard / convention / best practice for making the schema programmatically accessible. Currently, it seems that we need a schema definition provided by the user offline. This will work as long as all published data for a given topic is uniform. Multiplexing multiple event types already makes things complicated, but it seems like this is actually the standard use case. OSQuery, for example, will generate different events that we then either need to separate into different topics or multiplex in a single topic with some meta information merged in. And once we mix meta information with actual data, a simple schema definition no longer cuts it. At worst, importing data from Broker requires a separate parser for each import format.
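
As a rough sketch of the multiplexed case, assuming each message carries its event name as meta information next to the positional payload: all names here (`message`, `registry`, `apply_schema`) are hypothetical and not part of Broker or VAST, and values are kept as strings for brevity. The user-provided offline schema then becomes a table with one entry per event name rather than one per topic.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical multiplexed message: meta information plus positional payload.
struct message {
  std::string event_name;
  std::vector<std::string> args;
};

// Offline, user-provided schema: field names in positional order.
using schema = std::vector<std::string>;

// One schema per event name, since a single topic carries several event types.
using registry = std::map<std::string, schema>;

using labeled_row = std::map<std::string, std::string>;

// Attaches field names to the payload. Returns nullopt if the event is
// unknown or its arity does not match the registered schema.
std::optional<labeled_row> apply_schema(const registry& reg,
                                        const message& msg) {
  auto i = reg.find(msg.event_name);
  if (i == reg.end() || i->second.size() != msg.args.size())
    return std::nullopt;
  labeled_row row;
  for (std::size_t k = 0; k < msg.args.size(); ++k)
    row.emplace(i->second[k], msg.args[k]);
  return row;
}

} // namespace sketch
```

Anything that is missing from the registry, or that changes its number of arguments, cannot be imported this way, which is where the per-format parsers would come in.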

> broker/bro.hh is basically all there is right now

I’m a bit hesitant to rely on this header at the moment, because of:

/// A Bro log-write message. Note that at the moment this should be used only
/// by Bro itself as the arguments aren't publicly defined.

Is the API stable enough on your end at this point to make it public? Also, there are LogCreate and LogWrite events. The LogCreate has the `fields_data` (a list of field names?). Does that mean I need to receive the LogCreate event first to understand successive LogWrite events? That would mean I cannot parse logs that had their LogCreate event before I was able to subscribe to the topic.
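
To illustrate the concern, here is a minimal sketch of a consumer under the assumption that LogCreate does carry the field names and LogWrite only carries positional values. The types `log_create`, `log_write` and `log_consumer` are hypothetical simplifications, not the classes from broker/bro.hh.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical, simplified messages. The assumption is that field names
// travel only in the create message.
struct log_create {
  std::string stream;
  std::vector<std::string> field_names;
};

struct log_write {
  std::string stream;
  std::vector<std::string> values;
};

using labeled_entry = std::vector<std::pair<std::string, std::string>>;

class log_consumer {
public:
  // Remembers the field names announced for a stream.
  void handle(const log_create& x) {
    schemas_[x.stream] = x.field_names;
  }

  // Labels a log entry, or returns nullopt if no matching create message
  // was seen, e.g. because we subscribed after it was published.
  std::optional<labeled_entry> handle(const log_write& x) {
    auto i = schemas_.find(x.stream);
    if (i == schemas_.end() || i->second.size() != x.values.size()) {
      ++undecodable_;
      return std::nullopt;
    }
    labeled_entry result;
    for (std::size_t k = 0; k < x.values.size(); ++k)
      result.emplace_back(i->second[k], x.values[k]);
    return result;
  }

  // Number of log entries that could not be interpreted.
  std::size_t undecodable() const {
    return undecodable_;
  }

private:
  std::map<std::string, std::vector<std::string>> schemas_;
  std::size_t undecodable_ = 0;
};

} // namespace sketch
```

A consumer that subscribes late sees only `log_write` messages for an already created stream, so every call returns nullopt until the create message is somehow delivered again.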

    Dominik