[Bro] strategies for ingesting large number of PCAP files

M. Aaron Bossert mabossert at gmail.com
Mon Jul 31 13:50:07 PDT 2017


All,

I am working with a Storm topology to process a large number of PCAP files
of variable size, though they tend to be in the range of 100MB to 200MB,
give or take.  My current batch contains about 42K files.  I am aiming to
process them with as much parallelism as possible while avoiding the
problem of sessions that span more than one file (so you know why I am
doing this).

My main constraints/focus:

   1. Take advantage of the large number of cores (56) and RAM (~750GB) on
   my node(s)
   2. Avoid disk as much as possible (I have relatively slow spinning
   disks, though quite a few of them that can be addressed individually,
   which could mitigate the disk I/O bottleneck to some degree)
   3. Prioritize completeness above all else: reconstruct as many sessions
   as possible by stitching the packets back together in one of the ways
   below...or another if you folks have a better idea...


My thinking, and my hope for suggestions on the best approach (or a
completely different one if you have a better solution):


   1. Run mergecap and set up Bro to run as a cluster, and hope for the best
      1. *upside*: relatively simple and the lowest level of effort
      2. *downside*: not sure it will scale the way I want.  I'd prefer to
      isolate Bro to no more than two nodes in my cluster; each node has
      56 cores and ~750GB RAM.  Also, it will be one more hack to have to
      work into my Storm topology
   2. Use the Storm topology (or something else) to re-write packets to
   individual files based on SIP/DIP/SPORT/DPORT or similar
      1. *upside*: this will ensure a certain level of parallelism and keep
      the logic inside my topology, where I can control it to the greatest
      extent
      2. *downside*: this seems horribly inefficient because I will have to
      read the PCAP files twice: once to split them and once again when Bro
      gets them, and then again to read the Bro logs (if I don't get the
      Kafka plugins to do what I want).  Also, this will require some sort
      of load balancing to ensure that IPs that represent a disproportionate
      percentage of traffic don't gum up the works, and that IPs with
      relatively little traffic don't take up too many resources.  My
      thought here is to simply keep track of approximate file sizes and
      send IPs to files in rough balance (though still always sending any
      given IP/port pair to the same file).  This also makes me interact
      with the HDDs at least three times (once to read PCAP, once to write
      PCAP, and again to read Bro logs), which is undesirable
   3. Use the Storm topology or tcpreplay (or similar) to read in PCAP
   files, then write to virtual interfaces (a pool set up manually) so that
   Bro can simply listen on each interface and process as appropriate.
      1. *upside*: seems like this could be the most efficient option, as
      it probably avoids disk the most and could scale very well.  It would
      support clustering by simply creating pools of interfaces on multiple
      nodes, session-ization takes care of itself, and I just need to tell
      Bro to wait longer for packets to show up so it doesn't think the
      interface went dead if there are lulls in traffic
      2. *downside*: the most complex of the bunch, and I am uncertain of
      my ability to preserve timestamps when sending the packets over the
      interface to Bro
   4. Extend Bro to not only write directly to Kafka topics, but also to
   read from them, such that I could use one of the methods above to split
   traffic and load balance, and then have Bro simply spit out logs to
   another topic of my choosing
      1. *upside*: this could be the most elegant solution because it would
      allow me to handle failures and hiccups using Kafka offsets
      2. *downside*: this is easily the most difficult for me to implement,
      as I have not messed with extending Bro at all.
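For what it's worth, the load-balancing idea in option 2 (always send a
given flow to the same file, while keeping the files in rough byte balance)
could be sketched roughly like this.  This is a hypothetical illustration
only: the pool size, names, and the idea of picking the lightest file for
each new flow are my assumptions, and real code would sit inside a Storm
bolt rather than stand alone.

```python
# Hypothetical sketch of option 2's flow-to-file assignment: pin each
# session (both directions) to one output file, and steer new flows to
# the file with the fewest bytes so far so hot IPs don't gum up the works.

NUM_FILES = 8  # assumed pool size; in practice, tune to the Bro worker count


def flow_key(sip, dip, sport, dport):
    """Canonicalize the 5-tuple so both directions of a session
    map to the same key (keeps whole sessions in one file)."""
    a, b = (sip, sport), (dip, dport)
    return (a, b) if a <= b else (b, a)


class FlowBalancer:
    def __init__(self, num_files=NUM_FILES):
        self.bytes_per_file = [0] * num_files
        self.assignment = {}  # canonical flow key -> file index

    def file_for(self, sip, dip, sport, dport, pkt_len):
        key = flow_key(sip, dip, sport, dport)
        if key not in self.assignment:
            # New flow: assign it to the currently lightest file.
            idx = min(range(len(self.bytes_per_file)),
                      key=self.bytes_per_file.__getitem__)
            self.assignment[key] = idx
        idx = self.assignment[key]
        self.bytes_per_file[idx] += pkt_len
        return idx


balancer = FlowBalancer()
# Both directions of the same session land in the same file.
f1 = balancer.file_for("10.0.0.1", "10.0.0.2", 1234, 80, 1500)
f2 = balancer.file_for("10.0.0.2", "10.0.0.1", 80, 1234, 1500)
assert f1 == f2
```

The byte counts here are only approximate by design: once a flow is
pinned, it stays put even if it later turns out to be heavy, which matches
the "rough balance" goal above.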

Any suggestions or feedback would be greatly appreciated!  Thanks in
advance...

Aaron

P.S. Sorry for the verbose message...I was hoping to give as complete a
problem/solution statement as I can.
