[Bro] strategies for ingesting large number of PCAP files

Mon Jul 31 14:14:41 PDT 2017

This is not for Metron...I am doing something different...research.  I can
see the read-from-a-directory approach working, but unfortunately I cannot
control the file naming scheme, nor can I be certain the "last-modified"
timestamp will be entirely reliable.  That is why I was trying to deal with
the packets and their respective timestamps directly...
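
One way to work from packet timestamps instead of file names or mtimes is to peek at the first record of each capture and order the files by that. The sketch below is only an illustration: it assumes classic libpcap files (not pcapng), and the function names are made up for this example.

```python
import struct
from pathlib import Path

# Classic pcap magic numbers as they appear on disk, mapped to
# (struct byte order, divisor for the fractional timestamp field).
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", 1e6),  # little-endian, microseconds
    b"\xa1\xb2\xc3\xd4": (">", 1e6),  # big-endian, microseconds
    b"\x4d\x3c\xb2\xa1": ("<", 1e9),  # little-endian, nanoseconds
    b"\xa1\xb2\x3c\x4d": (">", 1e9),  # big-endian, nanoseconds
}


def first_packet_time(path):
    """Return the first packet's timestamp in seconds, or None if unreadable."""
    with open(path, "rb") as f:
        header = f.read(24)  # pcap global header
        if len(header) < 24 or header[:4] not in PCAP_MAGICS:
            return None
        endian, frac_scale = PCAP_MAGICS[header[:4]]
        record = f.read(16)  # first per-packet record header
        if len(record) < 16:
            return None  # file holds no packets
        ts_sec, ts_frac, _incl, _orig = struct.unpack(endian + "IIII", record)
        return ts_sec + ts_frac / frac_scale


def order_by_capture_time(directory):
    """List files in `directory` sorted by first-packet time, ignoring names."""
    timed = []
    for path in Path(directory).iterdir():
        if path.is_file():
            ts = first_packet_time(path)
            if ts is not None:
                timed.append((ts, path))
    return [path for _ts, path in sorted(timed, key=lambda item: item[0])]
```

This only reads 40 bytes per file, so ordering a large batch should be cheap even on slow disks. It does not detect captures whose time ranges overlap.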

I write code in Scala/Java and Perl, but have never written anything in
Python...which is also a hurdle I will have to deal with (but am willing
to deal with if needed).

On Mon, Jul 31, 2017 at 5:06 PM, Azoff, Justin S <jazoff at illinois.edu>
wrote:

> Is this for that Metron thing?  I had chatted with someone on irc about
> this a ways back.  I think the simplest way to integrate bro would be to
> write a 'pcapdir' packet source plugin that works similarly to the 'live'
> pcap mode, but instead reads packets from sequentially numbered pcap files
> in a directory.  That way fetching the next packet would boil down to, in
> pythonish pseudocode:
>
>
>    loop:
>         while current_file is None:
>             current_file = find_next_pcap_file()
>             if current_file is None:
>                 sleep(100ms)    # nothing to read yet, poll again
>         packet = get_next_packet(current_file)
>         if packet:
>             return packet
>         # end of file reached: clean up and move on to the next one
>         close(current_file)
>         delete(current_file)
>         current_file = None
>
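
The loop above can be prototyped outside of Bro to check the file handling before writing a real packet source plugin. This is a rough standalone sketch, not Bro plugin code: it assumes classic libpcap files named so that lexicographic order matches arrival order (e.g. zero-padded numbers with a `.pcap` suffix), and `max_idle_polls` is a made-up parameter that lets the loop stop when testing.

```python
import os
import struct
import time

LITTLE_ENDIAN_MAGICS = (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1")
BIG_ENDIAN_MAGICS = (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d")


def read_packets(path):
    """Yield (ts_sec, ts_frac, data) for each record in a classic pcap file."""
    with open(path, "rb") as f:
        header = f.read(24)  # global header
        if header[:4] in LITTLE_ENDIAN_MAGICS:
            endian = "<"
        elif header[:4] in BIG_ENDIAN_MAGICS:
            endian = ">"
        else:
            return  # not a classic pcap file
        while True:
            record = f.read(16)  # per-packet header
            if len(record) < 16:
                return  # clean end of file
            ts_sec, ts_frac, incl_len, _orig_len = struct.unpack(
                endian + "IIII", record)
            data = f.read(incl_len)
            if len(data) < incl_len:
                return  # truncated final record
            yield ts_sec, ts_frac, data


def packets_from_directory(directory, poll_interval=0.1, max_idle_polls=None):
    """Yield packets from *.pcap files in name order, deleting each when done."""
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        names = sorted(n for n in os.listdir(directory) if n.endswith(".pcap"))
        if not names:
            idle += 1
            time.sleep(poll_interval)  # nothing to read yet, poll again
            continue
        idle = 0
        path = os.path.join(directory, names[0])
        yield from read_packets(path)
        os.remove(path)  # file fully consumed
```

Because files without the `.pcap` suffix are ignored, a writer can stage a file under another name and rename it once it is complete.
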
> packet source plugins are pretty easy to write, you could probably get
> this working in a few hours.  It would be a lot easier to implement than
> interfacing with kafka directly.
>
> Then you just need to atomically move pcap files into the directory that
> bro is watching.  Since a single instance of bro is running you don't have
> to worry about sessions that span more than one file... that should just
> work normally.
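
For the "atomically move" step, a common pattern is to stage the file under a temporary name inside the watched directory and then rename it, since a rename within one filesystem is atomic on POSIX systems. A minimal sketch, assuming the reader only picks up names ending in `.pcap`; `publish_pcap` and the `.incoming-` prefix are made up for this example.

```python
import os
import shutil
import tempfile


def publish_pcap(src_path, watched_dir, final_name):
    """Move src_path into watched_dir so a reader never sees a partial file."""
    final_path = os.path.join(watched_dir, final_name)
    # Stage inside the watched directory so the final rename stays on one
    # filesystem; the ".tmp" suffix keeps the reader from picking it up early.
    fd, tmp_path = tempfile.mkstemp(
        dir=watched_dir, prefix=".incoming-", suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(src_path, tmp_path)
        os.replace(tmp_path, final_path)  # atomic rename into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # do not leave a half-written staging file
        raise
    os.remove(src_path)
    return final_path
```

If the source already lives on the same filesystem as the watched directory, a single `os.replace(src_path, final_path)` would do and avoids the extra copy.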
>
> To use more than one bro process you would just need to write a tool that
> can read pcaps, hash by 2 or 4 tuple, and output sliced pcaps to different
> places.
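
The hashing step needs to be symmetric, so that both directions of a connection land in the same slice. A small sketch of one way to do that, with made-up names; it uses CRC32 only because Python's built-in `hash()` of bytes varies between processes, not because CRC32 is what any particular tool uses.

```python
import ipaddress
import zlib


def flow_bucket(src_ip, dst_ip, src_port=0, dst_port=0, buckets=4):
    """Map a flow to a bucket index, independent of packet direction.

    Leave the ports at 0 for a 2-tuple (address pair only) hash.
    """
    a = (ipaddress.ip_address(src_ip).packed, src_port)
    b = (ipaddress.ip_address(dst_ip).packed, dst_port)
    # Sort the two endpoints so A->B and B->A produce the same key.
    lo, hi = sorted((a, b))
    key = (lo[0] + lo[1].to_bytes(2, "big") +
           hi[0] + hi[1].to_bytes(2, "big"))
    return zlib.crc32(key) % buckets


# Both directions of the same connection map to the same bucket.
forward = flow_bucket("10.0.0.1", "10.0.0.2", 51234, 80)
reverse = flow_bucket("10.0.0.2", "10.0.0.1", 80, 51234)
assert forward == reverse
```

A pure hash spreads flows evenly but not bytes, so a few heavy flows can still overload one slice.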
>
> You should be able to do everything but the last part without pcaps ever
> touching the disk.
>
> You probably want to avoid using something like tcpreplay since you'd lose
> a lot of the performance benefits of bro reading from pcap files.
>
> --
> - Justin Azoff
>
> > On Jul 31, 2017, at 4:50 PM, M. Aaron Bossert <mabossert at gmail.com> wrote:
> >
> > All,
> >
> > I am working with a Storm topology to process a large number of PCAP
> > files which can be of variable sizes, but tend to be in the range of
> > 100MB to 200MB, give or take.  My current batch to work on contains
> > about 42K files...I am aiming to process them with as much parallelism
> > as possible while avoiding the issue of sessions that span more than
> > one file (so you know why I am doing this).
> >
> > my main constraints/focus:
> >       • take advantage of large number of cores (56) and RAM (~750GB)
> >         on my node(s)
> >       • Avoid disk as much as possible (I have relatively slow spinning
> >         disks, though quite a few of them that can be addressed
> >         individually, which could mitigate the disk IO bottleneck to
> >         some degree)
> >       • Prioritize completeness above all else...get as many sessions
> >         reconstructed as possible by stitching the packets back
> >         together in one of the ways below...or another if you folks
> >         have a better idea...
> >
> > my thinking...and hope for suggestions on the best approach...or a
> > completely different one if you have a better solution:
> >
> >       • run mergecap and set up Bro to run as a cluster and hope for
> >         the best
> >               • upside: relatively simple and lowest level of effort
> >               • downside: not sure it will scale the way I want.  I'd
> >                 prefer to isolate Bro to running on no more than two
> >                 nodes in my cluster...each node has 56 cores and
> >                 ~750GB RAM.  Also, it will be one more hack to have to
> >                 work into my Storm topology
> >       • use Storm topology (or something else) to re-write packets to
> >         individual files based on SIP/DIP/SPORT/DPORT or similar
> >               • upside: this will ensure a certain level of
> >                 parallelism and keep the logic inside my topology
> >                 where I can control it to the greatest extent
> >               • downside: This seems like it is horribly inefficient
> >                 because I will have to read the PCAP files twice (once
> >                 to split and once again when Bro gets them), and then
> >                 read the Bro logs as well (if I don't get the Kafka
> >                 plugins to do what I want).  Also, this will require
> >                 some sort of load balancing to ensure that IPs that
> >                 represent a disproportionate percentage of traffic
> >                 don't gum up the works, nor do IPs that have
> >                 relatively little traffic take up too many resources.
> >                 My thought here is to simply keep track of approximate
> >                 file sizes and send IPs in rough balance (though still
> >                 always sending any given IP/port pair to the same
> >                 file).  Also, this makes me interact with the HDDs at
> >                 least three times (once to read PCAP, next to write
> >                 PCAP, again to read Bro logs), which is undesirable
> >       • Use Storm topology or tcpreplay (or similar) to read in PCAP
> >         files, then write to virtual interfaces (a pool set up
> >         manually) so that Bro can simply listen on each interface and
> >         process as appropriate.
> >               • upside: Seems like this could be the most efficient
> >                 option as it probably avoids disk the most, seems like
> >                 it could scale very well, and would support clustering
> >                 by simply creating pools of interfaces on multiple
> >                 nodes.  Session-ization takes care of itself and I
> >                 just need to tell Bro to wait longer for packets to
> >                 show up so it doesn't think the interface went dead if
> >                 there are lulls in traffic
> >               • downside: Most complex of the bunch and I am uncertain
> >                 of my ability to preserve timestamps when sending the
> >                 packets over the interface to Bro
> >       • Extend Bro to not only write directly to Kafka topics, but
> >         also to read from them such that I could use one of the
> >         methods above to split traffic and load balance and then have
> >         Bro simply spit out logs to another topic of my choosing
> >               • upside: This could be the most elegant solution
> >                 because it will allow me to handle failures and
> >                 hiccups using Kafka offsets
> >               • downside: This is easily the most difficult to
> >                 implement for me as I have not messed with extending
> >                 Bro at all.
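
For the second option in the list above (re-writing packets to per-flow files), the "rough balance" bookkeeping could look something like the sketch below: a flow goes to whichever output has received the fewest bytes when the flow is first seen, and stays there afterwards. `StickyBalancer` is a made-up name, and the flow key is assumed to be direction-independent (e.g. a sorted IP/port pair).

```python
class StickyBalancer:
    """Assign flows to outputs by bytes written, keeping each flow sticky."""

    def __init__(self, num_outputs):
        self.bytes_per_output = [0] * num_outputs
        self.assignment = {}  # flow_key -> output index

    def assign(self, flow_key, packet_len):
        """Return the output index for this packet and record its size."""
        out = self.assignment.get(flow_key)
        if out is None:
            # New flow: pick the output with the fewest bytes so far.
            out = min(range(len(self.bytes_per_output)),
                      key=self.bytes_per_output.__getitem__)
            self.assignment[flow_key] = out
        self.bytes_per_output[out] += packet_len
        return out


balancer = StickyBalancer(num_outputs=2)
assert balancer.assign("heavy-flow", 1000) == 0
assert balancer.assign("small-flow-1", 10) == 1
assert balancer.assign("small-flow-2", 10) == 1  # output 1 is still lighter
assert balancer.assign("heavy-flow", 5) == 0     # existing flows never move
```

The assignment table grows with the number of distinct flows, and a flow that turns out to be heavy after assignment is not rebalanced, so this is only approximate.
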
> > Any suggestions or feedback would be greatly appreciated!  Thanks in
> > advance...
> >
> > Aaron
> >
> > P.S. sorry for the verbose message...but was hoping to give as complete
> > a problem/solution statement as I can
>
>