[Bro] strategies for ingesting large number of PCAP files

Mon Jul 31 14:21:38 PDT 2017

Sorry, I missed the last part. I already have code written (a Storm topology)
that reads the PCAP files and sends them to the follow-on bolt to write to
disk, or wherever. It already hashes on the 2/4/5-tuple (configurable), and
I am using pcap4j for that. So writing files to disk is trivial and mostly
done, but I kept getting heartburn when I looked at the disk I/O
implications (read, write, read, write, read, send to Kafka, which is
another write once it persists).  This approach also means that I am
probably stuck reading the PCAP twice: once to process it for Bro and once
to process the PCAP itself, part of which involves an in-memory JOIN of
sorts that associates each Bro log entry (conn.log) with the individual
packets that made up the session.
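
A minimal Python sketch of that kind of in-memory join, under stated
assumptions: conn.log records are already parsed into dicts with numeric
`ts`/`duration` values and the standard `id.orig_h`, `id.orig_p`, `id.resp_h`,
`id.resp_p`, `proto`, and `uid` fields; the packet dict keys (`src_ip`,
`src_port`, ...) and the `slack` tolerance are hypothetical names chosen for
illustration. The same direction-independent key could also serve as the hash
input for splitting traffic.

```python
from bisect import bisect_right
from collections import defaultdict


def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-independent 5-tuple: both halves of a session share one key."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)


def index_connections(conn_records):
    """Group conn.log records by flow key, sorted by start time."""
    index = defaultdict(list)
    for rec in conn_records:
        key = flow_key(rec["id.orig_h"], rec["id.orig_p"],
                       rec["id.resp_h"], rec["id.resp_p"], rec["proto"])
        start = rec["ts"]
        end = start + (rec.get("duration") or 0.0)  # duration may be unset
        index[key].append((start, end, rec["uid"]))
    for conns in index.values():
        conns.sort()
    return index


def match_packet(index, pkt, slack=1.0):
    """Return the uid of the connection a packet belongs to, or None.

    `slack` (seconds) is an assumed tolerance for timestamp differences.
    """
    key = flow_key(pkt["src_ip"], pkt["src_port"],
                   pkt["dst_ip"], pkt["dst_port"], pkt["proto"])
    conns = index.get(key, [])
    # Latest connection on this flow that started no later than the packet.
    pos = bisect_right(conns, (pkt["ts"] + slack, float("inf"), "")) - 1
    if pos < 0:
        return None
    _start, end, uid = conns[pos]
    return uid if pkt["ts"] <= end + slack else None
```

Matching on the time window as well as the tuple keeps packets apart when the
same address/port pair is reused by a later connection.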

On Mon, Jul 31, 2017 at 5:14 PM, M. Aaron Bossert <mabossert at gmail.com>
wrote:

> This is not for Metron...I am doing something different...research...I
> could see the read from a directory approach, but unfortunately, I cannot
> control the file naming scheme nor can I be certain the "last-modified"
> timestamp will be entirely reliable.  That is why I was trying to deal with
> the packets and their respective timestamps directly...
>
> I write code in Scala/Java and Perl, but have never written anything in
> Python, which is also a hurdle I will have to deal with (but am willing
> to deal with if needed).
>
> On Mon, Jul 31, 2017 at 5:06 PM, Azoff, Justin S <jazoff at illinois.edu>
> wrote:
>
>> Is this for that Metron thing?  I had chatted with someone on irc about
>> this a ways back.  I think the simplest way to integrate bro would be to
>> write a 'pcapdir' packet source plugin that works similarly to the 'live'
>> pcap mode, but instead reads packets from sequentially numbered pcap files
>> in a directory.  That way fetching the next packet would boil down to, in
>> pythonish pseudocode
>>
>>
>>    loop:
>>         while not current_file:
>>             current_file = find_next_pcap_file()
>>             if not current_file:
>>                 sleep(0.1)    # nothing new yet, wait 100ms
>>         packet = get_next_packet(current_file)
>>         if packet:
>>             return packet
>>         close(current_file)   # file is exhausted
>>         delete(current_file)
>>         current_file = None
>>
>> packet source plugins are pretty easy to write, you could probably get
>> this working in a few hours.  It would be a lot easier to implement than
>> interfacing with kafka directly.
>>
>> Then you just need to atomically move pcap files into the directory that
>> bro is watching.  Since a single instance of bro is running you don't have
>> to worry about sessions that span more than one file... that should just
>> work normally.
>>
>> To use more than one bro process you would just need to write a tool that
>> can read pcaps, hash by 2 or 4 tuple, and output sliced pcaps to different
>> places.
>>
>> You should be able to do everything but the last part without pcaps ever
>> touching the disk.
>>
>> You probably want to avoid using something like tcpreplay since you'd
>> lose a lot of the performance benefits of bro reading from pcap files.
>>
>> --
>> - Justin Azoff
>>
>> > On Jul 31, 2017, at 4:50 PM, M. Aaron Bossert <mabossert at gmail.com>
>> wrote:
>> >
>> > All,
>> >
>> > I am working with a Storm topology to process a large number of PCAP
>> > files of variable size, though they tend to be in the range of 100MB to
>> > 200MB, give or take.  My current batch contains about 42K files.  I am
>> > aiming to process them with as much parallelism as possible while
>> > avoiding the issue of sessions that span more than one file (so you know
>> > why I am doing this).
>> >
>> > My main constraints/focus:
>> >       • Take advantage of the large number of cores (56) and RAM (~750GB)
>> >         on my node(s)
>> >       • Avoid disk as much as possible (I have relatively slow spinning
>> >         disks, though quite a few of them, and they can be addressed
>> >         individually, which could mitigate the disk I/O bottleneck to
>> >         some degree)
>> >       • Prioritize completeness above all else: get as many sessions
>> >         reconstructed as possible by stitching the packets back together
>> >         in one of the ways below, or another if you folks have a better
>> >         idea
>> >
>> > My thinking, with hope for suggestions on the best approach, or a
>> > completely different one if you have a better solution:
>> >
>> >       • Run mergecap and set up Bro to run as a cluster and hope for the
>> >         best
>> >               • upside: relatively simple and the lowest level of effort
>> >               • downside: not sure it will scale the way I want.  I'd
>> >                 prefer to isolate Bro to no more than two nodes in my
>> >                 cluster (each node has 56 cores and ~750GB RAM).  Also,
>> >                 it will be one more hack to work into my Storm topology
>> >       • Use the Storm topology (or something else) to rewrite packets to
>> >         individual files based on SIP/DIP/SPORT/DPORT or similar
>> >               • upside: this will ensure a certain level of parallelism
>> >                 and keep the logic inside my topology, where I can
>> >                 control it to the greatest extent
>> >               • downside: this seems horribly inefficient because I will
>> >                 have to read the PCAP files twice (once to split and once
>> >                 again when Bro gets them) and then read the Bro logs (if
>> >                 I don't get the Kafka plugins to do what I want).  Also,
>> >                 this will require some sort of load balancing to ensure
>> >                 that IPs that represent a disproportionate percentage of
>> >                 traffic don't gum up the works, and that IPs with
>> >                 relatively little traffic don't take up too many
>> >                 resources.  My thought here is to simply keep track of
>> >                 approximate file sizes and send IPs in rough balance
>> >                 (though still always sending any given IP/port pair to
>> >                 the same file).  Also, this makes me interact with the
>> >                 HDDs at least three times (once to read PCAP, next to
>> >                 write PCAP, again to read Bro logs), which is undesirable
>> >       • Use the Storm topology or TCP replay (or similar) to read in PCAP
>> >         files, then write to virtual interfaces (a pool set up manually)
>> >         so that Bro can simply listen on each interface and process as
>> >         appropriate
>> >               • upside: this seems like it could be the most efficient
>> >                 option, as it probably avoids disk the most and could
>> >                 scale very well.  It would support clustering by simply
>> >                 creating pools of interfaces on multiple nodes, and
>> >                 session-ization takes care of itself.  I just need to
>> >                 tell Bro to wait longer for packets to show up so it
>> >                 doesn't think the interface went dead if there are lulls
>> >                 in traffic
>> >               • downside: the most complex of the bunch, and I am
>> >                 uncertain of my ability to preserve timestamps when
>> >                 sending the packets over the interface to Bro
>> >       • Extend Bro to not only write directly to Kafka topics, but also
>> >         to read from them, such that I could use one of the methods
>> >         above to split traffic and load balance, and then have Bro simply
>> >         spit out logs to another topic of my choosing
>> >               • upside: this could be the most elegant solution because
>> >                 it will allow me to handle failures and hiccups using
>> >                 Kafka offsets
>> >               • downside: this is easily the most difficult for me to
>> >                 implement, as I have not messed with extending Bro at all
>> >
>> > Any suggestions or feedback would be greatly appreciated!  Thanks in
>> > advance...
>> >
>> > Aaron
>> >
>> > P.S. Sorry for the verbose message, but I was hoping to give as complete
>> > a problem/solution statement as I can.
>>
>>
>
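
For reference, the directory-polling loop from Justin's pseudocode quoted
above can be prototyped outside Bro in a few lines of Python. This is only a
sketch under stated assumptions, not a Bro packet source plugin: it handles
classic microsecond-resolution pcap files only (no pcapng), it assumes file
names are zero-padded so that lexical order equals sequence order, and the
names `read_pcap`, `packets_from_dir`, `publish`, and `max_idle_polls` are
hypothetical. `max_idle_polls` exists only so the loop can terminate in a
test; leaving it as `None` polls forever, as in the pseudocode.

```python
import os
import struct
import time

PCAP_GLOBAL_HDR = 24  # classic libpcap file header size in bytes
PCAP_RECORD_HDR = 16  # per packet: ts_sec, ts_usec, incl_len, orig_len


def read_pcap(path):
    """Yield (timestamp, raw_bytes) for each packet in a classic pcap file."""
    with open(path, "rb") as fh:
        header = fh.read(PCAP_GLOBAL_HDR)
        if len(header) < PCAP_GLOBAL_HDR:
            return
        magic = header[:4]
        if magic == b"\xd4\xc3\xb2\xa1":
            endian = "<"  # written on a little-endian host
        elif magic == b"\xa1\xb2\xc3\xd4":
            endian = ">"  # written on a big-endian host
        else:
            raise ValueError("unsupported capture format: " + path)
        while True:
            rec = fh.read(PCAP_RECORD_HDR)
            if len(rec) < PCAP_RECORD_HDR:
                return
            sec, usec, incl_len, _orig_len = struct.unpack(endian + "IIII", rec)
            data = fh.read(incl_len)
            if len(data) < incl_len:
                return  # truncated final record
            yield sec + usec / 1e6, data


def packets_from_dir(directory, poll_seconds=0.1, max_idle_polls=None):
    """Yield packets from pcap files in name order, deleting each when done."""
    idle = 0
    while max_idle_polls is None or idle < max_idle_polls:
        names = sorted(n for n in os.listdir(directory) if n.endswith(".pcap"))
        if not names:
            idle += 1
            time.sleep(poll_seconds)  # nothing new yet
            continue
        idle = 0
        path = os.path.join(directory, names[0])
        yield from read_pcap(path)
        os.remove(path)  # file exhausted


def publish(tmp_path, directory):
    """Move a finished file into the watched directory in one rename.

    The rename is only atomic when both paths are on the same filesystem.
    """
    os.replace(tmp_path, os.path.join(directory, os.path.basename(tmp_path)))
```

Writing each file somewhere else first and then calling `publish` means the
reader never sees a half-written capture.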