[Bro] strategies for ingesting large number of PCAP files
Azoff, Justin S
jazoff at illinois.edu
Mon Jul 31 14:06:50 PDT 2017
Is this for that Metron thing? I had chatted with someone on IRC about this a while back. I think the simplest way to integrate Bro would be to write a 'pcapdir' packet source plugin that works like the 'live' pcap mode, but instead reads packets from sequentially numbered pcap files in a directory. That way fetching the next packet would boil down to, in pythonish pseudocode:
while not current_file:
    current_file = find_next_pcap_file()
    if not current_file:
        sleep(100ms)
packet = get_next_packet(current_file)
if not packet:  # reached the end of this file
    close(current_file)
    delete(current_file)
    current_file = None
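A slightly fleshed-out sketch of that loop in plain Python, just to make the idea concrete. The function name, the `.pcap` suffix convention, and the `max_polls` escape hatch are all illustrative (a real Bro packet source plugin would be C++ against the plugin API), and the minimal parsing assumes classic little- or big-endian pcap files:

```python
import os
import struct
import time

def packets_from_dir(directory, poll_interval=0.1, max_polls=None):
    """Yield (ts_sec, raw_bytes) from sequentially numbered pcap files
    in `directory`, deleting each file once fully consumed."""
    polls = 0
    while True:
        files = sorted(f for f in os.listdir(directory) if f.endswith(".pcap"))
        if not files:
            polls += 1
            if max_polls is not None and polls >= max_polls:
                return              # give up (for testing); a plugin would poll forever
            time.sleep(poll_interval)
            continue
        path = os.path.join(directory, files[0])
        with open(path, "rb") as fh:
            magic = struct.unpack("<I", fh.read(4))[0]
            endian = "<" if magic in (0xA1B2C3D4, 0xA1B23C4D) else ">"
            fh.read(20)             # skip the rest of the 24-byte global header
            while True:
                rec = fh.read(16)   # per-packet record header
                if len(rec) < 16:
                    break           # end of this pcap file
                ts_sec, _frac, incl_len, _orig = struct.unpack(endian + "IIII", rec)
                yield ts_sec, fh.read(incl_len)
        os.remove(path)             # consumed; delete so it is never re-read
```

Deleting only after the whole file has been read is what makes the "atomically move files in" handoff below safe to combine with this reader.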
Packet source plugins are pretty easy to write; you could probably get this working in a few hours. It would be a lot easier to implement than interfacing with Kafka directly.
Then you just need to atomically move pcap files into the directory that Bro is watching. Since a single instance of Bro is running, you don't have to worry about sessions that span more than one file... that should just work normally.
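The "atomically move" step might look like this (paths and the `.tmp` staging convention are made up for illustration). The point is that a rename within one filesystem is atomic, so the watcher never sees a half-written file:

```python
import os

def publish_pcap(tmp_path, watch_dir):
    """Move a finished pcap from its staging location into the watched
    directory. os.rename() is atomic when source and destination are on
    the same filesystem, so a reader never observes a partial file."""
    name = os.path.basename(tmp_path)
    if name.endswith(".tmp"):
        name = name[:-4]            # strip the staging suffix
    dst = os.path.join(watch_dir, name)
    os.rename(tmp_path, dst)        # atomic rename(2) on the same filesystem
    return dst
```

Write the file out completely under the `.tmp` name first, then call this; if staging and watched directories were on different filesystems, `os.rename` would fail rather than silently copy.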
To use more than one Bro process, you would just need to write a tool that can read pcaps, hash by 2- or 4-tuple, and output sliced pcaps to different places.
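The hash step is the only subtle part: it has to be symmetric in direction so both halves of a conversation land in the same slice. A minimal sketch (function name and slice count are hypothetical):

```python
import socket
import struct
import zlib

def flow_slice(sip, dip, sport, dport, n_slices):
    """Pick an output slice for a packet by hashing its 4-tuple.
    Sorting the two endpoints first makes the hash direction-symmetric,
    so both sides of a session always go to the same output pcap."""
    a = socket.inet_aton(sip) + struct.pack("!H", sport)
    b = socket.inet_aton(dip) + struct.pack("!H", dport)
    lo, hi = (a, b) if a <= b else (b, a)
    return zlib.crc32(lo + hi) % n_slices
```

CRC32 is used here only because it is deterministic across runs (unlike Python's built-in `hash()`); any stable hash works. Note a plain modulo gives no load balancing for heavy-hitter IPs, which is the concern raised in the quoted message below.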
You should be able to do everything but the last part without pcaps ever touching the disk.
You probably want to avoid using something like tcpreplay, since you'd lose a lot of the performance benefit of Bro reading directly from pcap files.
- Justin Azoff
> On Jul 31, 2017, at 4:50 PM, M. Aaron Bossert <mabossert at gmail.com> wrote:
> I am working with a storm topology to process a large number of PCAP files which can be of variable sizes, but tend to be in the range of 100MB to 200MB, give or take. My current batch to work on contains about 42K files...I am aiming to process with as much parallelism as possible while avoiding the issue of sessions that span more than one file (so you know why I am doing this)
> my main constraints/focus:
> • Take advantage of large number of cores (56) and RAM (~750GB) on my node(s)
> • Avoid disk as much as possible (I have relatively slow spinning disks, though quite a few of them that can be addressed individually, which could mitigate the disk IO bottleneck to some degree)
> • Prioritize completeness above all else...get as many sessions reconstructed as possible by stitching the packets back together in one of the ways below...or another if you folks have a better idea...
> my thinking...and hope for suggestions on the best approach...or a completely different one if you have a better solution:
> • run mergecap and setup bro to run as a cluster and hope for the best
> • upside: relatively simple and lowest level of effort
> • downside: not sure it will scale the way I want. I'd prefer to isolate Bro to running on no more than two nodes in my cluster...each node has 56 cores and ~750GB RAM. Also, it will be one more hack to have to work into my Storm topology
> • use Storm topology (or something else) to re-write packets to individual files based on SIP/DIP/SPORT/DPORT or similar
> • upside: this will ensure a certain level of parallelism and keep the logic inside my topology where I can control it to the greatest extent
> • downside: This seems like it is horribly inefficient because I will have to read the PCAP files twice: once to split and once again when Bro gets them, and again to read the Bro logs (if I don't get the Kafka plugins to do what I want). Also, this will require some sort of load balancing to ensure that IPs that represent a disproportionate percentage of traffic don't gum up the works, nor do IPs that have relatively little traffic take up too many resources. My thought here is to simply keep track of approximate file sizes and send IPs in rough balance (though still always sending any given IP/port pair to the same file). Also, this makes me interact with the HDDs at least three times (once to read PCAP, next to write PCAP, again to read Bro logs), which is undesirable
> • Use Storm topology or TCP replay (or similar) to read in PCAP files, then write to virtual interfaces (a pool setup manually) so that Bro can simply listen on each interface and process as appropriate.
> • upside: Seems like this could be the most efficient option as it probably avoids disk the most, seems like it could scale very well, and would support clustering by simply creating pools of interfaces on multiple nodes. Session-ization takes care of itself, and I just need to tell Bro to wait longer for packets to show up so it doesn't think the interface went dead if there are lulls in traffic
> • downside: Most complex of the bunch and I am uncertain of my ability to preserve timestamps when sending the packets over the interface to Bro
> • Extend Bro to not only write directly to Kafka topics, but also to read from them such that I could use one of the methods above to split traffic and load balance and then have Bro simply spit out logs to another topic of my choosing
> • upside: This could be the most elegant solution because it will allow me to handle failures and hiccups using Kafka offsets
> • downside: This is easily the most difficult to implement for me as I have not messed with extending Bro at all.
> Any suggestions or feedback would be greatly appreciated! Thanks in advance...
> P.S. sorry for the verbose message...but was hoping to give as complete a problem/solution statement as I can