[Bro] File Extraction Related Scripting Questions

Jason Batchelor jxbatchelor at gmail.com
Thu Sep 18 14:26:25 PDT 2014


Thanks Seth,

Is it expected for Bro to detect specific Office documents? Older versions
of Office documents (2003 and back I believe) had easily identifiable file
magic that one could use to at least inform the observation that the file
is an MS office document (ex: D0 CF 11 E0 A1 B1 1A E1).
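For illustration, a minimal sketch of that legacy magic check, written in Python rather than Bro script. The function name `looks_like_ole2` is hypothetical. Note that this signature identifies the OLE2 compound file container in general, so a match suggests, but does not prove, an Office document.

```python
# Magic bytes of the OLE2 compound file container used by legacy Office formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole2(data: bytes) -> bool:
    """Return True if the buffer starts with the OLE2 container signature."""
    return data.startswith(OLE2_MAGIC)
```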

Now, however, with 'Office Open XML', the initial file magic for modern
Office documents is equivalent to that of a ZIP file at face value.
However, if properly instrumented, you can delineate between a regular ZIP
file and an Office document, down to being able to state if one is a pptx,
docx, etc.
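As a rough sketch of what I mean by that, here is one way to do it in Python (not Bro script) once the whole file is available. It assumes the conventional part names that Office itself writes (`word/document.xml`, `xl/workbook.xml`, `ppt/presentation.xml`) and treats `META-INF/MANIFEST.MF` as a hint for a JAR; the function name `classify_zip` and the returned labels are made up for illustration. Because it reads the ZIP central directory, it only works on a fully extracted file, not on the first bytes of a stream.

```python
import io
import zipfile


def classify_zip(data: bytes) -> str:
    """Return a coarse label for a ZIP container based on its member names."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return "not-zip"

    # Office Open XML packages carry a content-types part at the root.
    if "[Content_Types].xml" in names:
        if "word/document.xml" in names:
            return "docx"
        if "xl/workbook.xml" in names:
            return "xlsx"
        if "ppt/presentation.xml" in names:
            return "pptx"
        return "ooxml-other"

    # Assumption: a manifest at this path is treated as a JAR hint.
    if "META-INF/MANIFEST.MF" in names:
        return "jar"

    return "zip"
```

A stricter check would parse `[Content_Types].xml` instead of trusting the part names, since a package is free to name its main part differently.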

When extracting files off the wire that are what I believe to be modern MS
Office documents, Bro tends to classify them as ZIP files via the MIME
type. This is technically correct, however there are higher fidelity
attributes that may be absent, such as the fact that the file is an Office
document for Word, Powerpoint, etc.

I did a little playing around to see if perhaps Bro was simply claiming the
'ZIP' file magic identification was the strongest, with the other, more
application-centric file magic identifiers buried in the f$mime_types
vector (of type mime_matches) as defined here:

https://www.bro.org/sphinx-git/scripts/base/init-bare.bro.html#type-mime_matches

After running it, it doesn't seem that Bro is identifying this. Below is the
snippet of code I am using to attempt this:

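   # Only consider files with exactly two MIME matches, the strongest being ZIP.
   if ( f?$mime_types && |f$mime_types| == 2 )
   {
      if ( f$mime_types[0]$mime == "application/zip" )
      {
         # Guard the lookup: indexing a table with a missing key is a runtime error.
         if ( f$mime_types[1]$mime in ext_map_zip_subset )
            ext = ext_map_zip_subset[f$mime_types[1]$mime];
      }
   }

The subset of MIME types I have defined to map to the specific MS Office
formats is as follows: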

application/vnd.openxmlformats-officedocument.wordprocessingml.document
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
application/vnd.openxmlformats-officedocument.presentationml.presentation

I am also curious about JAR files; those seem to fold into the broader ZIP
file magic but might be detected with something a little more specific.

The magic/type list posted here (under 'zip') perhaps better illustrates
the issue:
http://www.garykessler.net/library/file_sigs.html

Hope that helps explain better,
Jason





On Thu, Sep 18, 2014 at 2:47 PM, Seth Hall <seth at icir.org> wrote:

>
> On Sep 18, 2014, at 1:12 PM, Jason Batchelor <jxbatchelor at gmail.com>
> wrote:
>
> > FWIW - I managed to cobble together the following poc once I stumbled
> across 'exec' :)
>
> Yep, that's probably the correct thing to do for now.
>
> > The drawback is this required the use of 'when' which requires me to
> wait a little bit before I can utilize the returned result.
>
> Since Bro needs to keep running in a non-blocking manner all the time,
> basically any solution you aim for will be using when since looking at the
> file system is almost intrinsically a blocking operation.
>
> What I would recommend is that you have a scheduled event that regularly
> checks the size of the directory and modifies a global value to let you
> know if you're safe to extract or not.  That will combine the benefit of
> the asynchronous operation with the benefit of being able to check in an if
> statement if your extraction directory is overly full.
>
> > It also seems that if I place an 'extract' analyzer inside the if
> statement when a file can be written, I get the error 'field value missing
> [dir_size$stdout]'. This probably relates to a timing issue on the part of
> the issued command I am guessing?
>
> I'm not sure why you're seeing that problem, that seems weird.  However, I
> wouldn't expect that to work generally because once the when statement
> returns could be after quite a bit of the file has already transferred.
>
> >  Also interested as well in the MIME type question with respect to
> Office documents.
>
> Yeah, what would help a lot there is for someone to pull together files
> that they don't feel are being detected with accurate mime types and to
> provide those files or links to files on the internet that don't get
> detected accurately.
>
>   .Seth
>
> --
> Seth Hall
> International Computer Science Institute
> (Bro) because everyone has a network
> http://www.bro.org/
>
>