[Bro] File Extraction Related Scripting Questions

Fri Sep 19 08:57:49 PDT 2014

Seth and group:

As a tangential (but related) topic, it appears that the MIME type
'application/msword' is being used to classify all legacy MS Office
file types (doc, ppt, xls, etc.), instead of just Office documents with
the .doc extension. From what I understand, 'application/msword' is
specifically the MIME type for 'doc' files and is not meant to describe
any or all Office documents.

The following table supports this notion: as you can see, the different
extensions within Office each have their own respective MIME type.

http://filext.com/faq/office_mime_types.php

In testing this, I had a script that pulled files off the wire that matched
the 'msword' MIME type, let it run for a while, then reviewed the results.
I ended up with a diverse collection of Office documents (doc/xls).

This may be intentional, since all OLECF files have the same magic (D0 CF 11
E0 A1 B1 1A E1). Is this the case? Would it be more appropriate/clear to
have a MIME type such as 'application/ole'? Additionally, if you look 512
bytes in, you can often determine the type of file for older Office
documents. Is this an opportunity to create clearer, more specific file
type signatures?
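
To make the 512-byte idea concrete, here is a rough sketch in Python (outside
of Bro, purely for illustration). The function name and the 'ole' fallback
label are my own. The subheader bytes are assumptions taken from public
signature lists, and they only match when the relevant stream happens to
start in the first 512-byte sector after the header, so treat them as
heuristics rather than a definitive signature set.

```python
from typing import Optional

# Compound file (OLECF) signature shared by legacy doc/xls/ppt files.
OLECF_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# ASSUMPTION: heuristic subheaders sometimes seen at offset 512, taken from
# public signature lists. They are not guaranteed to be present.
SUBHEADERS_AT_512 = [
    (bytes.fromhex("ECA5C100"), "doc"),          # Word document stream
    (bytes.fromhex("0908100000060500"), "xls"),  # Excel BIFF8 workbook
    (bytes.fromhex("0F00E803"), "ppt"),          # PowerPoint document record
]


def classify_olecf(data: bytes) -> Optional[str]:
    """Return 'doc', 'xls', 'ppt', 'ole' (generic), or None if not OLECF."""
    if not data.startswith(OLECF_MAGIC):
        return None
    for magic, ext in SUBHEADERS_AT_512:
        if data[512:512 + len(magic)] == magic:
            return ext
    # OLECF container, but no known subheader at offset 512.
    return "ole"
```

The generic 'ole' result is the case I would expect to map to something like
'application/ole', with the more specific results layered on top when the
subheader is present.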

I am certainly not an authority on this matter, but would appreciate any
insight into the topic as it will help drive the direction of a solution I
am developing.

Thanks,
Jason

On Thu, Sep 18, 2014 at 4:26 PM, Jason Batchelor <jxbatchelor at gmail.com>
wrote:

> Thanks Seth,
>
> Is it expected for Bro to detect specific Office documents? Older versions
> of Office documents (2003 and earlier, I believe) had easily identifiable
> file magic that one could use to at least establish that the file is an
> MS Office document (e.g. D0 CF 11 E0 A1 B1 1A E1).
>
> Now, however, with 'Office Open XML', the initial file magic for modern
> Office documents is equivalent to that of a ZIP file at face value.
> However, with the right instrumentation you can delineate between a
> regular ZIP file and an Office document, down to being able to state
> whether one is a pptx, docx, etc.
>
> When extracting files off the wire that are what I believe to be modern MS
> Office documents, Bro tends to classify them as ZIP files via the MIME
> type. This is technically correct, however there are higher fidelity
> attributes that may be absent, such as the fact that the file is an Office
> document for Word, Powerpoint, etc.
>
> I did a little playing around to see if perhaps Bro was simply reporting
> the 'ZIP' magic match as the strongest, with the other more
> application-specific matches buried in the f$mime_types vector (of type
> mime_matches), as defined here:
>
>
> https://www.bro.org/sphinx-git/scripts/base/init-bare.bro.html#type-mime_matches
>
> After running it, it doesn't seem that Bro is identifying these. Below is
> the snippet of code I am using to attempt this:
>
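>    # Look for a more specific second match behind a primary ZIP match.
>    if ( f?$mime_types && |f$mime_types| == 2 )
>    {
>       # Check membership first; indexing a table with a missing key
>       # is a runtime error.
>       if ( f$mime_types[0]$mime == "application/zip" &&
>            f$mime_types[1]$mime in ext_map_zip_subset )
>       {
>          ext = ext_map_zip_subset[f$mime_types[1]$mime];
>       }
>    }
>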
> The subset mime types I have defined to map the specific versions of MS
> Office are as follows:
>
> application/vnd.openxmlformats-officedocument.wordprocessingml.document
> application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
> application/vnd.openxmlformats-officedocument.presentationml.presentation
>
> I'm also curious about JAR files; those seem to fold into the broader ZIP
> file magic but might be detected with something a little more specific.
>
> The magic/type list posted here (under 'zip') perhaps better illustrates
> the issue:
> http://www.garykessler.net/library/file_sigs.html
>
> Hope that helps explain better,
> Jason
>
> On Thu, Sep 18, 2014 at 2:47 PM, Seth Hall <seth at icir.org> wrote:
>
>>
>> On Sep 18, 2014, at 1:12 PM, Jason Batchelor <jxbatchelor at gmail.com>
>> wrote:
>>
>> > FWIW - I managed to cobble together the following poc once I stumbled
>> > across 'exec' :)
>>
>> Yep, that's probably the correct thing to do for now.
>>
>> > The drawback is this required the use of 'when' which requires me to
>> > wait a little bit before I can utilize the returned result.
>>
>> Since Bro needs to keep running in a non-blocking manner all the time,
>> basically any solution you aim for will be using 'when', since looking at
>> the file system is almost intrinsically a blocking operation.
>>
>> What I would recommend is that you have a scheduled event that regularly
>> checks the size of the directory and modifies a global value to let you
>> know if you're safe to extract or not.  That will combine the benefit of
>> the asynchronous operation with the benefit of being able to check in an if
>> statement if your extraction directory is overly full.
>>
>> > It also seems that if I place an 'extract' analyzer inside the if
>> > statement when a file can be written, I get the error 'field value
>> > missing [dir_size$stdout]'. This probably relates to a timing issue on
>> > the part of the issued command I am guessing?
>>
>> I'm not sure why you're seeing that problem; that seems weird.  However,
>> I wouldn't expect that to work generally, because by the time the 'when'
>> statement returns, quite a bit of the file could already have been
>> transferred.
>>
>> > Also interested as well in the MIME type question with respect to
>> > Office documents.
>>
>> Yeah, what would help a lot there is for someone to pull together files
>> that they don't feel are being detected with accurate mime types and to
>> provide those files or links to files on the internet that don't get
>> detected accurately.
>>
>>   .Seth
>>
>> --
>> Seth Hall
>> International Computer Science Institute
>> (Bro) because everyone has a network
>> http://www.bro.org/
>>
>>
>