[Bro] File Extraction: doc/xls=ok, docx/xlsx=ko

Josh Liburdi liburdi.joshua at gmail.com
Tue Feb 16 07:03:17 PST 2016


Justin's correct that the headers of new Office files are very similar
to those of Zip files, but they do differ slightly. If you look at the code
that identifies a file as
application/vnd.openxmlformats-officedocument.wordprocessingml.document,
it uses this regular expression:

/^PK\x03\x04.{26}(\[Content_Types\]\.xml|_rels\x2f\.rels|word\x2f).*PK\x03\x04.{26}word\x2f/

Here is the regular expression that identifies Zip files:

/^PK\x03\x04.{2}/

It's possible that your new Office files don't fit the more specific pattern.
In that case, I'd suggest adding this mime_type to your table:
application/vnd.openxmlformats-officedocument

It uses the basic new Office file header as a regular expression:

/^PK\x03\x04\x14\x00\x06\x00/

If you use application/vnd.openxmlformats-officedocument, then you
can't assume that the file has a specific extension-- all you know is
that the file is a new Office file.
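
As a rough sketch of what that could look like, here is the table from
the original script (only the entries shown in the thread) with the
generic type added. The "ooxml" extension is just a placeholder chosen
for this example, not something Bro defines, so pick whatever suffix
suits you:

```bro
# Sketch only: the reporter's table with one extra, generic entry.
global ext_map: table[string] of string = {
    ["application/msword"] = "doc",
    ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = "docx",
    # Generic new-Office match. The real type (docx, xlsx, pptx, ...) is
    # unknown here, so "ooxml" is a hypothetical catch-all extension.
    ["application/vnd.openxmlformats-officedocument"] = "ooxml"
} &default = "";
```

With a file_new handler that extracts every MIME type present in the
table, such as Justin's version quoted below, the generic entry should
need no other changes.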

Hope that clears up some confusion (and doesn't cause more)!

Josh

On Tue, Feb 16, 2016 at 9:01 AM, Azoff, Justin S <jazoff at illinois.edu> wrote:
>
>> On Feb 16, 2016, at 6:29 AM, puntogtg at tiscali.it wrote:
>>
>> Hello,
>> I am trying to find out if I made a mistake in my extract.bro script.
>> Basically I am able to extract doc, pdf and xls files, but not docx, xlsx, xlsm etc. (all new Office files).
>
> I believe the problem is that as far as Bro is concerned, new Office files are really .zip archives.
>
>> Script looks like this:
>>
>> global ext_map: table[string] of string = {
>>     ["application/msword"] = "doc",
>>     ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = "docx",
>>     ["application/vnd.openxmlformats-officedocument.wordprocessingml.template"] = "dotx",
>>
> [..]
>> } &default ="";
>>
>>
>> event file_new(f: fa_file)
>>    {
>>        if ( ! f?$mime_type  )
>>         return;
>>     local ext = "";
>>  if ( f?$mime_type )
>>         ext = ext_map[f$mime_type];
>>     #if ( ext !="pdf" && ext !="exe" && ext !="swf" )
>>     if ( ext !="doc" && ext !="docx" && ext !="dotx" && ext !="docm" && ext !="dotm" && ext !="xls" && ext !="xlsx" && ext !="xltx" && ext !="xlsm" && ext !="xltm" && ext !="xlam" && ext !="xlsb" && ext !="ppt" && ext !="pptx" && ext !="potx" && ext !="ppsx" && ext !="ppam" && ext !="pptm" && ext !="potm" && ext !="ppsm" )
>>     return;
>>
>>       local fname = fmt("/bro/extracted/%s-%s.%s", f$source, f$id, ext);
>>            Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
>>            break;
>> }
>
>
> Aside from the issue that docx shows up as a zip file, here is a fixed up version of that file_new event:
>
> event file_new(f: fa_file)
>    {
>     if (!f?$mime_type)
>         return;
>
>     if (f$mime_type !in ext_map)
>         return;
>
>     # Declare ext locally; it is not defined anywhere else in this handler.
>     local ext = ext_map[f$mime_type];
>     local fname = fmt("/bro/extracted/%s-%s.%s", f$source, f$id, ext);
>     Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
> }
>
>
>
> --
> - Justin Azoff
>
>
>
> _______________________________________________
> Bro mailing list
> bro at bro-ids.org
> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro


