[Bro] File Extraction: doc/xls=ok, docx/xlsx=ko
Josh Liburdi
liburdi.joshua at gmail.com
Tue Feb 16 07:03:17 PST 2016
Justin's correct that the headers of new Office files are very similar
to those of Zip files, but they do differ slightly. If you look at the code
that identifies a file as
application/vnd.openxmlformats-officedocument.wordprocessingml.document,
it uses this regular expression:
/^PK\x03\x04.{26}(\[Content_Types\]\.xml|_rels\x2f\.rels|word\x2f).*PK\x03\x04.{26}word\x2f/
Here is the regular expression that identifies Zip files:
/^PK\x03\x04.{2}/
It's possible that your new Office files don't match the docx-specific expression above.
In that case, I'd suggest adding this mime_type to your table:
application/vnd.openxmlformats-officedocument
It uses the basic new Office file header as a regular expression:
/^PK\x03\x04\x14\x00\x06\x00/
If you use application/vnd.openxmlformats-officedocument, then you
can't assume that the file has a specific extension; all you know is
that the file is a new Office file.
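
As a rough sketch of what that could look like in the ext_map table from
the original script (only some of the original entries are repeated here,
and "ooxml" is a placeholder extension I made up for illustration, since
the generic type doesn't say whether the file is a docx, xlsx, or pptx):

```bro
# Sketch only: adds the generic new Office MIME type next to the
# specific ones. "ooxml" is a hypothetical extension, not something
# Bro defines; pick whatever suffix suits your workflow.
global ext_map: table[string] of string = {
    ["application/msword"] = "doc",
    ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = "docx",
    ["application/vnd.openxmlformats-officedocument"] = "ooxml"
} &default="";
```

If the extension check in file_new compares against a fixed list of
extensions, the placeholder would need to be allowed there as well.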
Hope that clears up some confusion (and doesn't cause more)!
Josh
On Tue, Feb 16, 2016 at 9:01 AM, Azoff, Justin S <jazoff at illinois.edu> wrote:
>
>> On Feb 16, 2016, at 6:29 AM, puntogtg at tiscali.it wrote:
>>
>> Hello,
>> I am trying to find out if I did some mistake in my extract.bro script.
>> Basically I am able to extract the doc,pdf,xls but not the docx,xlsx,xlsm etc (all new office files).
>
> I believe the problem is that as far as bro is concerned, new office files are really .zip archives.
>
>> Script looks like this:
>>
>> global ext_map: table[string] of string = {
>> ["application/msword"] = "doc",
>> ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"] = "docx",
>> ["application/vnd.openxmlformats-officedocument.wordprocessingml.template"] = "dotx",
>>
> [..]
>> } &default ="";
>>
>>
>> event file_new(f: fa_file)
>> {
>> if ( ! f?$mime_type )
>> return;
>> local ext = "";
>> if ( f?$mime_type )
>> ext = ext_map[f$mime_type];
>> #if ( ext !="pdf" && ext !="exe" && ext !="swf" )
>> if ( ext !="doc" && ext !="docx" && ext !="dotx" && ext !="docm" && ext !="dotm" && ext !="xls" && ext !="xlsx" && ext !="xltx" && ext !="xlsm" && ext !="xltm" && ext !="xlam" && ext !="xlsb" && ext !="ppt" && ext !="pptx" && ext !="potx" && ext !="ppsx" && ext !="ppam" && ext !="pptm" && ext !="potm" && ext !="ppsm" )
>> return;
>>
>> local fname = fmt("/bro/extracted/%s-%s.%s", f$source, f$id, ext);
>> Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
>> break;
>> }
>
>
> Aside from the issue that docx shows up as a zip file, here is a fixed up version of that file_new event:
>
> event file_new(f: fa_file)
> {
> if (!f?$mime_type)
> return;
>
> if (f$mime_type !in ext_map)
> return;
>
> local ext = ext_map[f$mime_type];
> local fname = fmt("/bro/extracted/%s-%s.%s", f$source, f$id, ext);
> Files::add_analyzer(f, Files::ANALYZER_EXTRACT, [$extract_filename=fname]);
> }
>
>
>
> --
> - Justin Azoff
>
>
>
> _______________________________________________
> Bro mailing list
> bro at bro-ids.org
> http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/bro