[Zeek] Some issues with find_all_urls() function

Jon Siwek jsiwek at corelight.com
Tue Aug 13 09:48:44 PDT 2019


On Tue, Aug 13, 2019 at 7:00 AM Jonah Burgess <jburgess03 at qub.ac.uk> wrote:

> 1) When trying to resolve some of these issues, should I directly modify urls.zeek or will this have unintended consequences regarding other scripts/functionality in Zeek? The reason I ask is that when printing URLs extracted with the find_all_urls() function I get some results which are clearly not valid URLs, e.g. "http://www.yootheme.com/license) */" - this should have been cut off before the ")", which I believe is a bug in urls.zeek rather than intended functionality that I'd simply like to change.

Since the types of changes you'd be making are purely bug fixes, it
would be great if you fork Zeek on GitHub, modify urls.zeek directly
with your fixes, and then submit a Pull Request so everyone can
benefit from the improvements.

> 2) Assuming I don't manage to fix all of these errors and choose to accept some, how can I stop them from printing to console each time I process a PCAP?

You could do:

    redef Reporter::errors_to_stderr=F;

(But these errors are symptoms of real bugs that we'd want to
eventually fix in urls.zeek)

> 3) While trying to fix some of these errors with regex, I ran into the example "www.iec.ch\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16IEC http". I've tried to strip everything after the first "\" but this doesn't work because it's a hex escape (I guess) rather than an actual "\". Any ideas for this specific case?

"Strip everything after first non-printable character" could look like:

    print gsub(input_string, /[^[:print:]].*$/, "");
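As a rough sketch of how that might be applied to a string like the
one in the question (the sample value and the variable names here are
only for illustration, and \x01 is used as a stand-in for the NUL
bytes in the original example):

    event zeek_init()
        {
        # Illustrative value modeled on the string from the question;
        # \x01 stands in for the original \x00 bytes.
        local input_string = "www.iec.ch\x01\x01\x16IEC http";

        # Remove the first non-printable byte and everything after it.
        local cleaned = gsub(input_string, /[^[:print:]].*$/, "");

        print cleaned;
        }

The intent is that only the leading "www.iec.ch" part survives, since
the match starts at the first byte outside the [:print:] class and
runs to the end of the string.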

> 4) Finally, a regex-related question I've been meaning to ask for a while. Because I'm trying to extract URLs from HTML/JS, I need to deal with cases where whitespace and multiple types of quote character may be used. When I've written projects in Python, I would create a variable with all of the possible characters in it and then I would use this variable in the regex e.g.

The pattern conjunction/concatenation operator, &, may do what you
want, docs at:

    https://docs.zeek.org/en/latest/script-reference/types.html#type-pattern

The example shown there for combining patterns:

    /foo/ & /bar/ in "foobar"
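A minimal sketch of how that could stand in for the "variable inside a
regex" approach from Python, assuming the pieces are kept as separate
pattern values (all variable names and the test string below are made
up for illustration):

    event zeek_init()
        {
        # Hypothetical building blocks, kept in variables for reuse.
        # | builds an alternation: a double quote or a single quote.
        local quote = /\"/ | /'/;
        local spaces = /[[:space:]]*/;
        local scheme = /https?:\/\//;

        # & concatenates: a quote, optional whitespace, then a scheme.
        local url_start = quote & spaces & scheme;

        # "in" tests whether the pattern matches somewhere in the string.
        print url_start in "href=' http://example.com'";
        }

Each piece can then be changed in one place (say, to allow another
quote character) without rewriting the combined pattern.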

- Jon