[Bro] Email Link Extraction

James Lay jlay at slave-tothe-box.net
Thu Apr 18 12:13:18 PDT 2013


On 2013-04-18 12:27, Seth Hall wrote:
> On Apr 18, 2013, at 11:31 AM, James Lay <jlay at slave-tothe-box.net> wrote:
>
>> Yeah, I'll second that. Email packet captures make finding links a
>> challenge, as quoted emails split the link.
>
> This is far from perfect, for the reason you pointed out, but it's a
> start. This code snippet is from the next release of Bro (you just
> call find_all_urls_without_scheme with the string that you want to
> extract URLs from):
>
>
> const url_regex = /^([a-zA-Z\-]{3,5})(:\/\/[^\/?#"'\r\n><]*)([^?#"'\r\n><]*)([^[:blank:]\r\n"'><]*|\??[^"'\r\n><]*)/ &redef;
>
> ## Extracts URLs discovered in arbitrary text.
> function find_all_urls(s: string): string_set
> 	{
> 	return find_all(s, url_regex);
> 	}
>
> ## Extracts URLs discovered in arbitrary text without
> ## the URL scheme included.
> function find_all_urls_without_scheme(s: string): string_set
> 	{
> 	local urls = find_all_urls(s);
> 	local return_urls: set[string] = set();
> 	for ( url in urls )
> 		{
> 		local no_scheme = sub(url, /^([a-zA-Z\-]{3,5})(:\/\/)/, "");
> 		add return_urls[no_scheme];
> 		}
>
> 	return return_urls;
> 	}
>
>
>
>
>   .Seth
>
> --
> Seth Hall
> International Computer Science Institute
> (Bro) because everyone has a network
> http://www.bro.org/

Thanks, Seth. As I'm still horrifically new to Bro, I'm guessing the 
above can go in local.bro? Thank you.
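
For reference, a minimal sketch of how the quoted function might be called once it is in a site script such as local.bro. It assumes the url_regex constant and both functions from Seth's snippet have been pasted earlier in the same file; the sample string and the choice of bro_init as the place to call it are illustrative assumptions, not something from the thread. In practice the string would come from an event handler that sees the mail body rather than from a literal.

```bro
# Sketch only: assumes url_regex, find_all_urls and
# find_all_urls_without_scheme (quoted above) are defined earlier
# in this same file.
event bro_init()
	{
	# Hypothetical sample text, not taken from a real capture.
	local body = "Details at http://www.bro.org/ and https://example.com/path?x=1 today";

	# find_all_urls_without_scheme returns a set of strings, so
	# iteration order is not guaranteed.
	for ( u in find_all_urls_without_scheme(body) )
		print u;
	}
```

Because the result is a set, duplicate URLs in the input collapse to a single entry.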

James


