[Bro] Email Link Extraction

Seth Hall seth at icir.org
Thu Apr 18 11:27:43 PDT 2013


On Apr 18, 2013, at 11:31 AM, James Lay <jlay at slave-tothe-box.net> wrote:

> Yea I'll second that...email packet captures make finding links a 
> challenge as quoted emails split the link

This is far from perfect, for the reason you pointed out, but it's a start. The snippet below is from the next release of Bro; just call find_all_urls_without_scheme with the string you want to extract URLs from:


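## Pattern for URLs in text: a 3-5 character scheme (letters or hyphens),
## "://" plus the host, then the path, then any query string or fragment.
const url_regex = /^([a-zA-Z\-]{3,5})(:\/\/[^\/?#"'\r\n><]*)([^?#"'\r\n><]*)([^[:blank:]\r\n"'><]*|\??[^"'\r\n><]*)/ &redef;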

## Extracts URLs discovered in arbitrary text.
function find_all_urls(s: string): string_set
	{
	return find_all(s, url_regex);
	}

## Extracts URLs discovered in arbitrary text without
## the URL scheme included.
function find_all_urls_without_scheme(s: string): string_set
	{
	local urls = find_all_urls(s);
	local return_urls: set[string] = set();
	for ( url in urls )
		{
		# Strip the leading "scheme://" so only host, path and query remain.
		local no_scheme = sub(url, /^([a-zA-Z\-]{3,5})(:\/\/)/, "");
		add return_urls[no_scheme];
		}

	return return_urls;
	}
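
A rough usage sketch, assuming the two functions above are defined in the same script. The sample text and the choice of bro_init as the place to call them are made up for illustration; in practice you would pass in whatever string you pulled out of the email body:

```bro
event bro_init()
	{
	# Hypothetical sample text; the angle brackets delimit the URL,
	# since the pattern stops at '<' and '>'.
	local text = "Download: <http://www.bro.org/download.html>";

	# Returns a set of strings with the "scheme://" prefix removed.
	local urls = find_all_urls_without_scheme(text);

	for ( u in urls )
		print u;
	}
```

Keep in mind that the pattern allows blanks in the path part, so a URL followed by plain text on the same line can pick up trailing words. Delimited URLs (quotes, angle brackets, end of line) come out cleanest.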




  .Seth

--
Seth Hall
International Computer Science Institute
(Bro) because everyone has a network
http://www.bro.org/




