Methods Processing Twitter — Snurb, 2 August 2010
Using Gawk and Wget to Resolve URL Shorteners

Jean’s post today points to a key problem in examining user activities on Twitter and elsewhere – people are increasingly using bit.ly and other URL shorteners, which means that a) the same target URL might appear in any number of different shortened versions, and b) it’s no longer possible from a quick look at a list of URLs to select only those which are from a specific site (for example, YouTube videos).

For our purposes, that’s a significant problem – we might want to find out, for example, which were the most popular videos shared during the election campaign, the most popular articles on abc.net.au, and so on. So, we need to resolve those shortened URLs back to their original state. This could be done through the APIs of the various shortening services, of course, but with literally hundreds of different shorteners now available, that would probably require specific unshortening scripts for each service – far too much work. So, what can we do?

Once again, our favourite CSV processing tool, Gawk, can help. In fact, Gawk has some in-built networking functions, and there’s even a standard script for retrieving Web pages that might form the basis for a URL resolver tool – but unfortunately, the Windows version of Gawk doesn’t implement that networking functionality. But there’s a workaround using another open-source tool, Wget: we can call it from a Gawk script to do the URL retrieval work and check for any URL redirections in the process. (Like Gawk, Wget is available in versions for all major operating systems.)
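For those who haven’t driven external tools from Gawk before: the scripts below rely on two standard mechanisms, the system() function to run a command, and the ‘command | getline’ idiom to read a command’s output back into the script. Here’s a minimal sketch of both, using a placeholder URL – this is just an illustration, not one of the actual scripts from this post:

# sketch only: the two ways the scripts below call external tools
BEGIN {
	# run a command and ignore its output here (urlget.awk does this with wget)
	system("wget -t 1 -o urltemp.tmp --spider http://example.com/")

	# run a command and capture its output (urlresolve.awk does this with parseurltemp.awk)
	cmd = "gawk -f parseurltemp.awk urltemp.tmp"
	cmd | getline result
	close(cmd)
	print result
}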

But one step at a time. In the first place, here’s linkextract.awk, a Gawk script to extract the URLs from a list of tweets (for example, a Twapperkeeper archive). Where there are multiple URLs in the same tweet, the script will create multiple lines; its output is a new CSV file which contains the URL in the first column, followed by all the original information.

      
# linkextract.awk - extract hyperlinks from Twapperkeeper export data
#
# this script takes a CSV archive of tweets in Twapperkeeper format
#
# expected data format:
# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
#
# output format:
# link, [original fields]
#
# for tweets containing multiple links, the script adds new rows for each link
#
# Released under Creative Commons (BY, NC, SA) by Jean Burgess and Axel Bruns - je.burgess@qut.edu.au / a.bruns@qut.edu.au

BEGIN {
	getline header
	print "url," header
	IGNORECASE = 1
}

$0 ~ /http([A-Za-z0-9[:punct:]]+)/ {

	a = 0
	do {
		# find the next http... link in the tweet text (field 1), starting at offset a
		match(substr($1, a), /http([A-Za-z0-9[:punct:]]+)?/, atArray)
		a = a + atArray[0, "start"] + atArray[0, "length"]

		# one output row per link: the link itself, followed by the full original record
		if (atArray[0] != 0) print atArray[0] "," $0

	} while (atArray[0, "start"] != 0)

}
        

Normally, you’d want to invoke this script using a command line like gawk -F , -f linkextract.awk input.csv >output.csv.
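To illustrate with a made-up example (both the tweet and the short URLs here are placeholders, and the trailing columns are abbreviated), a tweet containing two links turns into two rows in output.csv, each led by one of the extracted URLs:

url,text,to_user_id,from_user,id,from_user_id,...
http://bit.ly/abc123,Two good reads on #ausvotes http://bit.ly/abc123 http://youtu.be/xyz789,,exampleuser,123456,...
http://youtu.be/xyz789,Two good reads on #ausvotes http://bit.ly/abc123 http://youtu.be/xyz789,,exampleuser,123456,...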

So far, so good. Next, we need a script that resolves the links in the new ‘url’ column of the CSV to their full URLs (and leaves intact those URLs which don’t use URL shorteners, of course). For this, we need three interlocking scripts. First, urlget.awk: this simply takes a command-line input and calls up Wget (with the --spider option, to avoid having to retrieve the whole page), and sends the Wget output to a temporary file called urltemp.tmp in the current directory.

      
# urlget.awk - use wget to spider a given URL (passed to the script as argument)
#
# outputs wget result as file 'urltemp.tmp' to current directory
#
# released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au


BEGIN {

	# -t 1: try each URL only once; -o: write wget's log output to urltemp.tmp;
	# --spider: just check that the page exists, don't download it
	system("wget -t 1 -o urltemp.tmp --spider " url)

}


        

 

You could call this up using gawk -f urlget.awk -v url=http://abc.net.au/ to see what it does – open the resulting ‘urltemp.tmp’ file to see the Wget output.
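For a shortened URL, the contents of urltemp.tmp will look roughly like this (abridged, and with placeholder URLs – the exact wording varies a little between Wget versions):

Spider mode enabled. Check if remote file exists.
--2010-08-02 14:41:47--  http://bit.ly/abc123
[...]
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://www.example.com/some/page.html [following]
[...]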

The second script we need is parseurltemp.awk, which takes the Wget output and extracts from it what, for our purposes, is the relevant information. When called with the --spider option, Wget doesn’t retrieve the whole page at the URL, but instead simply tests that it is there and generates some header information. From this, we want to extract the original source URL (on line 2 of Wget’s output, following a retrieval timestamp), and the URL of the target page if it differs from the source. If it is different, the target appears on a line that starts with the word “Location:” – so we know what to look for here, too.

UPDATE: Looks like there are cases where Wget produces output with multiple ‘Location:’ lines. I’ve now changed parseurltemp.awk to stop after the first one, which seems to contain what we’re after…

      
# parseurltemp.awk - extract redirection source and target URLs from wget spider output
#
# parses standard wget --spider output
# extracts source URL from second line: e.g. --2010-08-02 14:41:47--  http://smh.com.au/
# extracts target URL from 'Location: ' statement
# sets target = source if no target is found (i.e. no redirection present)
#
# output: source,target
#
# released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

{
	if((NR == 2) && (match($0,"--  "))) {
		source = substr($0, 26, length($0) - 25)	# URL starts after the 25-character "--YYYY-MM-DD HH:MM:SS--  " prefix
		target = source				# in case no target location is found
	}

	if(match($0, "Location")) {
		target = substr($0, 11, length($0) - 22)	# strip the "Location: " prefix and the " [following]" suffix
		exit					# stop after the first Location: line (see update above)
	}

}

END {
	print source "," target
}

        

 

So, parseurltemp.awk simply checks the file it is fed for those two lines, and outputs them in a simple source,target CSV format. If no redirection from the source URL is detected, target is set to the same value as source. Again, this isn’t a script we want to call manually – so, to join all of this together, we need a third script that invokes urlget.awk and parseurltemp.awk for each line of a CSV file containing the URLs we want to resolve: urlresolve.awk.
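(Run by hand on a urltemp.tmp like the sample above – gawk -f parseurltemp.awk urltemp.tmp – the parser would simply print a single source,target line of this form, with placeholder URLs again:

http://bit.ly/abc123,http://www.example.com/some/page.html

That is the line urlresolve.awk picks up for each URL in turn.)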

      
# urlresolve.awk - resolves URL shortener URLs back to full URLs
#
# script requires:
# - wget to be sitting in the command path
# - path variable to be set to the common directory for the AWK scripts
# - urlget.awk and parseurltemp.awk to be in that script path directory
# 
# script will create a temporary file 'urltemp.tmp' in the current directory (for the wget output)
#
# released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
	path = "..\\_scripts\\"		# point this to AWK script directory
	getline header
	print "longurl," header
}


{

	original = $0					# keep the full input line for output

	# spider the short URL in the first column; wget's log output goes to urltemp.tmp
	system("gawk -F , -v url=" $1 " -f " path "urlget.awk")

	# read the parser's "source,target" line; as 'resolved' is unset, $resolved is $0,
	# so getline re-splits the fields and $2 now holds the resolved target URL
	"gawk -F , -f " path "parseurltemp.awk urltemp.tmp" | getline $resolved
	close("gawk -F , -f " path "parseurltemp.awk urltemp.tmp")

	print $2 "," original

}
        

 

This script, then, takes any CSV file which contains simple URLs in its first column; for each line of the CSV file, it first invokes urlget.awk to spider the source URL and generate the urltemp.tmp file, and then parseurltemp.awk to parse the temp file and identify the target of any URL redirection that may be in place. It adds the resolved target URLs to the CSV file as a new first column, called ‘longurl’ (with the resolved URL set to the original URL if no redirection was detected).

Usually, this would then be invoked using something like gawk -F , -f urlresolve.awk output.csv >resolved.csv (using the output.csv file created by the linkextract.awk script above). Depending on your own setup, you’ll also need to change the ‘path’ variable in the BEGIN statement of urlresolve.awk – this should point to where all your gawk scripts are located.
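Using the made-up rows from above, the resolved output would then look something like this (the resolved targets are placeholders too, and the trailing columns are again abbreviated):

longurl,url,text,to_user_id,from_user,id,from_user_id,...
http://www.example.com/some/page.html,http://bit.ly/abc123,Two good reads on #ausvotes http://bit.ly/abc123 http://youtu.be/xyz789,,exampleuser,123456,...
http://www.youtube.com/watch?v=xyz789,http://youtu.be/xyz789,Two good reads on #ausvotes http://bit.ly/abc123 http://youtu.be/xyz789,,exampleuser,123456,...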

So there it is – a relatively lightweight Gawk/Wget solution for resolving shortened URLs back to their originals. Plenty of room for improvement, no doubt – and we’d love to hear from anyone who has done any further work on these scripts. One small issue we’ve identified already: there’s a handful of sites which don’t report back full ‘Location’ URLs, but only relative ones (i.e. ‘/home/index.html’ instead of ‘http://example.com/home/index.html’). This doesn’t seem to be a particularly common problem, so I’m reluctant to spend much time adding a special fix to the scripts, but it does mean some manual tidying of the output may be necessary. UPDATE: My revised parseurltemp.awk seems to deal with those problems, too.
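For what it’s worth, one possible way to handle such relative redirects – a sketch only, not the revised script – would be to check inside the ‘Location’ branch of parseurltemp.awk whether the target starts with a slash, and if so prepend the scheme and hostname of the source URL:

	if (match($0, "Location")) {
		target = substr($0, 11, length($0) - 22)
		if (substr(target, 1, 1) == "/") {		# relative redirect, e.g. /home/index.html
			match(source, "^https?://[^/]+", host)	# scheme and hostname of the source URL
			target = host[0] target
		}
		exit
	}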

Finally, a word of warning: with large input CSVs, this URL resolution process might take considerable time, as each URL needs to be spidered individually – right now, there’s no caching of Wget results built in here (in the time it’s taken me to write this post, my script has managed to ping about 3000 URLs, for what it’s worth). Also, the rapid-fire querying of URLs might upset some ISPs – so only try this at home if you know what you’re doing, folks…
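On the caching point: a simple way to avoid spidering the same short URL twice would be to keep the resolved targets in an associative array inside urlresolve.awk’s main block – again only a sketch along the same lines as the scripts above, not something we’ve tested here:

{
	original = $0

	if (!($1 in cache)) {					# only spider short URLs we haven't seen before
		system("gawk -F , -v url=" $1 " -f " path "urlget.awk")
		cmd = "gawk -F , -f " path "parseurltemp.awk urltemp.tmp"
		cmd | getline line
		close(cmd)
		split(line, parts, ",")
		cache[$1] = parts[2]				# remember the resolved target for this URL
	}

	print cache[$1] "," original
}

For archives where the same bit.ly links are retweeted many times, this alone should cut down the number of wget calls considerably.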

About the Author

Dr Axel Bruns leads the QUT Social Media Research Group. He is an Associate Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia. Bruns is the author of Blogs, Wikipedia, Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative Online News Production (2005), and a co-editor of A Companion to New Media Dynamics and Uses of Blogs (2006). He is a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation. His research Website is at snurb.info, and he tweets as @snurb_dot_info.


(12) Readers' Comments

  1. Pingback: Twitted by JeanBurgess

  2. Pingback: Mapping Online Publics » Blog Archive » Top 20 election-related YouTube videos

  3. Pingback: Mapping Online Publics » Blog Archive » Twitter’s Reaction to #twitdef – Part 2

  4. Pingback: Mapping Online Publics » Blog Archive » Media use in the #qldfloods

  5. Pingback: Mapping Online Publics » Blog Archive » Extracting images from Twapperkeeper archives

  6. Hi,
    I’m trying to bring this post back to life!

    First of all, thanks for the post and the scripts. I’ve tried them now on a small dataset, and they seem to work just fine. I basically have just one small question: the script that resolves shortened URLs in some cases returned another shortened URL. Specifically, t.co got returned as bit.ly – in as many as approximately one third of the lines. I then ran the resolve script once more on the output file from the first run, which seemed to do the trick: the bit.ly’s are now all resolved and show as full URLs.

    Have you experienced something similar? Are there, you think, any problems with running the resolve script once more on the output file from the first round?

    Thanks!

    • Hi Hallvard,

      Hmm, that’s interesting. t.co is Twitter’s new mandatory URL shortener – all URLs shared via Twitter are (transparently) turned into t.co URLs now, even if they’re already shortened, like bit.ly (which gives Twitter an amazing opportunity to track what’s shared and what people click on, by the way!).

      Normally, wget (which is what those scripts use to resolve URLs) should work its way through to the final unshortened destination URL (which it clearly manages to do on your second pass, anyway) – perhaps there’s a time limit for further redirects that applies for wget, though, which could be why it terminates prematurely with a still-shortened URL (I seem to remember bit.ly is known to be a bit slow in returning resolved URLs, so that would make sense).

      Running the script a second time over the output of the first pass will be fine; it might be a bit slow if it tries to resolve all those already resolved URLs again, though, so to speed things up you might want to copy only those unresolved lines into a new CSV first…

      Axel

  7. Hi,
    Thanks for the reply! The data sets I’ve dealt with so far are new, and, yes the links all have the t.co-form, so your explanation makes sense. As long as a second run solves them, and does not mess up anything else, it’s all good.
    Hallvard

  8. Pingback: Mapping Online Publics » Blog Archive » Resolving Short URLs: A New Approach

  9. What about this script? It uses internal redirection of gawk (|&). BTW: note that the |& in the wget-command redirects the stderr.

    awk '{
    for (x=1;x<=NF;x++)
    if (substr($x,1,4)=="http")
    {
    tinyurl=tinyurl" "$x
    command="wget -O /dev/null "$x" |& grep Location | tail -1 "
    print |& command
    command |& getline url
    split(url,u)
    printf("%s ", u[2])
    close(command)
    }
    else printf("%s ",$x);

    printf("\n");
    }' text_with_tiny_urls

    • Hi Hans,

      thanks for this. Yes, in principle that should work fine as well. It does need grep and tail to be available as command-line tools, though (which they aren’t by default on a PC – and probably not on a Mac, either?), and it assumes that things like /dev/null are available as well. So, you’ll run into trouble with the portability of this approach across different environments.

      My new approach (using gawk and curl as the only non-standard tools – see http://www.mappingonlinepublics.net/2012/02/10/resolving-short-urls-a-new-approach/) seems easier than this – especially because on PC/Mac machines you only need those two tools, and not grep and tail as well…

      Axel

  10. @Snurb: yes, but I am a citizen of the Linux/Unix galaxy where grep and tail are indigenous :-)

    In any case I think that my approach causes fewer and lighter processes to be started, and is therefore faster, but I may be mistaken.

    paai
