Methods | Processing Tools | Twitter — Snurb, 10 February 2012
Resolving Short URLs: A New Approach

When working with Twitter data, one of the most interesting questions is always what URLs tweets are linking to. As Twitter users discuss any given topic or issue, the URLs they share provide us with an indication of the online media they’re drawing on for information and/or entertainment – and by counting which sites appear most frequently, we’re also able to measure the relative visibility or relevance of such sites.

But of course, there’s a complication: the vast majority of URLs in tweets have been shortened using a variety of URL shorteners, and multiple short URLs may point to the same eventual target; additionally, it’s even possible – and not too uncommon – for shortener nesting to occur: for example, a bit.ly short URL might subsequently be shortened by ow.ly, and finally by t.co, in the course of retweeting. Working with the short URLs themselves is less than useful, therefore – and we must find ways to resolve them to their eventual target.

I’ve never been entirely happy with the approach to resolving short URLs which I’ve come up with previously – this used Gawk (of course) and WGet as a somewhat clunky but generally functional solution. With each URL to be resolved triggering a separate WGet call, and generating temporary files along the way, the results were neither elegant nor particularly fast, though. Speed can only be improved up to a point, of course – the script will still have to ping each short URL to be resolved at least once, to see where it points, and that process is largely dependent on Internet access and server response speeds – but nonetheless there was plenty of room for optimisation in my previous attempts.

So, here’s a new approach to this problem. Instead of WGet, we’ll be using the command-line tool cURL here, which is able to work with batch lists of URLs to process, and can be made to send its resulting output to a single file which we can then process again. The one downside of cURL is that – unlike WGet – it only does a single URL resolution hop; this means that where there are nested URL shorteners, our script will need to be run at least twice to resolve the final remaining short URLs (but that’s a small price to pay for greater convenience).
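To see what a single resolution hop looks like, you can ask cURL for the HTTP headers of any short URL – for most shorteners, the target appears in the ‘Location:’ header, which cURL also exposes as the %{redirect_url} variable that the script below relies on. (The short URL here is just a hypothetical placeholder – substitute one from your own dataset.)

curl --head --insecure http://t.co/example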

Update: It’s worth noting here that Twitter has recently introduced its own URL shortener, t.co, as a mandatory shortener – even links which have already been shortened using bit.ly or other tools are now shortened again to t.co URLs. Recent yourTwapperkeeper archives will contain only t.co links, therefore. This also means that at least two passes of urlresolve.awk will be required to unshorten those URLs to their eventual destinations: one pass to remove the t.co shortening, and another to resolve any remaining short URLs. You might even want to run a third pass for good measure (later passes should run considerably faster as they’ll find far fewer URLs still to resolve).

Installing cURL

So, the first step to switching over is to install cURL itself, which is available for a wide variety of platforms here. If you’re expecting your Twitter data to contain https://… (secure http) URLs in addition to standard http://… links, make sure you install an SSL-capable version of cURL. On Windows XP, I’ve found the ‘Win32 – Generic’ cURL version by Dirk Paehl to work well for me (for https support, use the version ‘with SSL’, and you’ll also need to install the openssl library available from the same site). On Windows Vista or 7, the ‘Win64 – Generic’ version by Don Luchini is fine; it looks like you also need to install the Microsoft Visual C++ Redistributable package, though (details here). If you’re using a Mac, installation is a little more involved – a list of package options is here, and see Jean’s comment below for step-by-step instructions using MacPorts.

cURL (and the openssl libraries, if you need them) needs to be placed in the command path. Since – if you’ve been using any of our Gawk scripts at all so far – you’ve already installed Gawk, the easiest solution is to place curl.exe (and any openssl .dll files you may also need) in the same directory as Gawk itself. Most likely, this is C:\Program Files\GnuWin32\bin (on Windows XP) or C:\Program Files (x86)\GnuWin32\bin (on Windows Vista/7). To test whether cURL is installed and working, open a command window and try something like

curl --head --insecure https://google.com/

(change the https to http if you’ve installed a version of cURL which doesn’t do secure http). If your shell finds cURL, and cURL itself doesn’t complain that it’s missing a library somewhere, you’re ready to roll.
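If everything is in place, you should see a set of HTTP headers along these lines (the exact status code and headers will vary, but a status line followed by a ‘Location:’ header shows that cURL is reporting redirects correctly):

HTTP/1.1 301 Moved Permanently
Location: https://www.google.com/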

Resolving URLs

The first step in resolving URLs is to extract them from a Twapperkeeper/yourTwapperkeeper CSV/TSV dataset. The process for this remains exactly as before – the existing urlextract.awk script from our scripts package does this for us (and generates multiple lines in the resulting file if a tweet happens to contain multiple URLs). Simply run the script as follows:

gawk -F , -f urlextract.awk input.csv >output.csv

(I don’t have to remind you to use \t instead of , as the separator if you’re working with tab-separated files, do I?)
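For reference, with a tab-separated archive the same command becomes (quote the tab so that gawk interprets the escape correctly):

gawk -F "\t" -f urlextract.awk input.csv >output.csv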

Now, then, it’s time to unveil our new urlresolve.awk script, which replaces the previous solution:

# urlresolve.awk - resolves URL shortener URLs back to full URLs
#
# resolves shortened and other redirected URLs to their actual target
# the script preserves the first line, expecting that it contains header information
#
# run urlextract.awk before running this script over the resulting file
#
# input format: CSV/TSV generated by urlextract.awk, i.e.
# url, [Twapperkeeper format]
#
# output format:
# longurl, url, [Twapperkeeper format]
#
# script requires curl to be sitting in the command path
# use an SSL-capable version of curl if you need to resolve https URLs
#
# script takes an optional argument maxlength, specifying the maximum length of URLs to be resolved
# this is especially useful for running multiple passes over the dataset - after the first run,
# maxlength can be reduced to pick up only the remaining short URLs
#
# maxlength defaults to 30 characters (including http:// etc.) if it isn't specified
# 
# script will create temporary files in the current directory (for the curl output):
# - [filename]_urllist_temp (a list of all URLs in the input CSV/TSV file)
# - [filename]_urllist_temp_resolved (curl output from the URL resolution process)
#
# these files are not automatically deleted after the script finishes
#
# the script only performs one URL resolution pass - to resolve multiple nested URL wrappers
# (e.g. bit.ly links shortened a second time by t.co, etc.), run urlresolve.awk over the output file again,
# as often as desired (each round resolves an additional layer of URL shorteners)
#
# NOTE: the script skips YFrog, Twitpic, Imgur, Twitgoo, and Instagram URLs - these don't resolve any further
#
# once URLs have been resolved, run urltruncate.awk over the resulting file to extract the domain names only
#
# released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
	if(!maxlength) maxlength = 30						# maxlength defaults to 30 characters

	getline header
	print "longurl" FS header

	getline < FILENAME													# skip header row
	while(getline < FILENAME) {												# load all URLs to be resolved - urlextract.awk stored URLs in column 1
		if(match($1,/(yfrog.us|yfrog.com|twitpic.com|imgur.com|twitgoo.com|instagr.am)/) == 0) {		# skip various image hosting short URLs - they don't resolve any further
			if(length($1) <= maxlength) urls[$1] = $1								# only short URLs of up to maxlength characters will be resolved
		}
	}
	close(FILENAME)								# finish URL detection

	tempfile = FILENAME "_urllist_temp"
	tempfile_resolved = tempfile "_resolved"

	for(url in urls) {								# generate list of URLs for curl to resolve
		print "url=\"" url "\"" > tempfile
	}

											# run curl, using the tempfile
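											# e.g. for an input file input.csv, the generated command line is roughly:
											# curl -K input.csv_urllist_temp --head --insecure --write-out !!urlresolve!!,%{url_effective},%{redirect_url}, >input.csv_urllist_temp_resolved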
	system("curl -K " tempfile " --head --insecure --write-out !!urlresolve!!,%{url_effective},%{redirect_url}, >" tempfile_resolved)
	close("curl -K " tempfile " --head --insecure --write-out !!urlresolve!!,%{url_effective},%{redirect_url}, >" tempfile_resolved)

	oldFS = FS									# switch to CSV format for the moment
	FS = ","

	while(getline < tempfile_resolved) {					# curl output contains the full HTML headers, as well as the resolve information
		if($1 == "!!urlresolve!!") {					# resolve information is in the format "!!urlresolve!!,[original URL],[resolved URL]"
			resolved[$2] = $3
			if(!$3) resolved[$2] = $2					# for broken or otherwise unresolved URLs, we retain the original link
		}
	}

	close(tempfile_resolved)

	FS = oldFS

}

{
	if(!($1 in resolved)) resolved[$1] = $1					# for URLs longer than maxlength, or if curl failed to run because of severely broken URLs, retain the original link
	print resolved[$1] FS $0							# match resolved URLs and original links, and print
}

This script takes the output from urlextract.awk and resolves all short URLs in the original dataset; it adds the resolved URLs in a new column 'longurl' which is inserted before the existing data. The script is called as follows:

gawk -F , -f urlresolve.awk [maxlength=x] input.csv >output.csv

The (optional) maxlength argument specifies what we consider to be a short URL, and can further speed up the processing time: since the very point of short URLs is that they're, well, short, we can assume relatively safely that comparatively long URLs already point to the final destination URL, and don't need resolving. If maxlength isn't specified, it defaults to a relatively conservative value of 30 characters; to save time, you could drop that value down to 25 or less. A typical bit.ly URL, including the 'http://' part, is 20 characters long, a URL using Facebook's shortener fb.me is 22 characters, and youtu.be URLs clock in at 27 characters, so you'll need to work out your own comfort zone here.
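For example, to attempt to resolve only those URLs which are 25 characters or shorter, you would call the script as

gawk -F , -f urlresolve.awk maxlength=25 input.csv >output.csv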

It's also important to note that (again to save time) the script will automatically skip over any URLs pointing to the image hosting services YFrog, Twitpic, Imgur, Twitgoo, and Instagram. The URLs used by such sites are short, but - to the best of my knowledge - don't resolve any further, so there's no need to process them here. (If you're aware of any other widely used non-resolving short URL services, or if any of the services listed above do occasionally resolve to different URLs, please let me know!)

The urlresolve.awk script creates two temporary files in the working directory: filename_urllist_temp and filename_urllist_temp_resolved. These can be safely deleted once the script has finished, but may also be handy for spot-checking whether any problems occurred during URL resolution.
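To give a rough idea of what to expect (the URLs shown here are hypothetical), the first file simply lists every URL to be resolved, one per line, in cURL's config file format:

url="http://t.co/abc123"
url="http://bit.ly/xyz789"

The second file contains the HTTP headers which cURL received for each of these URLs; interspersed among them are the marker entries which the script actually parses, in the form

!!urlresolve!!,http://t.co/abc123,http://bit.ly/xyz789,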

As I've mentioned, cURL will only take one step in the URL resolution process, so a multiply shortened URL won't have arrived at its final destination after a single pass. It's useful to inspect the output file from the first pass visually (e.g. in Excel) to check whether there are still many shortened URLs remaining in the new 'longurl' column which urlresolve.awk has added. If so, simply run the script again, using the output file from the first pass as your input:

gawk -F , -f urlresolve.awk [maxlength=x] output.csv >output2.csv

Since the first pass will have resolved many short URLs already (so that they will now be above the maxlength threshold), this second pass should conclude considerably more quickly. It will add yet another new 'longurl' column to the left of the existing data table. Repeat the process as often as necessary if any particularly obstinate cases remain.
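As an alternative to eyeballing the file in Excel, a rough gawk one-liner along the following lines (shown with Windows-style double quotes, and assuming the default 30-character threshold) counts how many entries in the first column still look like short URLs after any given pass; on a Mac, wrap the program in single quotes instead:

gawk -F , "NR > 1 && length($1) <= 30 {n++} END {print n+0}" output2.csv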

Further Steps

Once you're happy with the URL resolution outcomes, it's easy to use the resulting dataset to find the most cited URLs or examine other relevant patterns in the data. In particular, it may also be useful to examine citation patterns not on the basis of fully qualified URLs, but by looking for the most cited domains only - this provides a better overview of which news or information sources overall were most widely used.
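As a quick sketch of such a count (again with Windows-style quoting), a gawk one-liner like the following tallies how often each value in the first column appears – run it over the resolved file to count full URLs, or over the output of urltruncate.awk (below) to count domains, and sort the result in Excel:

gawk -F , "NR > 1 {count[$1]++} END {for (u in count) print count[u] FS u}" output.csv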

To truncate URLs to their domain name, use our existing urltruncate.awk script (also from our Gawk scripts package). It adds yet another new column, 'domain', to the left of the existing dataset, and is run as follows:

gawk -F , -f urltruncate.awk input.csv >output.csv
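For the curious, the core of such a truncation step boils down to something like the following minimal gawk sketch – an illustration only, not the actual urltruncate.awk code:

# illustrative sketch only: strip the protocol and any path from the URL in column 1,
# and prepend the resulting domain name as a new first column
{
	domain = $1
	sub(/^https?:\/\//, "", domain)
	sub(/\/.*$/, "", domain)
	print domain FS $0
}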

Happy resolving!

About the Author

Dr Axel Bruns leads the QUT Social Media Research Group. He is an ARC Future Fellow and Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia. Bruns is the author of Blogs, Wikipedia, Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative Online News Production (2005), and a co-editor of Twitter and Society, A Companion to New Media Dynamics, and Uses of Blogs (2006). He is a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation. His research website is at snurb.info, and he tweets as @snurb_dot_info.

(4) Readers' Comments

  1. Pingback: The Social Life of a t.co URL visualized « Anne Helmond

  2. I’ve recently taken a similar approach to resolving short URLs with cURL for an upcoming paper, where it keeps resolving URLs until the final destination is reached. I’ve written down my approach/methodology here if you’re interested: http://www.annehelmond.nl/2012/02/14/the-social-life-of-a-t-co-url-visualized/

  3. A quick update for fellow Mac users (presuming you already have macports up and running, with gawk installed and all that).

    In order to get cURL working properly (including for https:// URLs):

    1. In Terminal, run sudo port selfupdate to ensure you have the latest port definitions – see here: http://guide.macports.org/#using.port.selfupdate
    2. Install cURL with the SSL variant as follows: sudo port install curl +ssl

    Then it should work perfectly with the provided script (recently tweaked to make sure it runs on the Mac, so make sure you have the latest version).

  4. Pingback: Mapping Online Publics » Blog Archive » Does The Australian’s Paywall Affect Link Sharing?