{"id":1072,"date":"2012-02-10T15:42:55","date_gmt":"2012-02-10T05:42:55","guid":{"rendered":"http:\/\/www.mappingonlinepublics.net\/dev\/?p=1072"},"modified":"2012-04-10T14:17:49","modified_gmt":"2012-04-10T04:17:49","slug":"resolving-short-urls-a-new-approach","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2012\/02\/10\/resolving-short-urls-a-new-approach\/","title":{"rendered":"Resolving Short URLs: A New Approach"},"content":{"rendered":"<p>When working with <em>Twitter<\/em> data, one of the most interesting questions is always what URLs tweets are linking to. As <em>Twitter<\/em> users discuss any given topic or issue, the URLs they share provide us with an indication of the online media they&#8217;re drawing on for information and\/or entertainment &#8211; and by counting which sites appear most frequently, we&#8217;re also able to measure the relative visibility or relevance of such sites.<\/p>\n<p>But of course, there&#8217;s a complication: the vast majority of URLs in tweets have been shortened using a variety of URL shorteners, and multiple short URLs may point to the same eventual target; additionally, it&#8217;s even possible &#8211; and not too uncommon &#8211; for shortener nesting to occur: for example, a <em>bit.ly<\/em> short URL might subsequently be shortened by <em>ow.ly<\/em>, and finally by <em>t.co<\/em>, in the course of retweeting. Working with the short URLs themselves is less than useful, therefore &#8211; and we must find ways to resolve them to their eventual target.<\/p>\n<p><!--more--><\/p>\n<p>I&#8217;ve never been entirely happy with the approach to resolving short URLs <a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/2010\/08\/02\/using-gawk-to-resolve-url-shorteners\/\">which I&#8217;ve come up with previously<\/a> &#8211; this used Gawk (of course) and WGet as a somewhat clunky but generally functional solution. 
With each URL to be resolved triggering a separate WGet call, and generating temporary files along the way, the results were neither elegant nor particularly fast, though. Speed can only be improved up to a point, of course &#8211; the script will still have to ping each short URL to be resolved at least once, to see where it points, and that process is largely dependent on Internet access and server response speeds &#8211; but nonetheless there was plenty of room for optimisation in my previous attempts.<\/p>\n<p>So, here&#8217;s a new approach to this problem. Instead of WGet, we&#8217;ll be using the command-line tool <a href=\"http:\/\/curl.haxx.se\/\">cURL<\/a> here, which is able to work with batch lists of URLs to process, and can be made to send its resulting output to a single file which we can then process again. The one downside of cURL, at least as it is used here, is that &#8211; unlike WGet &#8211; it only does a single URL resolution hop; this means that where there are nested URL shorteners, our script will need to be run at least twice to resolve the final remaining short URLs (but that&#8217;s a small price to pay for greater convenience).<\/p>\n<p><font color=\"#ff0000\"><strong>Update:<\/strong> It&#8217;s worth noting here that <em>Twitter<\/em> has recently introduced its own URL shortener, <em>t.co<\/em>, as a mandatory shortener &#8211; even links which have already been shortened using <em>bit.ly<\/em> or other tools are now shortened <em>again<\/em> to <em>t.co<\/em> URLs. Recent <em>yourTwapperkeeper<\/em> archives will contain only <em>t.co<\/em> links, therefore. This also means that at least two passes of urlresolve.awk will be required to unshorten those URLs to their eventual destinations: one pass to remove the <em>t.co<\/em> shortening, and another to resolve any remaining short URLs. 
You might even want to run a third pass for good measure (later passes should run considerably faster as they&#8217;ll find far fewer URLs still to resolve).<\/font><\/p>\n<h3>Installing cURL<\/h3>\n<p>So, the first step to switching over is to install cURL itself, which is available for a wide variety of platforms <a href=\"http:\/\/curl.haxx.se\/download.html\">here<\/a>. If you&#8217;re expecting your <em>Twitter<\/em> data to contain https:\/\/&#8230; (secure http) URLs in addition to standard http:\/\/&#8230; links, make sure you install an SSL-capable version of cURL. On Windows XP, I&#8217;ve found the <a href=\"http:\/\/www.paehl.com\/open_source\/?CURL_7.24.0\">&#8216;Win32 &#8211; Generic&#8217; cURL version by Dirk Paehl<\/a> to work well for me (for https support, use the version &#8216;with SSL&#8217;, and you&#8217;ll also need to install the openssl library available from the same site). On Windows Vista or 7, the <a href=\"http:\/\/curl.haxx.se\/download.html#Win64\">&#8216;Win64 &#8211; Generic&#8217; version by Don Luchini<\/a> is fine; looks like you also need to install the Microsoft Visual C++ Redistributable package, though (<a href=\"http:\/\/answers.microsoft.com\/en-us\/windows\/forum\/windows_7-windows_programs\/the-program-cant-start-becuase-msvcr100dll-is\/5c9d301a-2191-4edb-916e-5e4958558090\">details here<\/a>). If you&#8217;re using a Mac, <strike>I&#8217;m afraid you&#8217;re on your own as far as installation goes &#8211; <a href=\"http:\/\/curl.haxx.se\/download.html#MacOSX\">a list of package options is here<\/a><\/strike> &#8211; see Jean&#8217;s comment below for instructions.<\/p>\n<p>cURL (and the openssl libraries, if you need them) need to be placed in the command path. Since &#8211; if you&#8217;ve been using any of our Gawk scripts at all so far &#8211; you&#8217;ve already installed Gawk, the easiest solution is to place curl.exe (and any openssl .dll files you may also need) in the same directory as Gawk itself. 
Most likely, this is C:\\Program Files\\GnuWin32\\bin (on Windows XP) or C:\\Program Files (x86)\\GnuWin32\\bin (on Windows Vista\/7). To test whether cURL is installed and working, open a command window and try something like<\/p>\n<blockquote>\n<p>curl &#45;&#45;head &#45;&#45;insecure https:\/\/google.com\/<\/p>\n<\/blockquote>\n<p>(change the https to http if you&#8217;ve installed a version of cURL which doesn&#8217;t do secure http). If your shell finds cURL, and cURL itself doesn&#8217;t complain that it&#8217;s missing a library somewhere, you&#8217;re ready to roll.<\/p>\n<h3>Resolving URLs<\/h3>\n<p>The first step in resolving URLs is to extract them from a <em>Twapperkeeper<\/em>\/<em>yourTwapperkeeper<\/em> CSV\/TSV dataset. The process for this remains exactly as before &#8211; <a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/2011\/06\/22\/gawk-scripts-for-processing-twitter-data-vol-1\/\">the existing urlextract.awk script from our scripts package<\/a> does this for us (and generates multiple lines in the resulting file if a tweet happens to contain multiple URLs). 
Simply run the script as follows:<\/p>\n<blockquote>\n<p>gawk -F , -f urlextract.awk input.csv &gt;output.csv<\/p>\n<\/blockquote>\n<p>(I don&#8217;t have to remind you to use \\t instead of , as the separator if you&#8217;re working with tab-separated files, do I?)<\/p>\n<p>Now, then, it&#8217;s time to unveil our new urlresolve.awk script, which replaces the previous solution:<\/p>\n<div class=\"wlWriterEditableSmartContent\" id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:9fc47cb6-a512-4e3b-8cd1-94b74aff85d0\" style=\"padding-right: 0px; display: inline; padding-left: 0px; float: none; padding-bottom: 0px; margin: 0px; padding-top: 0px\">         <code><\/p>\n<pre>\r\n      \r\n# urlresolve.awk - resolves URL shortener URLs back to full URLs\r\n#\r\n# resolves shortened and other redirected URLs to their actual target\r\n# the script preserves the first line, expecting that it contains header information\r\n#\r\n# run urlextract.awk before running this script over the resulting file\r\n#\r\n# input format: CSV\/TSV generated by urlextract.awk, i.e.\r\n# url, [Twapperkeeper format]\r\n#\r\n# output format:\r\n# longurl, url, [Twapperkeeper format]\r\n#\r\n# script requires curl to be sitting in the command path\r\n# use an SSL-capable version of curl if you need to resolve https URLs\r\n#\r\n# script takes an optional argument maxlength, specifying the maximum length of URLs to be resolved\r\n# this is especially useful for running multiple passes over the dataset - after the first run,\r\n# maxlength can be reduced to pick up only the remaining short URLs\r\n#\r\n# maxlength defaults to 30 characters (including http:\/\/ etc.) 
if it isn't specified\r\n# \r\n# script will create temporary files in the current directory (for the curl output):\r\n# - [filename]_urllist_temp (a list of all URLs in the input CSV\/TSV file)\r\n# - [filename]_urllist_temp_resolved (curl output from the URL resolution process)\r\n#\r\n# these files are not automatically deleted after the script finishes\r\n#\r\n# the script only performs one URL resolution pass - to resolve multiple nested URL wrappers\r\n# (e.g. bit.ly links shortened a second time by t.co, etc.), run urlresolve.awk over the output file again,\r\n# as often as desired (each round resolves an additional layer of URL shorteners)\r\n#\r\n# NOTE: the script skips YFrog, Twitpic, Imgur, Twitgoo, and Instagram URLs - these don't resolve any further\r\n#\r\n# once URLs have been resolved, run urltruncate.awk over the resulting file to extract the domain names only\r\n#\r\n# released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\r\n\r\nBEGIN {\r\n\tif(!maxlength) maxlength = 30\t\t\t\t\t\t# maxlength defaults to 30 characters\r\n\r\n\tgetline header\r\n\tprint \"longurl\" FS header\r\n\r\n\tgetline < FILENAME\t\t\t\t\t\t\t\t\t\t\t\t\t# skip header row\r\n\twhile(getline < FILENAME) {\t\t\t\t\t\t\t\t\t\t\t\t# load all URLs to be resolved - urlextract.awk stored URLs in column 1\r\n\t\tif(match($1,\/(yfrog.us|yfrog.com|twitpic.com|imgur.com|twitgoo.com|instagr.am)\/) == 0) {\t\t# skip various image hosting short URLs - they don't resolve any further\r\n\t\t\tif(length($1) <= maxlength) urls[$1] = $1\t\t\t\t\t\t\t\t# only short URLs of up to maxlength characters will be resolved\r\n\t\t}\r\n\t}\r\n\tclose(FILENAME)\t\t\t\t\t\t\t\t# finish URL detection\r\n\r\n\ttempfile = FILENAME \"_urllist_temp\"\r\n\ttempfile_resolved = tempfile \"_resolved\"\r\n\r\n\tfor(url in urls) {\t\t\t\t\t\t\t\t# generate list of URLs for curl to resolve\r\n\t\tprint \"url=\\\"\" url \"\\\"\" > tempfile\r\n\t}\r\n\r\n\t\t\t\t\t\t\t\t\t\t\t# run curl, using 
the tempfile\r\n\tsystem(\"curl -K \" tempfile \" --head --insecure --write-out !!urlresolve!!,%{url_effective},%{redirect_url}, >\" tempfile_resolved)\r\n\tclose(\"curl -K \" tempfile \" --head --insecure --write-out !!urlresolve!!,%{url_effective},%{redirect_url}, >\" tempfile_resolved)\r\n\r\n\toldFS = FS\t\t\t\t\t\t\t\t\t# switch to CSV format for the moment\r\n\tFS = \",\"\r\n\r\n\twhile(getline < tempfile_resolved) {\t\t\t\t\t# curl output contains the full HTTP headers, as well as the resolve information\r\n\t\tif($1 == \"!!urlresolve!!\") {\t\t\t\t\t# resolve information is in the format \"!!urlresolve!!,[original URL],[resolved URL]\"\r\n\t\t\tresolved[$2] = $3\r\n\t\t\tif(!$3) resolved[$2] = $2\t\t\t\t\t# for broken or otherwise unresolved URLs, we retain the original link\r\n\t\t}\r\n\t}\r\n\r\n\tclose(tempfile_resolved)\r\n\r\n\tFS = oldFS\r\n\r\n}\r\n\r\n{\r\n\tif(!($1 in resolved)) resolved[$1] = $1\t\t\t\t\t# for URLs longer than maxlength, or if curl failed to run because of severely broken URLs, retain the original link\r\n\tprint resolved[$1] FS $0\t\t\t\t\t\t\t# match resolved URLs and original links, and print\r\n}\r\n        <\/pre>\n<p><\/code>\n      <\/div>\n<p>&#160; <br \/>This script takes the output from urlextract.awk and resolves all short URLs in the original dataset; it adds the resolved URLs in a new column 'longurl' which is inserted before the existing data. The script is called as follows:<\/p>\n<blockquote>\n<p>gawk -F , -f urlresolve.awk [maxlength=<em>x<\/em>] input.csv &gt;output.csv<\/p>\n<\/blockquote>\n<p>The (optional) maxlength argument specifies what we consider to be a <em>short<\/em> URL, and can further speed up the processing time: since the very point of short URLs is that they're, well, short, we can assume relatively safely that comparatively long URLs already point to the final destination URL, and don't need resolving. 
If maxlength isn't specified, it defaults to a relatively conservative value of 30 characters; to save time, you could drop that value down to 25 or less. A typical <em>bit.ly<\/em> URL, including the 'http:\/\/' part, is 20 characters long, a URL using <em>Facebook<\/em>'s shortener <em>fb.me<\/em> is 22 characters, and <em>youtu.be<\/em> URLs clock in at 27 characters, so you'll need to work out your own comfort zone here.<\/p>\n<p>It's also important to note that (again, to save time) the script will automatically skip over any URLs pointing to the image hosting services <em>YFrog<\/em>, <em>Twitpic<\/em>, <em>Imgur<\/em>, <em>Twitgoo<\/em>, and <em>Instagram<\/em>. The URLs used by such sites are short, but - to the best of my knowledge - don't resolve any further, so there's no need to process them here. (If you're aware of any other widely used non-resolving short URL services, or if any of the services listed above do occasionally resolve to different URLs, please let me know!)<\/p>\n<p>The urlresolve.awk script creates two temporary files in the working directory: <em>filename<\/em>_urllist_temp and <em>filename<\/em>_urllist_temp_resolved. These can be safely deleted once the script has finished, but may also be handy for spot-checking if any problems have occurred during URL resolution.<\/p>\n<p>As I've mentioned, cURL will only take one step in the URL resolution process. A multiply shortened URL won't have arrived at its final destination in one pass, therefore. It's useful to inspect the output file from the first pass visually (e.g. in Excel) to check whether there still are many shortened URLs remaining in the new 'longurl' column which urlresolve.awk has added. 
If so, simply run the script again, using the output file from the first pass as your input:<\/p>\n<blockquote>\n<p>gawk -F , -f urlresolve.awk [maxlength=<em>x<\/em>] output.csv &gt;output2.csv<\/p>\n<\/blockquote>\n<p>Since the first pass will have resolved many short URLs already (so that they will now be above the maxlength threshold), this second pass should conclude considerably more quickly. It will add yet another new 'longurl' column to the left of the existing data table. Repeat the process as often as necessary if any particularly obstinate cases remain.<\/p>\n<h3>Further Steps<\/h3>\n<p>Once you're happy with the URL resolution outcomes, it's easy to use the resulting dataset to find the most cited URLs or examine other relevant patterns in the data. In particular, it may also be useful to examine citation patterns not on the basis of fully qualified URLs, but by looking for the most cited domains only - this provides a better overview of which news or information sources overall were most widely used.<\/p>\n<p>To truncate URLs to their domain name, use our existing urltruncate.awk script (<a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/2011\/06\/22\/gawk-scripts-for-processing-twitter-data-vol-1\/\">also from our Gawk scripts package<\/a>). It adds yet another new column, 'domain', to the left of the existing dataset, and is run as follows:<\/p>\n<blockquote>\n<p>gawk -F , -f urltruncate.awk input.csv &gt;output.csv<\/p>\n<\/blockquote>\n<p>Happy resolving!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When working with Twitter data, one of the most interesting questions is always what URLs tweets are linking to. 
As Twitter users discuss any given topic or issue, the URLs they share provide us with an indication of the online media they&#8217;re drawing on for information and\/or entertainment &#8211; and by counting which sites appear &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2012\/02\/10\/resolving-short-urls-a-new-approach\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Resolving Short URLs: A New Approach&#8221;<\/span><\/a><\/p>\n<p><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":2,"featured_media":1327,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,176,113,8],"tags":[157,7,156,155,6,298],"class_list":["post-1072","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-methods","category-processing","category-tools-2","category-twitter","tag-curl","tag-gawk","tag-resolving","tag-short-urls","tag-tools","tag-twitter","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/1072","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=1072"}],"version-history":[{"count":0,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/1072\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media\/1327"}],"wp:attachment":[{"href":"https:\/\/mappingonlinep
ublics.net\/dev\/wp-json\/wp\/v2\/media?parent=1072"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=1072"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=1072"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}