{"id":137,"date":"2010-08-02T17:28:07","date_gmt":"2010-08-02T07:28:07","guid":{"rendered":"http:\/\/www.mappingonlinepublics.net\/dev\/2010\/08\/02\/using-gawk-to-resolve-url-shorteners\/"},"modified":"2022-02-22T22:16:14","modified_gmt":"2022-02-22T12:16:14","slug":"using-gawk-to-resolve-url-shorteners","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2010\/08\/02\/using-gawk-to-resolve-url-shorteners\/","title":{"rendered":"Using Gawk and Wget to Resolve URL Shorteners"},"content":{"rendered":"<p><a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/2010\/08\/02\/most-tweeted-ausvotes-links-last-wee\/\">Jean&#8217;s post today<\/a> points to a key problem in examining user activities on <em>Twitter<\/em> and elsewhere &#8211; people are increasingly using bit.ly and other URL shorteners, which means that a) the same target URL might appear in any number of different shortened versions, and b) it&#8217;s no longer possible from a quick look at a list of URLs to select only those which are from a specific site (for example, <em>YouTube<\/em> videos).<\/p>\n<p>For our purposes, that&#8217;s a significant problem &#8211; we might want to find out, for example, which were the most popular videos shared during the election campaign, the most popular articles on <em>abc.net.au<\/em>, and so on. So, we need to resolve those shortened URLs back to their original state. This could be done through the APIs of the various shortening services, of course, but with literally <a href=\"http:\/\/long-shore.com\/services\">hundreds<\/a> of different shorteners now available, that would probably require specific unshortening scripts for each service &#8211; far too much work. So, what can we do?<\/p>\n<p><!--more--><\/p>\n<p>Once again, our favourite CSV processing tool, <a href=\"http:\/\/www.gnu.org\/software\/gawk\/\">Gawk<\/a>, can help. 
In fact, Gawk has some in-built networking functions, and there&#8217;s even a standard script for retrieving Web pages that might form the basis for a URL resolver tool &#8211; but unfortunately, the Windows version of Gawk doesn&#8217;t implement that networking functionality. But there&#8217;s a workaround using another open-source tool, <a href=\"http:\/\/www.gnu.org\/software\/wget\/\">Wget<\/a>: we can call it from a Gawk script to do the URL retrieval work and check for any URL redirections in the process. (Like Gawk, Wget is available in versions for all major operating systems.)<\/p>\n<p>But one step at a time. In the first place, here&#8217;s linkextract.awk, a Gawk script to extract the URLs from a list of tweets (for example, a <em>Twapperkeeper<\/em> archive). Where there are multiple URLs in the same tweet, the script will create multiple lines; its output is a new CSV file which contains the URL in the first column, followed by all the original information.<\/p>\n<div id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:adc250ae-4cbe-47f2-bd34-5d1a3e7dbeba\" class=\"wlWriterEditableSmartContent\" style=\"display: inline; float: none; margin: 0px; padding: 0px;\"><code><code><\/code><\/code><\/p>\n<p><code><code><\/code><\/code><\/p>\n<pre>      \n# linkextract.awk - extract hyperlinks from Twapperkeeper export data\n#\n# this script takes a CSV archive of tweets in Twapperkeeper format\n#\n# expected data format:\n# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time\n#\n# output format:\n# link, [original fields]\n#\n# for tweets containing multiple links, the script adds new rows for each link\n#\n# Released under Creative Commons (BY, NC, SA) by Jean Burgess and Axel Bruns - je.burgess@qut.edu.au \/ a.bruns@qut.edu.au\n\nBEGIN {\n\tgetline header\n\tprint \"url,\" header\n\tIGNORECASE = 
1\n}\n\n$0 ~ \/http([A-Za-z0-9[:punct:]]+)\/ {\n\na=0\ndo {\n\tmatch(substr($1, a),\/http([A-Za-z0-9[:punct:]]+)?\/,atArray)\n\ta=a+atArray[0, \"start\"]+atArray[0, \"length\"]\n\n\tif (atArray[0] != 0) print atArray[0] \",\" $0\n\n} while(atArray[0, \"start\"] != 0)\n\n}\n<\/pre>\n<p><code><\/code><\/p>\n<\/div>\n<p>Normally, you&#8217;d want to invoke this script using a command line like <span style=\"font-family: Courier New;\">gawk -F , -f linkextract.awk input.csv &gt;output.csv<\/span>.<\/p>\n<p>So far, so good. Next, we need a script that resolves the links in the new &#8216;url&#8217; column in the CSV to their full URLs (and leaves intact those URLs which don&#8217;t use URL shorteners, of course). For this, we need three interlocking scripts. First, urlget.awk: this simply takes a command-line input and calls up Wget (with the --spider option, to avoid having to retrieve the whole page), and sends the Wget output to a temporary file called urltemp.tmp in the current directory.<\/p>\n<div id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:01251355-3100-45cf-b133-7ad47efb1bee\" class=\"wlWriterEditableSmartContent\" style=\"display: inline; float: none; margin: 0px; padding: 0px;\"><code><code><\/code><\/code><\/p>\n<p><code><code><\/code><\/code><\/p>\n<pre>      \n# urlget.awk - use wget to spider a given URL (passed to the script as argument)\n#\n# outputs wget result as file 'urltemp.tmp' to current directory\n#\n# released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\n\n\nBEGIN {\n\n\tsystem(\"wget -t 1 -o urltemp.tmp --spider \" url)\n\n}\n\n\n<\/pre>\n<p><code><\/code><\/p>\n<\/div>\n<p>You could call this up using <span style=\"font-family: Courier New;\">gawk -f urlget.awk -v url=http:\/\/abc.net.au\/<\/span> to see what it does &#8211; open the resulting &#8216;urltemp.tmp&#8217; file to see the Wget output.<\/p>\n<p>The second script we need is parseurltemp.awk, which takes the Wget output and extracts from it what 
for our purposes is the relevant information. When called with the --spider option, Wget doesn&#8217;t retrieve the whole page at the URL, but instead simply tests that it is there and generates some header information. From this, we want to extract the original source URL (on line 2 of Wget&#8217;s output, following a retrieval timestamp), and the URL of the target page if it differs from the source. If it is different, the target appears on a line that starts with the word &#8220;Location:&#8221; &#8211; so we know what to look for here, too.<\/p>\n<p><span style=\"color: #ff0000;\"><strong>UPDATE:<\/strong> Looks like there are cases where Wget produces output with multiple &#8216;Location:&#8217; lines. I&#8217;ve now changed parseurltemp.awk to stop after the first one, which seems to contain what we&#8217;re after&#8230;<\/span><\/p>\n<div id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:b0a14b36-f462-4daf-9744-ee5b44eae990\" class=\"wlWriterEditableSmartContent\" style=\"display: inline; float: none; margin: 0px; padding: 0px;\"><code><code><\/code><\/code><\/p>\n<p><code><code><\/code><\/code><\/p>\n<pre>      \n# parseurltemp.awk - extract redirection source and target URLs from wget spider output\n#\n# parses standard wget --spider output\n# extracts source URL from second line: e.g. --2010-08-02 14:41:47--  http:\/\/smh.com.au\/\n# extracts target URL from 'Location: ' statement\n# sets target = source if no target is found (i.e. 
no redirection present)\n#\n# output: source,target\n#\n# released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\n\n{\n\tif((NR == 2) &amp;&amp; (match($0,\"--  \"))) { \n\t\tsource = substr($0, 26, length($0) - 25)\n\t\ttarget = source\t\t\t\t# in case no target location is found\n\t}\n\n\n\tif(match($0, \"Location\")) {\n\t\ttarget = substr($0, 11, length($0) - 22)\n\t\texit\n\t}\n\n}\n\nEND {\n\tprint source \",\" target\n}\n\n<\/pre>\n<p><code><\/code><\/p>\n<\/div>\n<p>So, parseurltemp.awk simply checks the file it is fed for those two lines, and outputs them in a simple source,target CSV format. If no redirection from the source URL is detected, target is set to the same value as source. Again, this isn&#8217;t a script we want to call manually &#8211; so to join all of this together, we need a third script that invokes urlget.awk and parseurltemp.awk for each line of a CSV file that contains the URLs we want to resolve: urlresolve.awk.<\/p>\n<div id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:632f05a5-7aed-4faa-8ed8-cd29414dfefb\" class=\"wlWriterEditableSmartContent\" style=\"display: inline; float: none; margin: 0px; padding: 0px;\"><code><code><\/code><\/code><\/p>\n<p><code><code><\/code><\/code><\/p>\n<pre>      \n# urlresolve.awk - resolves URL shortener URLs back to full URLs\n#\n# script requires:\n# - wget to be sitting in the command path\n# - path variable to be set to common directory for AWK script\n# - urlget.awk and parseurltemp.awk to be in script path directory\n# \n# script will create a temporary file 'urltemp.tmp' in the current directory (for the wget output)\n#\n# released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\n\nBEGIN {\n\tpath = \"..\\\\_scripts\\\\\"\t\t# point this to AWK script directory\n\tgetline header\n\tprint \"longurl,\" header\n}\n\n\n{\n\n\toriginal = $0\n\n\tsystem(\"gawk -F , -v url=\" $1 \" -f \" path \"urlget.awk\")\n\n\t\"gawk -F , -f \" path 
\"parseurltemp.awk urltemp.tmp\" | getline $resolved\n\tclose(\"gawk -F , -f \" path \"parseurltemp.awk urltemp.tmp\")\n\n\tprint $2 \",\" original\n\n}\n<\/pre>\n<p><code><\/code><\/p>\n<\/div>\n<p>This script, then, takes any CSV file which contains simple URLs in its first column; for each line of the CSV file, it first invokes urlget.awk to spider the source URL and generate the urltemp.tmp file, and then parseurltemp.awk to parse the temp file and identify the target of any URL redirection that may be in place. It adds the resolved target URLs to the CSV file as a new first column, called &#8216;longurl&#8217; (with the resolved URL set to the original URL if no redirection was detected).<\/p>\n<p>Usually, this would then be invoked using something like <span style=\"font-family: Courier New;\">gawk -F , -f urlresolve.awk output.csv &gt;resolved.csv<\/span> (using the output.csv file created by the linkextract.awk script above). Depending on your own setup, you&#8217;ll also need to change the &#8216;path&#8217; variable in the BEGIN statement of linkextract.awk &#8211; this should point to where all your gawk scripts are located.<\/p>\n<p>So there it is &#8211; a relatively lightweight Gawk\/Wget solution for resolving shortened URLs to the original. Plenty of room for improvement, no doubt &#8211; and we&#8217;d love to hear from anyone who has done any further work on these scripts. One small issue we&#8217;ve identified already: there&#8217;s a handful of sites which don&#8217;t report back full &#8216;Location&#8217; URLs, but only relative ones (i.e. &#8216;\/home\/index.html&#8217; instead of &#8216;http:\/\/example.com\/home\/index.html&#8217;). This doesn&#8217;t seem to be a particularly common problem, so I&#8217;m reluctant to spend much time adding a special fix to the scripts, but it does mean some manual tidying of the output may be necessary. 
<span style=\"color: #ff0000;\"><strong>UPDATE:<\/strong> My revised parseurltemp.awk seems to deal with those problems, too.<\/span><\/p>\n<p><strong>Finally, a word of warning:<\/strong> with large input CSVs, this URL resolution process might take considerable time, as each URL needs to be spidered individually &#8211; right now, there&#8217;s no caching of Wget results built in here (in the time it&#8217;s taken me to write this post, my script has managed to ping about 3000 URLs, for what it&#8217;s worth). Also, the rapid-fire querying of URLs might upset some ISPs &#8211; so only try this at home if you know what you&#8217;re doing, folks&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Jean&#8217;s post today points to a key problem in examining user activities on Twitter and elsewhere &#8211; people are increasingly using bit.ly and other URL shorteners, which means that a) the same target URL might appear in any number of different shortened versions, and b) it&#8217;s no longer possible from a quick look at a &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2010\/08\/02\/using-gawk-to-resolve-url-shorteners\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Using Gawk and Wget to Resolve URL Shorteners&#8221;<\/span><\/a><\/p>\n<p><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt 
--><\/p>\n","protected":false},"author":2,"featured_media":1327,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5,176,8],"tags":[7,297,9,298,31,30],"class_list":["post-137","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-methods","category-processing","category-twitter","tag-gawk","tag-methods","tag-twapperkeeper","tag-twitter","tag-url-shorteners","tag-wget","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/137","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=137"}],"version-history":[{"count":1,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/137\/revisions"}],"predecessor-version":[{"id":3494,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/137\/revisions\/3494"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media\/1327"}],"wp:attachment":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media?parent=137"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=137"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=137"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}