{"id":269,"date":"2010-09-22T13:49:55","date_gmt":"2010-09-22T03:49:55","guid":{"rendered":"http:\/\/www.mappingonlinepublics.net\/dev\/2010\/09\/22\/more-blog-network-data-cleaning-with-gawk\/"},"modified":"2012-04-30T14:46:04","modified_gmt":"2012-04-30T04:46:04","slug":"more-blog-network-data-cleaning-with-gawk","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2010\/09\/22\/more-blog-network-data-cleaning-with-gawk\/","title":{"rendered":"More Blog Network Data Cleaning with Gawk"},"content":{"rendered":"<p>The other day I outlined <a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/2010\/09\/20\/cleaning-up-blog-network-data-with-gawk\/\">some first steps in cleaning our blog network data<\/a> (generated by our partner researchers at <a href=\"http:\/\/sociomantic.com\/\">Sociomantic Labs<\/a>) ahead of visualising it, and posted <a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/2010\/09\/20\/first-steps-in-mapping-the-australian-blogosphere\/\">a first tentative visualisation of the part of the Australian blogosphere<\/a> that we&#8217;re currently tracking. In this post I&#8217;ll continue that discussion, describing a few more steps in processing the data (again using <a href=\"http:\/\/www.gnu.org\/software\/gawk\/\">Gawk<\/a>).<\/p>\n<p>Just to reiterate briefly the current limitations of our dataset:<\/p>\n<ul>\n<li>We&#8217;re tracking some 8,500 feeds at the moment, some of which are mainstream news sites or other sites with RSS feeds &#8211; so we&#8217;re only covering a part of the overall Australian blogosphere at this point.<\/li>\n<li>We&#8217;re still improving our approaches to extracting post texts and links from the blog pages &#8211; right now, our data still include text and links which are not in the posts themselves, but elsewhere on the page.<\/li>\n<\/ul>\n<p>But even so, we can already begin to test our methods. 
Now, what we achieved in the previous post was a Gawk script that truncated link destinations to their most meaningful component, in order to make network visualisation possible &#8211; if the link destination matched the base URL of one of the sites we&#8217;re following (e.g. <em>domain.com.au\/blog\/<\/em>), we used that URL instead of the full link URL (e.g. <em>domain.com.au\/blog\/post-title.html<\/em>); if the link destination was unknown, we truncated it to the domain only (e.g. <em>domain.com.au<\/em>). To improve the readability of the resulting network graph, we also dropped &#8216;http:\/\/&#8217; and &#8216;www.&#8217;.<\/p>\n<p>The first outcome of this process was the set of network maps I published in my last post, which further filtered the network to include only those sites which we&#8217;re actively tracking (including a number of mainstream media sites). But clearly that&#8217;s only one part of the picture &#8211; we&#8217;re just as interested in the extent to which the blogs we&#8217;re tracking are linking to <em>other<\/em> sites, from mainstream media in Australia and elsewhere through <em>YouTube<\/em>, <em>Flickr<\/em>, <em>Facebook<\/em>, and other social media sites, to any other sites which may be relevant to all Australian bloggers or any specific clusters in the blogosphere. 
To get there, we&#8217;ll have to massage the data a little further.<\/p>\n<p><!--more--><\/p>\n<p>Here, we&#8217;re starting again with the truncated network data file as I&#8217;ve described it above &#8211; a CSV file in the format <em>source,destination<\/em> which typically contains lines such as<\/p>\n<div id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:1d84dfb8-31de-40a2-a82f-ca8dfed77bf3\" class=\"wlWriterEditableSmartContent\" style=\"display: inline; float: none; margin: 0px; padding: 0px;\">\n<pre>blogs.crikey.com.au\/pollytics,clubtroppo.com.au\r\ncatallaxyfiles.com,johnquiggin.com<\/pre>\n<\/div>\n<p>First, I&#8217;m interested in removing from this list any links that do not originate from actual blogs. Again, as part of what we&#8217;re doing in this project we&#8217;re also tracking some mainstream media sites and other sources, but here I&#8217;m interested only in what the <em>blogs<\/em> are linking to. Happily, we&#8217;ve categorised the list of sites we follow, so we already have a simple list of non-blog sites that we can use to filter the network data. 
I&#8217;m using this list as &#8216;sourcefilterlist.csv&#8217; in the following script:<\/p>\n<div id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:7f5b2c77-3d9e-4be3-bf9d-fa5b6ffde8f7\" class=\"wlWriterEditableSmartContent\" style=\"display: inline; float: none; margin: 0px; padding: 0px;\">\n<pre># sourcefilter.awk - filter known irrelevant source nodes from a network CSV\r\n#\r\n# this script takes a network data CSV file and removes any links originating from nodes listed on a filter list\r\n#\r\n# script expects 'sourcefilterlist.csv' to be located in current directory\r\n# filter list should contain nodes to be filtered in its first column\r\n#\r\n# expected data format:\r\n# from,to\r\n#\r\n# output format:\r\n# from,to (filtered)\r\n#\r\n# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\r\n\r\nBEGIN {\r\n\t# load the filter list; testing for a positive return value also ends the loop if the file cannot be read\r\n\twhile((getline &lt; \"sourcefilterlist.csv\") &gt; 0) {\r\n\t\tsub(\"http:\/\/\", \"\", $1)\r\n\t\tsub(\"www[.]\", \"\", $1)\r\n\t\tsub(\"\/$\", \"\", $1)\r\n\t\tfilterblog[$1] = $1\r\n\t}\r\n\r\n\t# discard the first line of the input (assumed to be the header row) and write a fresh header\r\n\tgetline\r\n\tprint \"from,to\"\r\n}\r\n\r\n!filterblog[$1] {\r\n\t# omit any links which end in empty destinations\r\n\tif($2) print $0\r\n}<\/pre>\n<\/div>\n<p>The script makes use of a very handy feature of the Gawk scripting language &#8211; array items needn&#8217;t just be named array[1], array[2], etc., but can have any name, so the BEGIN block above creates an array filterblog[] that contains items such as filterblog[<em>domain1.com.au<\/em>], filterblog[<em>domain2.com.au<\/em>], etc. In the main clause of the script, all we then need to do is to check whether a filterblog[] array entry exists that is named after the source URL of the specific link we&#8217;re processing. 
(To ensure accurate matches, I&#8217;m again removing any occurrences of &#8216;http:\/\/&#8217;, &#8216;www.&#8217;, or trailing slashes from the list of source URLs to be filtered, by the way.)<\/p>\n<p>So, that&#8217;s the first filtering step out of the way. This will have removed any links originating from the non-blog sites we&#8217;re tracking: for example, it will have removed news.com.au or news.bbc.co.uk (but not blogs.news.com.au\/dailytelegraph\/joehildebrand, since that&#8217;s a genuine blog, if hosted on an MSM site).<\/p>\n<p>The second step is to do some filtering of the destination URLs, too. This will become less important as we improve our link extraction algorithms, but right now there still are plenty of destination URLs that are merely functional and have no real meaning, but will seriously skew our data if left unchecked. In the first place, then, let&#8217;s run a very simple script to count the number of inlinks each destination URL receives:<\/p>\n<div id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:109ee6c6-c546-45b7-ac55-d3f1fa1a065e\" class=\"wlWriterEditableSmartContent\" style=\"display: inline; float: none; margin: 0px; padding: 0px;\">\n<pre># linkcount.awk - count indegree for destinations in a network data CSV file\r\n#\r\n# this script takes a network data CSV file and counts how many inlinks each destination node receives\r\n#\r\n# expected data format:\r\n# from,to\r\n#\r\n# output format:\r\n# name of the destination node, number of inlinks received\r\n#\r\n# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\r\n\r\nBEGIN {\r\n\tprint \"destination,indegree\"\r\n}\r\n\r\n{\r\n\tlinkcount[$2]++\r\n}\r\n\r\nEND {\r\n\tfor(i in linkcount) {\r\n\t\tprint i \",\" linkcount[i]\r\n\t}\r\n}<\/pre>\n<\/div>\n<p>The output from this script is another simple CSV file containing the columns <em>destination,indegree<\/em>. 
Sorting this list by the value of indegree generates the following list, for my current dataset (showing the top 20 only):<\/p>\n<table border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr>\n<td width=\"244\"><strong>destination<\/strong><\/td>\n<td width=\"362\"><strong>indegree<\/strong><\/td>\n<\/tr>\n<tr>\n<td>blogger.com<\/td>\n<td>217036<\/td>\n<\/tr>\n<tr>\n<td>twitter.com<\/td>\n<td>83763<\/td>\n<\/tr>\n<tr>\n<td>heraldsun.com.au<\/td>\n<td>55457<\/td>\n<\/tr>\n<tr>\n<td>facebook.com<\/td>\n<td>36616<\/td>\n<\/tr>\n<tr>\n<td>en.wordpress.com<\/td>\n<td>35408<\/td>\n<\/tr>\n<tr>\n<td>feeds.feedburner.com<\/td>\n<td>35129<\/td>\n<\/tr>\n<tr>\n<td>flickr.com<\/td>\n<td>34528<\/td>\n<\/tr>\n<tr>\n<td>blogcatalog.com<\/td>\n<td>31054<\/td>\n<\/tr>\n<tr>\n<td>healthproductreview.net<\/td>\n<td>29236<\/td>\n<\/tr>\n<tr>\n<td>feedproxy.google.com<\/td>\n<td>28433<\/td>\n<\/tr>\n<tr>\n<td>wordpress.org<\/td>\n<td>24660<\/td>\n<\/tr>\n<tr>\n<td>dailytelegraph.com.au<\/td>\n<td>22516<\/td>\n<\/tr>\n<tr>\n<td>fashionising.com<\/td>\n<td>14283<\/td>\n<\/tr>\n<tr>\n<td>www2b.abc.net.au<\/td>\n<td>14244<\/td>\n<\/tr>\n<tr>\n<td>3.bp.blogspot.com<\/td>\n<td>13904<\/td>\n<\/tr>\n<tr>\n<td>4.bp.blogspot.com<\/td>\n<td>13870<\/td>\n<\/tr>\n<tr>\n<td>1.bp.blogspot.com<\/td>\n<td>13730<\/td>\n<\/tr>\n<tr>\n<td>2.bp.blogspot.com<\/td>\n<td>13531<\/td>\n<\/tr>\n<tr>\n<td>youtube.com<\/td>\n<td>13086<\/td>\n<\/tr>\n<tr>\n<td>statcounter.com<\/td>\n<td>11192<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Some of these sites clearly appear here not because the bloggers linked to them in their posts, but for far more simple, functional reasons. 
<em>Blogger.com<\/em>, <em>en.wordpress.com<\/em>, and <em>WordPress.org<\/em> appear only because many blogs run on either Blogger or WordPress; <em>feeds.feedburner.com<\/em> and <em>feedproxy.google.com<\/em> are there because some bloggers choose to pipe their RSS feeds through these sites rather than offer them directly from their own domains; <em>www2b.abc.net.au<\/em> is the domain that the ABC blogs&#8217; commenting and complaints functions are served from; <em>1\/2\/3\/4.bp.blogspot.com<\/em> is where Blogger blogs store their embedded images and other media content; and <em>statcounter.com<\/em> should be self-explanatory.<\/p>\n<p>Each of these URLs, and many others outside the top 20, can be safely removed from the network without affecting our analysis &#8211; remember that what we&#8217;re interested in here is whom bloggers are actively linking to as part of their distributed conversations, not what functional links are included on their sites. By contrast, links to mainstream media sites such as the <em>Herald Sun<\/em> or <em>Daily Telegraph<\/em>, to social media sites like <em>Twitter<\/em>, <em>Facebook<\/em>, <em>Flickr<\/em>, or <em>YouTube<\/em>, should stay, even if <em>some<\/em> of them might merely be static links to the blogger&#8217;s own social media profile rather than discursive links to specific pages on these sites that might be relevant to the content of a blog post.<\/p>\n<p>For our purposes here, I&#8217;ve worked through the top 100 most linked-to sites in the network, and added the following 34 sites to my destination URL filter list:<\/p>\n<blockquote><p>blogger.com, feeds.feedburner.com, en.wordpress.com, blogcatalog.com, feedproxy.google.com, wordpress.org, www2b.abc.net.au, 3.bp.blogspot.com, 4.bp.blogspot.com, 1.bp.blogspot.com, 2.bp.blogspot.com, membercentre.fairfax.com.au, statcounter.com, ad.au.doubleclick.net, alluremedia.com.au, networkedblogs.com, c.moreover.com, addthis.com, widgetbox.com, wordpress.com, 
feeds.nytimes.com, linkwithin.com, adserver.adtechus.com, quantcast.com, feedburner.google.com, delicious.com, linkedin.com, feeds2.feedburner.com, feedjit.com<\/p><\/blockquote>\n<p>Nothing in this list should be particularly controversial &#8211; in addition to the sites I&#8217;ve already mentioned, there are a handful of advertising servers, various widget servers (<em>addthis.com<\/em>, for example, serves a JavaScript snippet that allows visitors to quickly tweet or bookmark a page), and sites such as <em>delicious.com<\/em> and <em>linkedin.com<\/em> which I would expect to be linked to predominantly as a way of pointing to the blogger&#8217;s bookmarks and profile, not for discursive reasons. Conversely, I&#8217;ve decided against including <em>google.com<\/em> here, which also appears prominently in the top 100, because there are so many possible reasons for linking to a <em>Google<\/em> site or sub-site that we just can&#8217;t be sure whether the link is functional (e.g. on-site search) or discursive (e.g. 
link to the translation of a foreign-language site).<\/p>\n<p>With this list stored in a file called &#8216;destinationfilterlist.csv&#8217;, then, we can call up another script that is virtually identical to sourcefilter.awk, but processes the destination URL of our network data:<\/p>\n<div id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:bea0cf30-9043-4982-93e1-61a555947e7f\" class=\"wlWriterEditableSmartContent\" style=\"display: inline; float: none; margin: 0px; padding: 0px;\">\n<pre># destinationfilter.awk - filter known irrelevant destination nodes from a network CSV\r\n#\r\n# this script takes a network data CSV file and removes any links pointing to nodes listed on a filter list\r\n#\r\n# script expects 'destinationfilterlist.csv' to be located in current directory\r\n# filter list should contain nodes to be filtered in its first column\r\n#\r\n# expected data format:\r\n# from,to\r\n#\r\n# output format:\r\n# from,to (filtered)\r\n#\r\n# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\r\n\r\nBEGIN {\r\n\t# load the filter list; testing for a positive return value also ends the loop if the file cannot be read\r\n\twhile((getline &lt; \"destinationfilterlist.csv\") &gt; 0) {\r\n\t\tsub(\"http:\/\/\", \"\", $1)\r\n\t\tsub(\"www[.]\", \"\", $1)\r\n\t\tsub(\"\/$\", \"\", $1)\r\n\t\tfilterblog[$1] = $1\r\n\t}\r\n\r\n\t# discard the first line of the input (assumed to be the header row) and write a fresh header\r\n\tgetline\r\n\tprint \"from,to\"\r\n}\r\n\r\n!filterblog[$2] {\r\n\t# omit any links which end in empty destinations\r\n\tif($2) print $0\r\n}<\/pre>\n<\/div>\n<p>Running this script over the output of sourcefilter.awk, we&#8217;re now left with a link network CSV file that includes only links originating from the genuine blogs within the overall population of sites we&#8217;re tracking, and only those pointing to destinations which are not on our filter list.<\/p>\n<p>For the period of 17 July to 27 August 2010, <a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/2010\/09\/20\/first-steps-in-mapping-the-australian-blogosphere\/\">which I looked at in the previous post<\/a>, this now reduces the total number of links from over 6.8 million to just 
under 3.4 million. I&#8217;ll post a visualisation of the network in my next post.<\/p>\n<p>Feature image from <a title=\"GNU Operating Systems.\" href=\"http:\/\/www.gnu.org\/\" target=\"_blank\">GNU Operating Systems<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The other day I outlined some first steps in cleaning our blog network data (generated by our partner researchers at Sociomantic Labs) ahead of visualising it, and posted a first tentative visualisation of the part of the Australian blogosphere that we&#8217;re currently tracking. In this post I&#8217;ll continue that discussion, describing a few more steps &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2010\/09\/22\/more-blog-network-data-cleaning-with-gawk\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;More Blog Network Data Cleaning with Gawk&#8221;<\/span><\/a><\/p>\n
","protected":false},"author":2,"featured_media":1489,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[191,5,176],"tags":[45,44,7,23,46,42,41],"class_list":["post-269","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blogs-2","category-methods","category-processing","tag-blogosphere","tag-blogs","tag-gawk","tag-gephi","tag-hyperlinks","tag-mapping","tag-network","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/269","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=269"}],"version-history":[{"count":0,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/269\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media\/1489"}],"wp:attachment":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media?parent=269"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=269"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=269"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}