Blogs, Methods, Processing — Snurb, 22 September 2010
More Blog Network Data Cleaning with Gawk

The other day I outlined some first steps in cleaning our blog network data (generated by our partner researchers at Sociomantic Labs) ahead of visualising it, and posted a first tentative visualisation of the part of the Australian blogosphere that we’re currently tracking. In this post I’ll continue that discussion, describing a few more steps in processing the data (again using Gawk).

Just to reiterate briefly the current limitations of our dataset:

  • We’re tracking some 8,500 feeds at the moment, some of which are mainstream news sites or other sites with RSS feeds – so we’re only covering a part of the overall Australian blogosphere at this point.
  • We’re still improving our approaches to extracting post texts and links from the blog pages – right now, our data still include text and links which are not in the posts themselves, but elsewhere on the page.

But even so, we can already begin to test our methods. In the previous post, we developed a Gawk script that truncated link destinations to their most meaningful component, in order to make network visualisation possible: if the link destination matched the base URL of one of the sites we’re following (e.g. domain.com.au/blog/), we used that URL instead of the full link URL (e.g. domain.com.au/blog/post-title.html); if the link destination was unknown, we truncated it to the domain only (e.g. domain.com.au). To improve the readability of the resulting network graph, we also dropped ‘http://’ and ‘www.’.
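
For reference, a minimal sketch of what such a truncation step might look like in Gawk follows; this is purely an illustration of the logic just described, not the actual script from the previous post, and the filter file name ‘knownsites.csv’ is my own placeholder.

# truncate.awk - illustrative sketch only: reduce full link destinations
# to base URLs (for tracked sites) or to domains (for everything else)
# 'knownsites.csv' (first column: base URLs of tracked sites) is a placeholder name
BEGIN {
	FS = OFS = ","
	while((getline line < "knownsites.csv") > 0) {
		split(line, cols, ",")
		base = cols[1]
		sub("http://", "", base); sub("www[.]", "", base); sub("/$", "", base)
		if(base != "") known[base] = base
	}
}

{
	src = $1; dest = $2
	sub("http://", "", src); sub("www[.]", "", src)
	sub("http://", "", dest); sub("www[.]", "", dest)
	truncated = ""
	# if the destination starts with a tracked site's base URL, keep that base URL
	for(b in known)
		if(index(dest, b) == 1) { truncated = b; break }
	# otherwise reduce the destination to its domain only
	if(truncated == "") {
		truncated = dest
		sub("/.*$", "", truncated)
	}
	print src, truncated
}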

The first outcome of this process was the set of network maps I published in my last post, which further filtered the network to include only those sites we’re actively tracking (including a number of mainstream media sites). But clearly that’s only one part of the picture – we’re just as interested in the extent to which the blogs we’re tracking link to other sites: from mainstream media in Australia and elsewhere, through YouTube, Flickr, Facebook, and other social media sites, to any other sites that may be relevant to Australian bloggers in general or to specific clusters in the blogosphere. To get there, we’ll have to massage the data a little further.

Here, we’re starting again with the truncated network data file as I’ve described it above – a CSV file in the format source,destination which typically contains lines such as

blogs.crikey.com.au/pollytics,clubtroppo.com.au
catallaxyfiles.com,johnquiggin.com

First, I’m interested in removing from this list any links that do not originate from actual blogs. Again, as part of what we’re doing in this project we’re also tracking some mainstream media sites and other sources, but here I’m interested only in what the blogs are linking to. Happily, we’ve categorised the list of sites we follow, so we already have a simple list of non-blog sites that we can use to filter the network data. I’m using this list as ‘sourcefilterlist.csv’ in the following script:

# sourcefilter.awk - filter known irrelevant source nodes from a network CSV
#
# this script takes a network data CSV file and removes any links originating from nodes listed on a filter list
#
# script expects 'sourcefilterlist.csv' to be located in current directory
# filter list should contain nodes to be filtered in its first column
#
# expected data format:
# from,to
#
# output format:
# from,to (filtered)
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
	FS = ","

	# read the filter list; testing the getline return value avoids an
	# endless loop if the file is missing
	while((getline < "sourcefilterlist.csv") > 0) {
		sub("http://", "", $1)
		sub("www[.]", "", $1)
		sub("/$", "", $1)
		filterblog[$1] = $1
	}

	# consume the header line of the main input and print a fresh one
	getline
	print "from,to"
}

!($1 in filterblog) {
	# omit any links which end in empty destinations
	if($2) print $0
}
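
A typical invocation from the command line (the input and output file names here are placeholders of my own) would be something like:

gawk -f sourcefilter.awk network.csv > blogsources.csv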

The script makes use of one of the handiest features of the Gawk scripting language: array items don’t have to be named array[1], array[2], and so on, but can be indexed by any string, so the BEGIN block above creates an array filterblog[] that contains items such as filterblog[domain1.com.au], filterblog[domain2.com.au], etc. In the main rule of the script, all we then need to do is check (using Gawk’s ‘in’ operator) whether a filterblog[] entry exists that is named after the source URL of the specific link we’re processing. (To ensure accurate matches, I’m again removing any occurrences of ‘http://’, ‘www.’, and trailing slashes from the list of source URLs to be filtered, by the way.)
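
As a minimal illustration of that pattern (the domain names here are invented for the example):

BEGIN {
	filterblog["domain1.com.au"] = "domain1.com.au"
	if("domain1.com.au" in filterblog) print "on the filter list"
	if(!("unknown.example.com" in filterblog)) print "not on the list"
}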

So, that’s the first filtering step out of the way. It will have removed any links originating from the non-blog sites we’re tracking: for example, links from news.com.au or news.bbc.co.uk are now gone (but not those from blogs.news.com.au/dailytelegraph/joehildebrand, since that’s a genuine blog, albeit one hosted on an MSM site).

The second step is to do some filtering of the destination URLs, too. This will become less important as we improve our link extraction algorithms, but right now there are still plenty of destination URLs that are merely functional, carry no real meaning, and will seriously skew our data if left unchecked. As a first step, then, let’s run a very simple script to count the number of inlinks each destination URL receives:

# linkcount.awk - count indegree for destinations in a network data CSV file
#
# this script takes a network data CSV file and counts how many inlinks each destination node receives
#
# expected data format:
# from,to
#
# output format:
# name of the destination node, number of inlinks received
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
	FS = ","
	print "destination,indegree"
}

# skip the header line of the input file
NR == 1 { next }

{
	linkcount[$2]++
}

END {
	for(i in linkcount) {
		print i "," linkcount[i]
	}
}
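
Again, a typical invocation (file names are placeholders, continuing from the example above) might be:

gawk -f linkcount.awk blogsources.csv > linkcounts.csv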

The output from this script is another simple CSV file containing the columns destination,indegree. Sorting this list by the value of indegree generates the following list, for my current dataset (showing the top 20 only):

destination indegree
blogger.com 217036
twitter.com 83763
heraldsun.com.au 55457
facebook.com 36616
en.wordpress.com 35408
feeds.feedburner.com 35129
flickr.com 34528
blogcatalog.com 31054
healthproductreview.net 29236
feedproxy.google.com 28433
wordpress.org 24660
dailytelegraph.com.au 22516
fashionising.com 14283
www2b.abc.net.au 14244
3.bp.blogspot.com 13904
4.bp.blogspot.com 13870
1.bp.blogspot.com 13730
2.bp.blogspot.com 13531
youtube.com 13086
statcounter.com 11192
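
The sorting itself can be done in a spreadsheet or at the command line; assuming the linkcount output was saved as linkcounts.csv (a placeholder name, as in the invocation above), something like

sort -t, -k2,2nr linkcounts.csv | head -20

would produce such a top-twenty list.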

Some of these sites clearly appear here not because the bloggers linked to them in their posts, but for far simpler, functional reasons. Blogger.com, en.wordpress.com, and WordPress.org appear only because many blogs run on either Blogger or WordPress; feeds.feedburner.com and feedproxy.google.com are there because some bloggers choose to pipe their RSS feeds through these services rather than offer them directly from their own domains; www2b.abc.net.au is the domain from which the ABC blogs’ commenting and complaints functions are served; 1/2/3/4.bp.blogspot.com are where Blogger blogs store their embedded images and other media content; and statcounter.com should be self-explanatory.

Each of these URLs, and many others outside the top 20, can safely be removed from the network without affecting our analysis – remember that what we’re interested in here is whom bloggers are actively linking to as part of their distributed conversations, not what functional links are included on their sites. By contrast, links to mainstream media sites such as the Herald Sun or Daily Telegraph, and to social media sites like Twitter, Facebook, Flickr, or YouTube, should stay, even if some of them might merely be static links to the blogger’s own social media profile rather than discursive links to specific pages on those sites that are relevant to the content of a blog post.

For our purposes here, I’ve worked through the top 100 most linked-to sites in the network, and added the following 34 sites to my destination URL filter list:

blogger.com, feeds.feedburner.com, en.wordpress.com, blogcatalog.com, feedproxy.google.com, wordpress.org, www2b.abc.net.au, 3.bp.blogspot.com, 4.bp.blogspot.com, 1.bp.blogspot.com, 2.bp.blogspot.com, membercentre.fairfax.com.au, statcounter.com, ad.au.doubleclick.net, alluremedia.com.au, networkedblogs.com, c.moreover.com, addthis.com, widgetbox.com, wordpress.com, feeds.nytimes.com, linkwithin.com, adserver.adtechus.com, quantcast.com, feedburner.google.com, delicious.com, linkedin.com, feeds2.feedburner.com, feedjit.com

Nothing in this list should be particularly controversial – in addition to the sites I’ve already mentioned, there are a handful of advertising servers, various widget servers (addthis.com, for example, serves a JavaScript snippet that allows visitors to quickly tweet or bookmark a page), and sites such as delicious.com and linkedin.com, which I would expect to be linked to predominantly as a way of pointing to the blogger’s bookmarks and profile, not for discursive reasons. Conversely, I’ve decided against including google.com, which also appears prominently in the top 100, because there are so many possible reasons for linking to a Google site or sub-site that we just can’t be sure whether the link is functional (e.g. on-site search) or discursive (e.g. a link to the translation of a foreign-language site).

With this list stored in a file called ‘destinationfilterlist.csv’, then, we can call up another script that is virtually identical to sourcefilter.awk, but processes the destination URL of our network data:

# destinationfilter.awk - filter known irrelevant destination nodes from a network CSV
#
# this script takes a network data CSV file and removes any links pointing to nodes listed on a filter list
#
# script expects 'destinationfilterlist.csv' to be located in current directory
# filter list should contain nodes to be filtered in its first column
#
# expected data format:
# from,to
#
# output format:
# from,to (filtered)
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
	FS = ","

	# read the filter list; testing the getline return value avoids an
	# endless loop if the file is missing
	while((getline < "destinationfilterlist.csv") > 0) {
		sub("http://", "", $1)
		sub("www[.]", "", $1)
		sub("/$", "", $1)
		filterblog[$1] = $1
	}

	# consume the header line of the main input and print a fresh one
	getline
	print "from,to"
}

!($2 in filterblog) {
	# omit any links which end in empty destinations
	if($2) print $0
}

Running this script over the output of sourcefilter.awk, we’re now left with a link network CSV file that includes only links originating from the genuine blogs within the overall population of sites we’re tracking, and pointing only to destinations that are not on our filter list.
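
Putting the two filtering passes together, the full sequence (with placeholder file names, as before) would be something like:

gawk -f sourcefilter.awk network.csv > blogsources.csv
gawk -f destinationfilter.awk blogsources.csv > blognetwork-filtered.csv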

For the period of 17 July to 27 August 2010, which I looked at in the previous post, this reduces the total number of links from over 6.8 million to just under 3.4 million. I’ll post a visualisation of the resulting network in my next post.

Feature image from GNU Operating Systems.

About the Author

Dr Axel Bruns leads the QUT Social Media Research Group. He is an ARC Future Fellow and Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia. Bruns is the author of Blogs, Wikipedia, Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative Online News Production (2005), and a co-editor of Twitter and Society, A Companion to New Media Dynamics and Uses of Blogs (2006). He is a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation. His research Website is at snurb.info, and he tweets as @snurb_dot_info.

