The first part of this post examined some of the basic stats on Twitter use during the 29 April 2011 royal wedding. Here, we’ll try something a little different: in the tweets using the #royalwedding hashtag between 00:00 and 23:59 GMT that day, what other hashtags were also used?

Hashtags, of course, aren’t mutually exclusive, and are often used for emphasis (or comic relief) as much as to make a genuine contribution to an existing conversational hashtag feed – or indeed, both at the same time. So, beyond #royalwedding as the key hashtag to be used to refer to the actual event itself, an examination of these other hashtags provides us with some useful nuances on how Twitter users perceived and contextualised the wedding, and a correlation of what secondary hashtags were used by which groups of users helps group these perspectives to some extent.

I must note that this way of examining hashtag use remains very experimental, and will still need some further thought – but that’s what these posts are for, and we’d very much appreciate your feedback on our thinking. So, then, what are we actually doing? In the first place, as I’ve explained in the previous post, we’ve started from our full archive of #royalwedding tweets from 29 April 2011, and processed this archive using a new Gawk script, hashcooccurrence.awk (included at the bottom of this post). The script runs through the following steps:

  1. For each user in the archive, it identifies all the various hashtags which the user has included in their tweets – in our case, #royalwedding (naturally), but also any secondary hashtags which may also occur in those tweets.
  2. For each user, the script then interprets any two of these various hashtags to be related. In other words, in order for two hashtags to be seen as related, they don’t have to occur in the same tweet – they simply must have been used by the same user, even if in separate tweets. (At least at present, the script does not add any weightings, however: rarely used hashtags will be just as strongly related to one another as more frequently used hashtags.)
  3. The script then lists these relationships in a simple network data format, in such a way that (when imported into our network visualisation software, Gephi) they will accumulate weight: in other words, the more individual users have used two specific hashtags in their tweets, the more closely connected those two hashtags will appear.
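
To make these three steps a little more concrete, here is a minimal Python sketch of the same logic. It is not a port of the Gawk script at the bottom of this post: the sample tweets, the usernames, and the function names hashtags_by_user and edge_rows are made up for illustration, and the hashtag pattern (a hash symbol followed by any run of non-space characters, with trailing punctuation stripped) is an assumption about what counts as a hashtag.

```python
import re
import string
from collections import defaultdict
from itertools import combinations

# Assumption: a hashtag is '#' followed by a run of non-space characters.
HASHTAG = re.compile(r"#(\S+)")


def hashtags_by_user(tweets):
    """Step 1: collect the set of hashtags each user has used, across all their tweets."""
    tags = defaultdict(set)
    for user, text in tweets:
        for raw in HASHTAG.findall(text):
            # Strip trailing punctuation and lowercase, so "#BBC!" becomes "#bbc".
            tag = raw.rstrip(string.punctuation).lower()
            if tag:
                tags[user].add("#" + tag)
    return tags


def edge_rows(tags_by_user):
    """Steps 2 and 3: emit one row per user for every pair of hashtags that user has used."""
    rows = []
    for user in sorted(tags_by_user):
        for a, b in combinations(sorted(tags_by_user[user]), 2):
            rows.append((a, b))
    return rows


# Hypothetical sample data: (username, tweet text).
tweets = [
    ("alice", "Watching the #RoyalWedding on #BBC!"),
    ("alice", "That #dress... #royalwedding"),
    ("bob", "#royalwedding #bbc"),
]

for a, b in edge_rows(hashtags_by_user(tweets)):
    print(a, b)
```

In this toy dataset the pair #bbc and #royalwedding is listed twice, once for each user, and it is this repetition that lets weight build up when the rows are merged into a single undirected edge.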

Let’s run through a brief example to explain this: suppose Twitter user @1 sends a series of tweets containing the hashtags #a, #b, and #c. Regardless of whether those hashtags occurred in the same tweets or not, and regardless of how many tweets with those hashtags user @1 sent, this will result in the script seeing three connections: #a to #b, #b to #c, and #a to #c. Now add user @2, whose tweets contain #b, #c, and #d. Again, the script will see connections from #b to #c, #c to #d, and #b to #d. Put together, this creates what in good old-fashioned ASCII art will look as follows:

  #a                          #a
 /  \                        /  \
#b - #c   +   #b - #c   =   #b = #c
               \  /          \  /
                #d            #d

In other words, #b and #c have a connection with double the weight (since both @1 and @2 are using both those hashtags in their tweets), while the other connections have only single weightings (since the hashtags #a and #d are each unique to only one of the users). I should also note that all of these network connections are undirected, since we’re dealing here simply with hashtags occurring in tweets by the same user – so there is no distinct ‘source’ or ‘target’ implied in these connections (and as we import the dataset into Gephi, this means ‘undirected’ must be selected as an option).
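
The arithmetic of this example can be checked in a few lines of Python. This is only an illustrative sketch: the function name edge_weights is made up, and each pair is stored in sorted order so that #b to #c and #c to #b count as one and the same undirected edge.

```python
from collections import Counter
from itertools import combinations

# The two users from the example above, with the hashtags each has used.
users = {
    "@1": {"#a", "#b", "#c"},
    "@2": {"#b", "#c", "#d"},
}


def edge_weights(tags_by_user):
    """Count, for every unordered pair of hashtags, how many users have used both."""
    weights = Counter()
    for tags in tags_by_user.values():
        # Sorting first gives each undirected pair a single canonical key.
        for pair in combinations(sorted(tags), 2):
            weights[pair] += 1
    return weights


weights = edge_weights(users)
for (a, b), weight in sorted(weights.items()):
    print(a, b, weight)
```

Five edges result, and only the one between #b and #c carries a weight of 2.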

As we apply this approach to our full royal wedding dataset, only one obvious problem remains: given that we’re already starting from a Twitter archive in which each tweet contains the #royalwedding marker, that hashtag will necessarily dominate the network, and all other hashtags will be connected to it. In this case, therefore (and the situation would be different if we’re dealing with other types of Twitter data), we’ve got to manually remove #royalwedding from the network in order to be able to see the other hashtags properly.
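
For anyone who would rather do this removal before import than by hand, it amounts to a simple filter over the edge list. What follows is a hedged sketch, assuming the edges are held as a Python dictionary from hashtag pairs to weights; the function name drop_hashtag and the sample weights are invented for illustration and are not taken from our dataset.

```python
def drop_hashtag(weights, unwanted):
    """Return a copy of the edge weights without any edge that touches the unwanted hashtag."""
    unwanted = unwanted.lower()
    return {pair: w for pair, w in weights.items() if unwanted not in pair}


# Hypothetical weights: every secondary hashtag is tied to #royalwedding.
weights = {
    ("#kate", "#royalwedding"): 3,
    ("#kate", "#william"): 2,
    ("#royalwedding", "#william"): 3,
}

remaining = drop_hashtag(weights, "#royalwedding")
print(remaining)
```

Only the edge between #kate and #william survives, which is the kind of connection the dominant hashtag would otherwise drown out.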

Once we’ve done so, here’s what results. First, here’s a reduced version of the graph, showing only the most widely used secondary hashtags (full PDF here – 610kB):
In the first place, it’s perfectly obvious that #rw2011 and #rw11 were important alternative hashtags to #royalwedding (and used in the same tweet with it – otherwise we wouldn’t be seeing them here). There’s an argument for manually removing them from this map, too, much as we’ve done with #royalwedding itself, but I didn’t want to mess with it too much for now. Things get more interesting once we look beyond those generic hashtags, anyway: there are some distinct zones in this map.

In the area immediately above #rw2011, we see a range of relatively prominent hashtags referring to the royals and their guests, in a number of ways – #william, #kate, #william&kate, #williamandkate, #princewilliam, #katemiddleton, and so on, all feature here, but also (zoom in a bit) #davidbeckham, #victoriabeckham, #princeharry, #queen, #thequeen, #princessdiana, #pippamiddleton, #pippa, as well as royal couturiers #sarahburton, #alexandermcqueen (or #mcqueen), and even #gracekelly. Interleaved with these names are various other trappings of the wedding ceremony, such as #abbey, #westminsterabbey (and for the dyslexic amongst us, #westministerabbey), #wedding, #dress, #eltonjohn, #fashion, #hats, #kiss, #thekiss, #theykissed, #balconykiss, #buckinghampalace, #congratulations, #fairytale, #love, and #royals. A few non-English hashtags also made it into the mix – I spotted the German #hochzeit (wedding) and #kleid (dress), for example. What I can’t quite explain yet is why #nfldraft also features prominently (at about two o’clock from #rw2011) – will need to have a closer look through the data to solve that one…

By contrast, below the #rw2011 hashtag (and around #rw2011), there’s a somewhat different group of hashtags: here, we find a range of more functional, and at times far more negative, secondary hashtags. First, a number of broadcasters are visible here: #cnntv, #cnn, #bbcwedding, #bbc, #abc, #foxnews, and #todayshow, plus #nowwatching as well as #fb (for Facebook) and #twitter itself. There are also prominent hashtags such as #makesmesmile, #whatsthepoint, #justsaying, #lol, and #wtf, as well as #proudtobebritish – and if you zoom in a little further, a few more minor hashtags from #royalhoneymoon to #royaldivorce, from #doctorwho to #whocares. There’s also another non-English hashtag here: #bodareal (Spanish for royal wedding). It’s very tempting to suggest, then, that the group of hashtags immediately above #rw2011 in the map is the work of those Twitter users who are highly invested in the event, following the pomp and circumstance in all its glory, while those below #rw2011 take a significantly more snarky, cynical perspective.

Further out from the centre of the map, a few other notable clusters are visible. Between about four and five o’clock from #rw2011, a smattering of hashtags spreads between #uk and #syria, and also includes #egypt, #libya, #bahrain, #yemen, and (a little further out) #usa: here, I expect we’ll find those users who criticise media coverage of the royal wedding as a distraction from the uprisings of the Arab spring. On the opposite side of the map, past #westminsterabbey, a tight cluster of hashtags references a range of themes relating to the royal family: #royalty, #aristocracy, #lineage, #genes, #blood, and #breeding, as well as #unionjacks, #titles, #britishness, and – somewhat less charitably – #bigearedbiters. Interestingly, a few notable characters are especially closely associated with this cluster: #charles, #camilla (or #duchessofcornwall), #dianaspencer, #davidcameron, and #borisjohnson – and #corgi also appears.

What such clusters point to are persistent side conversations and comments, which – while referencing the #royalwedding hashtag (otherwise we wouldn’t have them in our dataset) – really only use it as a departure point for tweets about some very different issues. Most of the minor clusters around the map can be explained in this way, though some, frankly, are a little stranger than others – on the far right, for example, we find a cluster combining #france, #india, #australia, #pudding, #ashes, #sherlockholmes, #jedi, and (Shane, I presume) #warne, amongst others, in a strange collision between cricket and who knows what. However, we probably shouldn’t read too much into such far outliers; the fact that they are outliers means that we’re reaching the edges of network cohesion, anyway. Move a little closer in, and the clusters make more sense again – even if, like the smear of clusters at about eight o’clock from #rw2011, they are the result of some persistent Twitter spam and spam-like activity: #teamfollow, #ifollowback, #autofollow, #tfb, #1000aday, #500aday, and other variations on the theme are all indicative of ‘follow me and I’ll follow you back’ schemes to artificially (and meaninglessly) boost one another’s Twitter follower networks, for example.

Finally, then, let’s have a look at the big picture, with a map of all the 46,000 secondary hashtags occurring alongside #royalwedding (full PDF here – 5MB):
The positions of various major hashtags and clusters have necessarily shifted somewhat here, but the overall picture remains the same. What this more detailed map enables us to do is to explore the specific minor hashtags associated with more frequently occurring tags – zoom in, for example, to the Arab spring cluster to see the full range of minor hashtags which are mentioned there.

What this approach to visualising the #royalwedding dataset provides, then, is a little more insight into the way Twitter users engage in such major events, I think: some – many, even – clearly do participate in a very engaged, focussed fashion, staying on-topic and following every detail of the unfolding ceremony; others, by contrast, use the hashtag stream merely as a point of departure for some far more obviously off-topic conversations and comments, even if they continue to include the hashtag in their tweets; yet others, in fact, deliberately retain the hashtag even while ostensibly talking about unrelated themes (such as the Arab spring) or simply spamming the hashtag feed, in order to make their messages visible to the potentially large audience following the hashtag feed itself. And in addition to all this, of course, we have a very long tail of inventive, idiosyncratic, and sometimes just incredibly odd uses of hashtags to emphasise and emote: in tweets which use the hash symbol as a Twitter-native alternative to emoticons or other extratextual cues.

What we must also keep in mind, of course, is that we’re dealing in this case exclusively with secondary hashtags: hashtags which were included in tweets alongside the main #royalwedding hashtag that was the criterion for including those tweets in our dataset in the first place. Further down the track, we’ll repeat this exercise for a number of different datasets (containing mentions of specific keywords or usernames, for example) – it will be interesting to see how our results will differ in those cases. We’ll also have a play at creating some of the heterogeneous Twitter maps which Bernhard Rieder over at The Politics of Systems has flagged – some very interesting ideas for further work…

 

Oh, and here’s the hashcooccurrence.awk script which we’ve used to process our data:

# hashcooccurrence.awk - create network of the hashtags used by each user
#
# this script takes a Twapperkeeper CSV/TSV archive of tweets, and creates a network of hashtags
# hashtags are taken to be connected with one another if they are used by the same user
# (regardless of whether they occur in the same or in multiple separate tweets)
#
# all hashtags are converted to lowercase
#
# notes:
# 1) this script generates an undirected network - it maps the co-occurrence of multiple hashtags
# in the update stream of each user it encounters in the dataset
# 2) each edge between hashtags appears twice: as a > b and as b > a - importing into Gephi as an
# undirected network results in edges a <> b of weight 1, however
#
# the script skips the first line, expecting that it contains header information
#
# expected data format:
# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
#
# output format:
# hashtag-from, hashtag-to
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
	getline
}

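# process every tweet whose text (field 1) contains at least one hashtag
$1 ~ /#/ {
	a=0
	do {
		# find the next hashtag in the not yet scanned remainder of the tweet text
		# (the three-argument form of match() is a gawk extension)
		match(substr($1, a),/#([[:graph:]]+)/,atArray)
		# move the scan offset past the hashtag just found
		a=a+atArray[1, "start"]+atArray[1, "length"]

		# strip trailing punctuation, so that "wedding!" becomes "wedding"
		sub(/([[:punct:]]+)$/,"",atArray[1])

		# append the lowercased hashtag to the comma-separated list for this user (field 3)
		if (atArray[1] != "") userhashs[$3] = userhashs[$3] "," tolower(atArray[1])

	} while(atArray[1, "start"] != 0)

}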

END {
	print "hashtag-from" FS "hashtag-to"

	for(user in userhashs) {
		split (userhashs[user],hashs,",")
		for(term in hashs) {
			if(hashs[term] != "") {
				for(target in hashs) {
					if((hashs[term] != hashs[target]) && (hashs[target] != "")) hashedge["#" hashs[term] FS "#" hashs[target]] = 1
				}
			}
		}
	}

	for(edge in hashedge) {
		print edge
	}
}

Feature image by Beacon Radio.