Update: revision 1.2 of metrify.awk is now available (still at the link below), and introduces some further functionality, which is outlined here.
Over the past couple of posts, I’ve introduced our new metrify.awk Twitter metrics script, and looked at the first of the three metrics tables produced by the script. Let’s move on now to the second table, where I’ll use a snapshot of Australian political discussion on Twitter under the #auspol hashtag between February and August 2011, instead of #qldfloods – the overall metrics for the different user percentiles in the #qldfloods dataset turn out not to be particularly interesting… As before, we’re dividing the total userbase according to the 1/9/90 rule into the 1% of most active users, the next 9% of moderately active users, and the final 90% of least active users. (In the case of #auspol, that first percentile contains 142, the second percentile contains 1291, and the final percentile contains 12700 of a total of 14133 users.)
The second table generated by metrify.awk provides us with detailed metrics on these three percentiles, on an overall basis rather than per specific time period.
This table contains the following columns:
- percentile: the various percentiles making up the userbase, as well as total metrics for the entire userbase
- various stats on tweets of these different types: the types match those we’ve already encountered in the previous blog post, and stats on these tweet types are provided in each case as total numbers, and as a percentage of the number of tweets posted by the user percentile in question
- original tweets
- genuine @replies
- unedited retweets
- edited retweets
- tweets containing URLs
Again, too, these figures will add up to the total:
- edited retweets + unedited retweets = retweets
- retweets + genuine @replies = @replies
- original tweets + genuine @replies + retweets = total number of tweets
- % edited retweets + % unedited retweets = % retweets
- % original tweets + % genuine @replies + % retweets = 100%
(with tweets containing URLs again constituting a separate category, since any type of tweet may also contain URLs).
Applying this to our
#royalwedding #auspol dataset, here’s what the activities of the different user percentiles look like:
We’re clearly seeing some very significant differences between the various percentile groups here. Interaction amongst the top 1% of most active users is especially discursive, with more than 55% of all of their tweets constituting genuine @replies: these people are very actively talking to (or at) one another.
The next lower group of active users, by contrast, doesn’t engage as much: only one third of their tweets are genuine @replies, but nearly 39% are original tweets. They’re more active at posting their own views and comments, rather than responding to others – or at least (and this is important to keep in mind with any such metrics), they’re less in the habit of also marking their @replies with the #auspol hashtag. By contrast, the top group are much more overtly performing their conversations, making them visible to all followers of #auspol; the second group may well send their own @replies, but if those @replies don’t contain the hashtag #auspol, they’re less visible to others and not included in our hashtag dataset.
Finally, too, the least active 90% of users are participating differently again: some 52% of their tweets are retweets, so (given that they’re not posting to #auspol that often in the first place) they’re probably more likely to be present here simply as ‘drive-by’ retweeters who occasionally pass along an interesting #auspol-tagged message that shows up in their Twitter feeds, but don’t deliberately follow the continuing #auspol conversation itself.
There are two more useful statistics to examine for #auspol, and I’ve combined them in the graph above: first, the percentage of the total volume of #auspol tweets that each group is responsible for (shown here in blue): the one percent of most active users – a total of 142 Twitter users, for the period we’re looking at – accounts for a staggering 62% of all #auspol tweets. In other words, Australian political discussion on Twitter, under the #auspol banner, is dominated by a vanishingly small group of users whose output is massively disproportional to the size of the group. Compare this with the least active 90%: those more than 12,000 users contribute less than 9% of all #auspol posts. Quite a difference – #auspol shows a very strong long-tail distribution amongst its active participants, then. (This is very different for many of the crisis-related hashtags we’ve looked at, by the way: the top 1% of most active users in #qldfloods, for example, are responsible for less than 17% of all tweets; the least active 90% of #qldfloods users for nearly 57%.)
Second, the distribution of tweets containing URLs is also interesting here. We already know that the lowest 90% are more likely to retweet than post their own commentary or @replies – and it looks like many of those retweets are of posts containing URLs: some 37% of all tweets by the bottom 90% include links. By contrast, the discursive few at the top of the activity scale include URLs in only 18% of their tweets.
Percentile Metrics, Compared
But beyond these metrics for the various user percentiles in individual hashtags, we can also compare these findings across different hashtag datasets – and that’s where things get really interesting. There are very many possible comparisons here: how do the individual percentiles of users compare across the different hashtags (something I’ve already hinted at above, comparing the relative contribution of the top 1% in #auspol and #qldfloods, for example), which hashtags contain more @replies, retweets, URLs, etc.?
We’ve only scratched the surface on these broader comparisons, but one very interesting pattern which has already emerged is shown in the graph below (which remains preliminary; one of my plans for the next month or so is to develop this further):
Here, we’re comparing the total metrics (for all users, rather than for specific percentiles) across a range of different hashtags: #qldfloods, #eqnz, the Japanese #tsunami, #libya, the #londonriots, #ukriots, and #riotcleanup, the #royalwedding, election nights in Australia and Ireland (#ausvotes and #ge11), the Tour de France (#tdf), #eurovision, and #wikileaks. The size of each point on the graph shows the total size of the userbase for each hashtag – so, the #royalwedding and the #tsunami attracted a vastly larger Twitter userbase (of around half a million unique users each) than the Irish election or Queensland floods, for example.
But what the graph shows is that independent of the size of the userbase, there are some very obvious patterns here. All of the crisis events are characterised by a large number of both (unedited) retweets and tweets sharing links; people are actively finding and disseminating information. All of the widely televised events, on the other hand, have very few URLs, and only marginally more retweets: Twitter may be used as a backchannel for the television, in a shared experience of audiencing, but there’s not much additional information sharing going on here. #wikileaks, in turn, is a different story altogether – but perhaps we’ll come across more hashtags with similar metrics, and it’s the first sign of a third major category.
I’m reluctant to read too much more into these patterns as yet – first, I’ll need to do some more work cleaning up the datasets which the graph above is based on (working out which exact periods of time to use for each hashtag, and trying comparisons of a few more different combinations of metrics. I do think there’s a first sign in this of much more fundamental patterns in how Twitter hashtags are used for specific purposes. But that’s a longer discussion for another time.
And we haven’t yet exhausted all the possibilities which metrify.awk itself offers. In addition to the time- and/or percentile-based metrics which we’ve discussed over these last couple of posts, it also calculates metrics for each individual user in the dataset. And that’s what we’ll look at in the final instalment in this series.