So, 2011 is finally over – and what a year it’s been. While the confluence of natural disasters, political crises, and other major events has also provided us with the basis for a new research programme in crisis communication, let’s hope that 2012 is a little less intense, please…
To start the new year on a positive note, I’m finally getting around to sharing some more information about the new approach to generating Twitter metrics which we’ve developed over the past few months – this actually started during the research workshops we had with Stefan Stieglitz’s group at the University of Münster in August, so it’s taken some time to gestate into its present form. What it’s now turned into is quite a powerful tool for generating detailed information about a specific Twitter dataset – intended mainly for the study of hashtags, but with applications well beyond this. Amongst other things, it enables us to distinguish more effectively between different groups of participating users (from highly active lead users to much less active casual participants), and to track different types of participation, in total or by these specific groups, over time.
Update: revision 1.2 of metrify.awk is now available (still at the link below), and introduces some further functionality, which is outlined here.
The Gawk script we’re using for this is called metrify.awk, and it’s available here (ZIP file; you’ll need to unpack it). Metrify.awk is unusual in that it generates three different results tables within the one output CSV/TSV file; the output file is designed to be opened in Excel or another spreadsheet application, for further interrogation or charting. Metrify.awk takes standard Twapperkeeper or (with our modification) yourTwapperkeeper archives as input, and it’s run from the command line as follows:
gawk -F , -f metrify.awk divisions=<list of percentiles> time=<time period> [skipusers=1] tweets.csv >metrics.csv
I’ll explain the parameters as we go through the different results tables which metrify.awk generates. Obviously, use \t instead of , if you’re dealing with tab- rather than comma-separated datasets.
Distinguishing Different User Groups
The first aim of metrify.awk is to develop better distinctions between different groups of users. In any Twitter hashtag dataset, for example, there will usually be a long tail-style distribution of activity: from a handful of highly active ‘lead users’ through to a larger number of hangers-on who may only be present in the dataset because they happened to retweet a hashtagged message, without even paying much attention to the existence of the hashtag in the first place. In many circumstances, we might want to focus only on that central group of most active users, or examine how they act differently from the more marginal members of the hashtag community (to the extent that hashtags can be considered as communities in the first place, obviously).
One standard tool for making such distinctions is the division of the total userbase into different percentiles of more or less active users (where ‘active’ is measured simply by how many tweets they’ve contributed to the hashtag). Common divisions of this kind include the so-called 10/90 or 1/9/90 rules, which distinguish variously between the top 10% of users and the rest of the group, or between the top 1% of lead users, the next 9% of engaged but less active users, and the rest. In other circumstances, a more even division of the userbase into two halves (i.e. 50/50) or four quarters (25/25/25/25) might also be useful.
Metrify.awk supports any such divisions through the divisions command-line parameter. This parameter specifies where the divisions should be made, through a comma-separated list of cutoff points, counting from the least active users to the top: “90”, for example, creates a 10/90 division, “90,99” creates a 1/9/90 division, and “25,50,75” divides the userbase into four quarters of the same size. “10”, by contrast, would put the bottom 10% of least active users into one group, and the rest of the userbase into another. If divisions is not specified, metrify.awk defaults to “90,99” – i.e. dividing the userbase according to the 1/9/90 rule.
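To make the cutoff notation concrete, here is a minimal Python sketch of how such a list translates into group sizes. It is only an illustration of the arithmetic, not the code inside metrify.awk; the function name parse_divisions is made up for this example, and the default simply mirrors the “90,99” default described above.

```python
def parse_divisions(spec="90,99"):
    """Turn a cutoff list such as "90,99" into group shares (in percent),
    listed from the least active group to the most active one.

    Hypothetical helper for illustration; not part of metrify.awk.
    """
    cutoffs = sorted(int(part) for part in spec.split(",") if part.strip())
    if any(c <= 0 or c >= 100 for c in cutoffs):
        raise ValueError("cutoffs must lie strictly between 0 and 100")
    bounds = [0] + cutoffs + [100]
    # Each group's share is the distance between two neighbouring cutoffs.
    return [upper - lower for lower, upper in zip(bounds, bounds[1:])]


print(parse_divisions("90"))        # [90, 10]
print(parse_divisions("90,99"))     # [90, 9, 1]
print(parse_divisions("25,50,75"))  # [25, 25, 25, 25]
print(parse_divisions("10"))        # [10, 90]
```

Note that the shares come out in the same order as the cutoffs, i.e. least active group first, so “90,99” reads as 90/9/1 here rather than 1/9/90.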
One note of caution on these divisions: imagine you’re dealing with a group of ten users, for which you’d like to apply the 10/90 rule – that is, the top user is counted in a different category from the rest. What if two or more of these ten users share the top spot – i.e., they have contributed the same number of tweets? Which of them should be counted as the top user, which of them should be counted with the rest? Such a small group of only ten users is an extreme example, of course – most of the hashtag datasets to which we’re applying metrify.awk will include thousands or tens of thousands of unique users. But still, the problem can occur here, too.
In such cases, metrify.awk takes an inclusive approach, counting from the top on down: if users on either side of the boundary between the first and second percentile group have the same number of tweets, those in the lower group are also moved to the higher group. In our example above, all the users sharing the top number of tweets would be counted in the top percentile, even if this means that the division ends up 20/80 or 30/70 instead. For larger datasets, the effects are usually far less extreme: a 10/90 division might blow out to 11/89 or 12/88 instead, but I think that’s preferable to making an arbitrary choice between equally active users.
Where things can get more problematic is with more even divisions (e.g. 25/25/25/25) and strong long-tail distributions of activity. Here, it’s quite possible that both the lowest 25% and the next higher 25% consist of users who have contributed only one tweet to the total dataset. In such cases, all those users will end up in a combined percentile which actually covers all of the bottom 50%, with the fourth (i.e. lowest) quarter remaining empty. If you see this in your own results, choose different division points (e.g. 25/25/50, i.e. “50,75” as the command-line parameter).
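The inclusive tie rule and the empty-group effect described above can be sketched in a few lines of Python. This is a reconstruction from the description in this post, not the actual metrify.awk code: the function name split_users, the dictionary input, and the way the nominal boundary is rounded are all assumptions made for the example.

```python
def split_users(tweet_counts, cutoffs):
    """Split users into activity groups, most active group first.

    tweet_counts: dict of user -> number of tweets (assumed input shape)
    cutoffs: percentile cutoffs counted from the least active users, e.g. [90, 99]
    """
    # Rank users from most to least active; names break ties deterministically.
    ranked = sorted(tweet_counts, key=lambda u: (-tweet_counts[u], u))
    n = len(ranked)
    groups, start = [], 0
    for cutoff in sorted(cutoffs, reverse=True):
        # Nominal boundary: number of users above this cutoff (rounding is an assumption).
        end = max(start, (n * (100 - cutoff) + 50) // 100)
        # Inclusive rule: pull users tied with the last member up into the higher group.
        while 0 < end < n and tweet_counts[ranked[end]] == tweet_counts[ranked[end - 1]]:
            end += 1
        groups.append(ranked[start:end])
        start = end
    groups.append(ranked[start:])
    return groups


# Ten users, three of whom share the top spot: 10/90 becomes 30/70.
small = {"a": 9, "b": 9, "c": 9, "d": 4, "e": 3, "f": 2, "g": 2, "h": 1, "i": 1, "j": 1}
print([len(g) for g in split_users(small, [90])])              # [3, 7]

# Long tail: the bottom half all tweeted once, so the lowest quarter stays empty.
long_tail = {"a": 5, "b": 3, "c": 2, "d": 2, "e": 1, "f": 1, "g": 1, "h": 1}
print([len(g) for g in split_users(long_tail, [25, 50, 75])])  # [2, 2, 4, 0]
```

In the second example, switching to the cutoffs [50, 75] gives groups of 2, 2 and 4 users, which is the 25/25/50 division suggested above.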
Tracking Metrics over Time
The second metrify.awk parameter, time, makes it possible not just to generate overall metrics for these different groups of users, but also to track their participation over time. Here, we’re able to specify the time period we’re interested in tracking: the options are “minute”, “hour”, “day”, “month”, or “year”, which should cover all eventualities. What time period is appropriate in each case depends on the nature of the hashtag dataset, of course: for a day-long event like the #royalwedding last year, “minute” may be useful; to understand longer-term developments like #egypt or #libya, “day” or even “month” might be better.
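As a rough illustration of what tracking by time period means, the following Python sketch counts tweets per period by truncating each timestamp to the chosen granularity. It assumes tweet times are available as Unix epoch seconds and interprets them in UTC; both are assumptions for this example, and the names PERIOD_FORMATS and tweets_per_period are made up rather than taken from metrify.awk.

```python
from collections import Counter
from datetime import datetime, timezone

# One truncation pattern per supported time period.
PERIOD_FORMATS = {
    "minute": "%Y-%m-%d %H:%M",
    "hour": "%Y-%m-%d %H",
    "day": "%Y-%m-%d",
    "month": "%Y-%m",
    "year": "%Y",
}


def tweets_per_period(epoch_seconds, period="day"):
    """Count tweets per period; timestamps are read as UTC (an assumption)."""
    fmt = PERIOD_FORMATS[period]
    counts = Counter()
    for ts in epoch_seconds:
        counts[datetime.fromtimestamp(ts, tz=timezone.utc).strftime(fmt)] += 1
    return dict(counts)


# Four made-up tweet times on the morning of 29 April 2011.
sample = [
    datetime(2011, 4, 29, 10, 0, 5, tzinfo=timezone.utc).timestamp(),
    datetime(2011, 4, 29, 10, 0, 40, tzinfo=timezone.utc).timestamp(),
    datetime(2011, 4, 29, 10, 1, 10, tzinfo=timezone.utc).timestamp(),
    datetime(2011, 4, 29, 11, 30, 0, tzinfo=timezone.utc).timestamp(),
]
print(tweets_per_period(sample, "minute"))
print(tweets_per_period(sample, "hour"))
```

With “minute”, the first two tweets fall into the same bucket and the other two get one each; with “hour”, the first three are counted together.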
Note: metrify.awk expects the input file to contain tweets in chronological order, from the earliest to the latest. Before exporting your tweet archives from yourTwapperkeeper, make sure you select ‘ascending’ as the display order, or use spreadsheet software to reorder the exported data in ascending order by the timestamp field – otherwise, the result may not be what you expected.
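If you’d rather not reorder the file by hand, a short script can do the same job. The Python sketch below sorts comma-separated rows by a numeric timestamp column; it assumes the export has a header row and a column called time holding Unix epoch seconds, and the column names in the sample (from_user, text, time) are likewise assumptions, so check them against your own archive first.

```python
import csv
import io


def sort_rows_by_time(csv_text, time_field="time"):
    """Return the CSV text with its data rows in ascending timestamp order.

    Assumes a header row and a numeric timestamp column named time_field.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = sorted(reader, key=lambda row: int(row[time_field]))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Two made-up tweets exported in descending order.
sample_csv = (
    "from_user,text,time\n"
    "bob,second tweet,1304071300\n"
    "alice,first tweet,1304071200\n"
)
print(sort_rows_by_time(sample_csv))
```

For a real archive, read the file’s contents into a string, pass it through the function, and write the result to a new file before handing it to metrify.awk.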
Extracting Individual User Stats
Finally, in addition to these aggregate stats, metrify.awk also generates per-user metrics, which I’ll discuss in more detail below. With large datasets, this is the most time-intensive aspect of what metrify does, though – and if you’re only interested in the overall metrics, and not in the details of how individual users fared, generating these additional statistics is overkill. So, metrify includes a command-line switch which turns off metrics generation for individual users almost completely (other than simply counting the number of tweets they’ve contributed to the dataset): skipusers=1. Including this on the command line will considerably speed up processing.
OK, with these preliminaries out of the way, it’s time to take metrify.awk for a spin, and to see what it produces. We’ll do so in the next post in this series…