Taking Twitter Metrics to a New Level (Part 2)

Update: I’ve clarified/corrected some of the details relating to the percentile metrics contained in the first table which metrify.awk generates.

Update 2: revision 1.2 of metrify.awk adds further functionality in addition to what is described below. These changes are detailed here.

In the previous post, I introduced metrify.awk, our new multi-purpose tool for generating Twitter metrics. Over the next instalments in this series of posts, I’ll take you through the results it produces. And seeing as we’re coming up to the anniversary of the January 2011 south-east Queensland floods, and as I needed to generate those metrics anyway for a report on social media in the floods which we’re publishing soon, I’ll be using an archive of #qldfloods tweets between 10 and 17 January 2011 as an example here.

I’m running metrify.awk as follows for this:

gawk -F , -f metrify.awk divisions=90,99 time=day qldfloods.csv >qldfloods-metrics.csv

In other words, we’re using a 1/9/90 division of users, and we’re tracking activities per day; the skipusers switch is not set, so full stats for all users will be generated.

Metrics over Time

The output file from this, qldfloods-metrics.csv, contains three separate data tables in the same spreadsheet, which I’m now loading into Excel. The first of these contains the following information:

day (in my case, otherwise minute, hour, month, year): each time period covered by the dataset
tweets: total number of tweets for that period
users: total number of unique users posting tweets for that period
various stats on tweets of these different types: these are provided as stats per user, as total numbers, and as percentages of the total number of tweets for each period
- original tweets: tweets which are neither @reply nor retweet
- retweets: manual retweets which contain any of RT @user… / “@user… / MT @user / via @user
- unedited retweets: manual retweets which start with any of RT @user… / “@user… / MT @user / via @user
- edited retweets: manual retweets which contain, but don’t start with any of RT @user… / “@user… / MT @user / via @user
- genuine @replies: tweets which contain @user, but are not retweets
- URLs: tweets which contain URLs
stats for the various percentiles of users: in my example, following the 1/9/90 division
- lowest 90% users (<= a tweets) as a percentage of the total number of users
- users > 90% (> b tweets; x of n users) as a percentage of the total number of users
- users > 99% (> c tweets; y of n users) as a percentage of the total number of users
- (further stats for those user percentiles were introduced in metrify 1.2 – details are here)

Some more side notes are required here: first, as you already know, Twapperkeeper / yourTwapperkeeper does not capture ‘button’ retweets – so all we can examine in the retweet department are ‘manual’ retweets. We count tweets as retweets if they follow any of the four formats listed above (RT = retweet, “@user = quoted tweet, MT = modified tweet, via @user); between them, these formats capture the overwhelming majority of retweets, but some very unusual retweeting formats will slip through the cracks. We also distinguish between edited and unedited retweets simply by checking whether or not the tweet in question starts with one of these retweet indicators; that’s the only reliable way of checking without entering vastly more complicated territory. This check will miss retweets where the retweeting user added comments at the end of the retweet; these will be (incorrectly) counted as unedited retweets.
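
To make these rules concrete, here is a minimal Python sketch of the classification as described above. It is not the code in metrify.awk: the regular expressions, the function name classify and the handling of straight versus curly quotes are all my own assumptions for illustration.

```python
import re

# Assumed patterns for the four manual-retweet markers named in the text.
# Accepting both curly and straight opening quotes is an assumption.
RT_MARKER = re.compile(r'(?:\bRT\s+@\w+|\bMT\s+@\w+|\bvia\s+@\w+|[“"]@\w+)',
                       re.IGNORECASE)
MENTION = re.compile(r'@\w+')
URL = re.compile(r'https?://\S+', re.IGNORECASE)


def classify(text):
    """Return (tweet_type, has_url) for a single tweet text."""
    stripped = text.lstrip()
    match = RT_MARKER.search(stripped)
    if match:
        # A retweet counts as unedited only if it starts with the marker.
        kind = "unedited retweet" if match.start() == 0 else "edited retweet"
    elif MENTION.search(stripped):
        kind = "genuine @reply"
    else:
        kind = "original tweet"
    # URLs are tracked independently of the tweet type.
    return kind, bool(URL.search(stripped))


if __name__ == "__main__":
    for tweet in [
        "RT @abc: river rising",
        "Stay safe! RT @abc: river rising http://t.co/x",
        "@abc thanks for the update",
        "Roads closed in Ipswich #qldfloods",
        "RT @abc: river rising - stay safe!",  # comment at the end
    ]:
        print(classify(tweet), tweet)
```

The last example shows the limitation mentioned above: a retweet with a comment added at the end still starts with the marker, so it is classified as unedited.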

These different tweet types will always add up to the total:

edited retweets + unedited retweets = retweets
original tweets + genuine @replies + retweets = total number of tweets

and

% edited retweets + % unedited retweets = % retweets
% original tweets + % genuine @replies + % retweets = 100%

and

edited retweets:user + unedited retweets:user = retweets:user
original tweets:user + genuine @replies:user + retweets:user = total tweets:user ratio

(The odd ones left out from this are the stats on URLs, since URLs may be contained in original tweets as much as in @replies or retweets.)
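
As a rough illustration of how these identities fall out of the counting, the following Python sketch tallies tweet types per period and derives the totals, percentages and per-user ratios. The input layout (period, user, tweet type) and all names are hypothetical; this is not how metrify.awk itself is written.

```python
from collections import defaultdict

TYPES = ("original", "reply", "unedited_rt", "edited_rt")


def period_metrics(records):
    """Tally tweet types per period.

    records: iterable of (period, user, tweet_type) tuples, where
    tweet_type is one of TYPES (a hypothetical input layout).
    """
    counts = defaultdict(lambda: dict.fromkeys(TYPES, 0))
    users = defaultdict(set)
    for period, user, tweet_type in records:
        counts[period][tweet_type] += 1
        users[period].add(user)

    metrics = {}
    for period, c in counts.items():
        total = sum(c.values())
        n_users = len(users[period])
        row = dict(c, retweets=c["unedited_rt"] + c["edited_rt"],
                   tweets=total, users=n_users)
        # Percentages of the period's tweets, and tweets-per-user ratios.
        row["pct"] = {k: 100 * row[k] / total for k in TYPES + ("retweets",)}
        row["per_user"] = {k: row[k] / n_users
                           for k in TYPES + ("retweets", "tweets")}
        metrics[period] = row
    return metrics


if __name__ == "__main__":
    sample = [
        ("2011-01-11", "a", "original"),
        ("2011-01-11", "b", "unedited_rt"),
        ("2011-01-11", "b", "edited_rt"),
        ("2011-01-11", "c", "reply"),
    ]
    day = period_metrics(sample)["2011-01-11"]
    print(day["tweets"], day["users"], day["retweets"], day["pct"])
```

Because retweets are defined as the sum of the edited and unedited counts, and every tweet receives exactly one of the four types, the three sets of identities hold by construction.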

Second, the stats for our three (in my case) user percentiles make their first appearance here. In my example, the following three column headings appear in the table:

lowest 90% users (<= 4 tweets)
users > 90% (> 4 tweets; 1670 of 15581 users)
users > 99% (> 18 tweets; 177 of 15581 users)
(further stats for those user percentiles were introduced in metrify 1.2 – details are here)

This already provides us with some information about how the percentiles ended up being defined in this case (more detailed information appears in the second table generated by metrify.awk – more on that later). First, the activity cutoffs: the least active 90% of users were defined as users who contributed 4 tweets or fewer to the total dataset; the middle group contributed more than 4 and up to 18 tweets; the most active 1% of users contributed more than 18 tweets over the entire duration covered by the dataset.

Additionally, we also see the numbers of users included in each group: 177 users posted more than 18 tweets; another 1670 users posted more than 4 and up to 18 tweets, and the rest (15581 – 1670 – 177 = 13734) posted 4 tweets or fewer. This also exemplifies the slight size creep which I’ve mentioned before: the 177 users in the top group are actually 1.14% of all users (rather than 1%), the 1670 in the next lot are 10.72% (rather than 9%). If the creep gets too big for your liking, you could adjust the division cutoffs slightly (I could have used divisions=91,99 as a parameter to try to make the middle group smaller, for example).
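
The arithmetic behind these shares is easy to check. This short Python sketch uses only the user counts quoted above; the function name and the group labels are mine.

```python
def group_shares(sizes):
    """Return each group's share of all users, in percent (2 decimals)."""
    total = sum(sizes.values())
    return {name: round(100 * n / total, 2) for name, n in sizes.items()}


total_users = 15581
sizes = {"top (> 18 tweets)": 177, "middle (> 4, <= 18 tweets)": 1670}
# Everyone not in the two upper groups posted 4 tweets or fewer.
sizes["lowest (<= 4 tweets)"] = total_users - sum(sizes.values())

for name, share in group_shares(sizes).items():
    print(f"{name}: {sizes[name]} users, {share}%")
```

With these figures, the lowest group works out at 13734 users, or about 88.15% of all users rather than a nominal 90%.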

At any rate, what the data in these columns track is the percentage of the total number of unique users during each period which belong to each of the percentile groups – in other words, the extent to which any of these groups dominate the hashtag feed at any one point. Note that which users get to be in which percentile is determined once, for the entire dataset, rather than on a per-time period basis: what these columns indicate, therefore, is how present the overall lead (and other) user groups are in each time period, rather than how much a changing current group of most active users have contributed in each time period.

(Again, please note that further stats for those user percentiles were introduced in metrify 1.2 – details are here.)
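
The distinction between fixing the groups once for the whole dataset and then measuring their presence per period can be sketched as follows in Python. This is an illustration under my own assumptions (input as a list of period/user pairs, cutoffs passed in directly), not a description of metrify.awk’s internals.

```python
from collections import Counter, defaultdict

GROUPS = ("lowest", "middle", "top")


def assign_groups(tweets, cutoffs=(4, 18)):
    """Assign each user to a group once, from totals over the whole dataset.

    tweets: list of (period, user) pairs, one per tweet (assumed layout).
    cutoffs: (a, b) so that lowest <= a < middle <= b < top.
    """
    per_user = Counter(user for _, user in tweets)
    groups = {}
    for user, n in per_user.items():
        if n > cutoffs[1]:
            groups[user] = "top"
        elif n > cutoffs[0]:
            groups[user] = "middle"
        else:
            groups[user] = "lowest"
    return groups


def group_presence(tweets, cutoffs=(4, 18)):
    """Per period: percentage of that period's unique users in each group."""
    groups = assign_groups(tweets, cutoffs)
    users_by_period = defaultdict(set)
    for period, user in tweets:
        users_by_period[period].add(user)
    result = {}
    for period, users in users_by_period.items():
        tally = Counter(groups[u] for u in users)
        result[period] = {g: 100 * tally[g] / len(users) for g in GROUPS}
    return result


if __name__ == "__main__":
    # Tiny made-up dataset with small cutoffs so every group is populated.
    demo = [("day1", "a"), ("day1", "a"), ("day2", "a"),
            ("day1", "b"), ("day1", "b"),
            ("day1", "c"), ("day2", "d")]
    print(group_presence(demo, cutoffs=(1, 2)))
```

Note that a user’s group depends on their total activity across all periods, so a user can count towards the top group on a day when they posted only a single tweet.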

Some Results

Time for some first results from this table, then. What these data allow us to do is already quite useful, and I’ll only provide a handful of examples here; you can experiment further on your own. Using my #qldfloods data, and selecting just this first table of metrics from the metrify.awk output, I’ll create a pivot table in Excel, which enables me to plot various metrics over time, for example:

This first graph simply shows that the number of unique participating users, and the volume of tweets posted under the hashtag #qldfloods, move together over time; for most hashtags, that’s what you’d expect to see, I think.

Next, we see how different types of tweets contribute to the overall volume of tweets. Retweets (which I haven’t divided into edited and unedited retweets here) are quite prominent at the start of the crisis, as everyone is looking to share what little information is already available, and gradually drop off towards the end, as more information becomes available and retweeting isn’t as important any more. (There’s a big tick up on the last day, but the overall volume of tweets is very low then, so this may be an outlier.) @replies, on the other hand, gradually rise, perhaps because there’s a shift from simply sharing news and information to discussing how best to organise the recovery effort. URLs also rise gradually – possibly a sign of more and better information becoming available.

Finally, a look at our user percentiles: what we see here is that the ‘lead’ users aren’t actually that prominent, especially during the busiest days for the hashtag (11-13 January): on those days, even the top two user percentiles combined don’t account for more than 20% of all unique users. This shouldn’t be misunderstood to mean that these top users were being drowned out by the hoi polloi, though: rather – given what we’ve already found out about retweeting rates in the previous graph – much of what the least active 90% of users were doing during these days was to retweet the messages of those lead users. (From all we’ve seen so far, this is a pattern common to crisis-related hashtags; it may be very different for a non-crisis case.)

We’ll see more evidence of this, in fact, when we turn to the next metrics table produced by metrify.awk – in the next post in this series…

Published by Snurb
