{"id":2993,"date":"2015-03-31T17:57:44","date_gmt":"2015-03-31T07:57:44","guid":{"rendered":"http:\/\/mappingonlinepublics.net\/?p=2993"},"modified":"2015-03-31T18:06:10","modified_gmt":"2015-03-31T08:06:10","slug":"metrics-for-analysing-twitter-communities-using-tcat-and-tableau","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2015\/03\/31\/metrics-for-analysing-twitter-communities-using-tcat-and-tableau\/","title":{"rendered":"Metrics for Analysing Twitter Communities, Using TCAT and Tableau"},"content":{"rendered":"<p>This post builds on the new approach to transforming <em>Twitter<\/em> datasets generated by the <em>TCAT<\/em> tracking tool for analysis in Tableau which I\u2019ve introduced in my <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/\">recent<\/a> <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-2\/\">posts<\/a>. Often, we will be interested in exploring the structure of <em>Twitter<\/em> communities as they form around given hashtags or keywords \u2013 for instance to examine whether they really act as <em>communities<\/em> in a narrow sense, or are rather merely groups or publics who are in some way connected to the hashtag, but barely aware of each other\u2019s presence.<\/p>\n<p>In the past, we\u2019ve used one of our Gawk scripts, metrify.awk, to generate a range of metrics which provided detailed information on the dynamics of a dataset over time, across individual users, and across different groups of accounts as defined by their level of activity; I explained that process in a multi-part post in 2012 (<a href=\"http:\/\/mappingonlinepublics.net\/2012\/01\/02\/taking-twitter-metrics-to-a-new-level-part-1\/\">1<\/a>, <a href=\"http:\/\/mappingonlinepublics.net\/2012\/01\/02\/taking-twitter-metrics-to-a-new-level-part-2\/\">2<\/a>, <a 
href=\"http:\/\/mappingonlinepublics.net\/2012\/01\/02\/taking-twitter-metrics-to-a-new-level-part-3\/\">3<\/a>, <a href=\"http:\/\/mappingonlinepublics.net\/2012\/01\/02\/taking-twitter-metrics-to-a-new-level-part-4\/\">4<\/a>, and <a href=\"http:\/\/mappingonlinepublics.net\/2012\/01\/31\/more-twitter-metrics-metrify-revisited\/\">follow-up<\/a>). With the move from <em>yourTwapperkeeper<\/em> and Excel to <em>TCAT<\/em> and Tableau, most of this analysis can now be done directly within Tableau itself, directly from the source <em>TCAT<\/em> dataset and the additional helper datasets which our <em>TCAT-Process<\/em> scripts generate. What\u2019s still missing from the mix is a method for exploring the contribution of the different groups of accounts, though \u2013 this post outlines the steps for generating these metrics from within Tableau itself.<\/p>\n<h3>Introducing Percentile Groups<\/h3>\n<p>It\u2019s well established that the distribution of activity levels across a given group of social media accounts will often follow a \u2018long tail\u2019 distribution: a very small number of accounts are very heavy contributors to a hashtag or a discussion, while a large number of others are contributing only very occasionally. The exact balance between these groups, and the exact nature of their respective contributions, can tell us a great deal about the dynamics of the overall <em>Twitter<\/em> public gathered around the shared hashtag or theme \u2013 the lead users may contribute in different ways from the least active users, for example by including more URLs in their tweets, or by taking a more discursive approach that features more @replies than retweets. 
We\u2019ve used such observations very effectively in the past to <a href=\"http:\/\/www.tandfonline.com\/doi\/full\/10.1080\/15228835.2012.744249#.VRoxjfmUdZ0\">distinguish between different types of hashtag events<\/a>, and to pinpoint useful areas for further close reading of tweets.<\/p>\n<p>What\u2019s often used in this context is a 1\/9\/90 division between participants: ordered by their number of contributions to the conversation, the top 1% of accounts are identified as lead users; the next 9% as highly active users; and the remaining 90% as least active users. Other divisions are also possible, of course; what is most appropriate will depend on the specific dataset at hand, and on the research questions asked of it. For the purposes of this post, we\u2019ll continue with the 1\/9\/90 division of accounts into three percentile groups.<\/p>\n<p>Happily, it is fairly straightforward to create these percentile groups in Tableau. In the following discussion, I\u2019m building on the processes outlined in my <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/\">previous<\/a> <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-2\/\">posts<\/a>: so, we\u2019ve already downloaded a full dataset export from <em>TCAT<\/em> (in my example, tweets about the attempted party leadership challenge to Australian Prime Minister Tony Abbott, using the #libspill hashtag and a number of related hashtags and keywords), and we\u2019ve processed this dataset using the <em>TCAT-Process<\/em> scripts package I\u2019ve made available <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/\">here<\/a>. 
We\u2019ve also loaded and combined the resulting datasets in Tableau.<\/p>\n<p>Now, the first new step is to create a new calculated field called \u2018Percentile Ranking\u2019 in Tableau, using the following formula:<\/p>\n<blockquote><p>RANK_PERCENTILE(COUNTD([Id]))<\/p><\/blockquote>\n<p>As usual, we are using COUNTD(Id) as the most reliable count of unique tweets; in a given list of items in Tableau, the RANK_PERCENTILE() formula then uses this count of tweets to calculate which percentile in the list a specific item occupies. The result will be a value between 0 (lowest percentile, 0%) and 1 (highest percentile, 100%).<\/p>\n<p>In Tableau, we can now graph CNTD(Id) against From User Name, order the list by CNTD(Id), and add Percentile Ranking as a label; this generates an ordered list of participant accounts and shows their percentile ranking:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image17.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb17.png\" alt=\"image\" width=\"644\" height=\"273\" border=\"0\" \/><\/a><\/p>\n<p>By this ranking, accounts with a Percentile Ranking of at least 0.99 are in the top 1% of lead users; accounts with a ranking between 0.9 and 0.99 are in the next 9% of highly active users; and the remaining accounts, with a ranking below 0.9, are in the bottom 90% of least active users.<\/p>\n<p>However, in our further analysis we cannot use the Percentile Ranking field directly, as it is recalculated whenever the fields graphed against each other change; we therefore have to persistently allocate accounts to the three groups we\u2019ve defined. 
This is where Tableau gets uncharacteristically cumbersome for a moment:<\/p>\n<ol>\n<li>First, add Percentile Ranking to Filters, and filter for accounts with a ranking of at least 0.99:<br \/>\n<a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image18.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb18.png\" alt=\"image\" width=\"374\" height=\"259\" border=\"0\" \/><\/a><\/li>\n<li>Next, click anywhere into the white space in the graph, and press CTRL-A to select all visible rows. Mouse over one of the rows, and click the paperclip icon to create a group:<br \/>\n<a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image19.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb19.png\" alt=\"image\" width=\"369\" height=\"162\" border=\"0\" \/><\/a><\/li>\n<li>The new group will appear in the Dimensions sidebar, as \u201cFrom User Name (group)\u201d \u2013 rename it to something more meaningful, such as \u201cLU\u201d (for Lead Users).<\/li>\n<li>Remove LU from the Color field, where Tableau has placed it automatically.<\/li>\n<li>Change the Percentile Ranking filter to a range of values between 0.9 and 0.99, and repeat the previous steps \u2013 call this new group \u201cHA\u201d (for Highly Active).<\/li>\n<li>We now have two separate groups of accounts \u2013 LU and HA \u2013 and at least implicitly also a third group of accounts who belong to neither LU nor HA: these are our 90% of least active users. 
The final step is to combine these individual groups into one unified categorisation scheme which we can use in our further analysis: in the Dimensions sidebar, shift-click on the two groups to select them, right-click on one of them, and select Combine Fields \u2013 this finally creates a new combined field that permanently assigns each user to one of the three groups:<br \/>\n<a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image20.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb20.png\" alt=\"image\" width=\"336\" height=\"320\" border=\"0\" \/><\/a><\/li>\n<li>Finally, let\u2019s rename the new combined field \u201cLU &amp; HA\u201d to \u201cSender Groups\u201d.<\/li>\n<\/ol>\n<p>(The overall process will remain the same for different percentile cutoffs, of course, and is even easier if you make only a simple distinction between a lead user group and the remainder of the userbase \u2013 for a 20\/80 split, for example, simply create one group for accounts ranked above .8. However, finer gradations between multiple subgroups usually generate more useful analysis.)<\/p>\n<h3>Analysing the Groups\u2019 Contributions<\/h3>\n<p>Having created these groupings, we can now begin to use them in our analysis. First, we should determine how many accounts there are in each of our groups, by showing the count of unique user names \u2013 i.e. 
CNTD(From User Name) \u2013 for each of the groups (note that I have also added a Grand Total row by selecting Analysis &gt; Totals &gt; Show Column Grand Totals):<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image21.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb21.png\" alt=\"image\" width=\"320\" height=\"269\" border=\"0\" \/><\/a><\/p>\n<p>By default, Tableau names the groups after the combination of selection criteria that the combined Sender Groups field was constructed from \u2013 but from the membership size of the groups listed above we know that the smallest group (the first row in the image above) must be the 1% of lead users, the second the 9% of highly active users, and the third the remaining 90% of least active users. Right-clicking on each field and selecting \u201cEdit Alias\u2026\u201d allows us to rename these fields to something more user-friendly.<\/p>\n<p>It is also notable that while my dataset contains a total of 134,518 unique user names, the lead user group is made up of 1,347 accounts, and the two top groups together number 14,396 accounts in total \u2013 more than the 1% or (combined) 10% of 134,518 that they should contain. This is not an error, but simply a sign that Tableau does not play favourites: if there are multiple accounts at the boundary between two groups which equally fulfil the requirements for belonging to the higher-ranked group, Tableau will include them all, rather than arbitrarily sending some of them to the lower group in order not to expand the higher percentile group beyond the top 1% or 10%. 
In my sample dataset, for example, the cutoff for belonging to the lead user group was a total of 43 tweets sent, and multiple accounts had reached exactly that number.<\/p>\n<p>Next, we might want to explore the number and types of tweets sent by each group \u2013 so here, I\u2019ve graphed Sender Groups against CNTD(Id), and coloured by Type (again with an added Grand Total column). What becomes evident in my sample is that the 1,347 lead users contributed more tweets than each of the other two groups, and that they were especially active in sending @mentions and retweets:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image22.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb22.png\" alt=\"image\" width=\"441\" height=\"484\" border=\"0\" \/><\/a><\/p>\n<p>Replacing Type with Hashtag as the field determining colour, we can also determine highly divergent hashtagging practices \u2013 the lead users almost always included a hashtag, while almost two thirds of the tweets posted by the least active users did not contain hashtags (note again that the tweet numbers are increased beyond the previous graph here, and the percentages add up to more than 100%, because tweets can contain two or more hashtags). 
Incidentally, I\u2019ve displayed the percentages by adding CNTD(Id) to Label and calculating its value as a percentage of total, using Table (Down) as the calculating method:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image23.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb23.png\" alt=\"image\" width=\"445\" height=\"484\" border=\"0\" \/><\/a><\/p>\n<p>Many further permutations of these analyses are also possible, of course \u2013 we might explore, for example, whether there are differences in the URLs each group are sharing (are they using distinctly different domains as their sources of information?), or whether they are tweeting from notably different devices (as indicated by the Source field).<\/p>\n<p>Further, we can also examine the contributions by these groups over time:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image24.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb24.png\" alt=\"image\" width=\"483\" height=\"341\" border=\"0\" \/><\/a><\/p>\n<p>In my example, this shows that the lead users are responsible for a volume of tweets that usually closely matches that contributed by the (much larger) group of highly active users, while the least active users are less engaged during peak times, but make up for this by maintaining greater levels of activity outside of peak periods. 
This could also indicate that the least active group contains a range of users whose tweets are showing up in our dataset as false positives (e.g. because they use the term \u2018spill\u2019 in non-#libspill-related contexts), which could be a good argument for excluding this group from the analysis altogether.<\/p>\n<h3>Using Percentile Groups Elsewhere<\/h3>\n<p>While this post has focussed on defining groups of accounts based on their active contributions to a dataset (i.e. the number of tweets they posted), the same approach can also be used for other distributions where grouping may be useful. For example, we might instead list the accounts being @mentioned (based on the To User field which the <em>TCAT-Process<\/em> scripts generate \u2013 not the unreliable To User Name field which the <em>Twitter<\/em> API itself provides) against the number of tweets mentioning them (via CNTD(Id)), and again calculate their percentile ranking. In fact, we can use the Percentile Ranking field we defined at the start of this post \u2013 it performs a new calculation for any list of items it is being applied to:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image25.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb25.png\" alt=\"image\" width=\"429\" height=\"311\" border=\"0\" \/><\/a><\/p>\n<p>We should exclude \u201cNull\u201d from this list (which collects all the tweets which did not @mention or retweet another user), and can then again define a number of percentile groups following the process outlined above. 
For my sample dataset, this results in a group of 454 \u201cMost Visible\u201d accounts (the 1% of accounts who received the most @mentions and retweets), 4,596 \u201cHighly Visible\u201d accounts (the next 9%), and 40,272 \u201cLeast Visible\u201d accounts (the remaining 90%). Note here, though, that by default Least Visible will contain the \u201cNull\u201d recipient (as Least Visible is simply a collection of all recipients that are not included in the other two groups), so we will need to manually filter out this recipient from all further analysis.<\/p>\n<p>I\u2019ve combined these groups into a Receiver Groups field, which we can now also use for some interesting analysis. First, for example, as expected the most visible accounts command the majority of @mentions and retweets:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image26.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb26.png\" alt=\"image\" width=\"476\" height=\"446\" border=\"0\" \/><\/a><\/p>\n<p>Second, they are also especially popular with the most <em>active<\/em> participants in the discussion. 
Note especially how the least visible accounts are mainly mentioned by the least active participants \u2013 it seems that there are several separate discussion circles here:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image27.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb27.png\" alt=\"image\" width=\"470\" height=\"449\" border=\"0\" \/><\/a><\/p>\n<p>And finally, turning the interactions between the various sender and receiver groups into a matrix and adding some further Tableau functionality into the mix, here\u2019s a nice graph to end on. This shows the volume of activity from each sender to each receiver group, and breaks it down into @mentions and retweets:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image28.png\"><img decoding=\"async\" loading=\"lazy\" style=\"background-image: none; margin: 0px; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb28.png\" alt=\"image\" width=\"644\" height=\"436\" border=\"0\" \/><\/a><\/p>\n<p>Again, similar rankings can of course also be created for many of the other fields in our dataset \u2013 for example for the most frequently shared URLs (at the domain level, or for each fully qualified URL), the most prominent hashtags, even the most widely used tweeting platforms. 
Given the approaches I\u2019ve outlined here, I hope these will be relatively easy to calculate now.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post builds on the new approach to transforming Twitter datasets generated by the TCAT tracking tool for analysis in Tableau which I\u2019ve introduced in my recent posts. Often, we will be interested in exploring the structure of Twitter communities as they form around given hashtags or keywords \u2013 for instance to examine whether they &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2015\/03\/31\/metrics-for-analysing-twitter-communities-using-tcat-and-tableau\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Metrics for Analysing Twitter Communities, Using TCAT and Tableau&#8221;<\/span><\/a><\/p>\n<p><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt 
--><\/p>\n","protected":false},"author":2,"featured_media":2991,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","footnotes":""},"categories":[176,8,177],"tags":[286,7,297,284,291],"class_list":["post-2993","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-processing","category-twitter","category-visualisation","tag-data-processing","tag-gawk","tag-methods","tag-tableau","tag-tcat","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2993","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=2993"}],"version-history":[{"count":3,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2993\/revisions"}],"predecessor-version":[{"id":2997,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2993\/revisions\/2997"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media\/2991"}],"wp:attachment":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media?parent=2993"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=2993"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=2993"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}