{"id":2945,"date":"2015-03-02T09:05:52","date_gmt":"2015-03-01T23:05:52","guid":{"rendered":"http:\/\/mappingonlinepublics.net\/?p=2945"},"modified":"2015-03-02T09:23:11","modified_gmt":"2015-03-01T23:23:11","slug":"using-gawk-to-prepare-tcat-data-for-tableau-part-2","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-2\/","title":{"rendered":"Using Gawk to Prepare TCAT Data for Tableau, Part 2"},"content":{"rendered":"<p>In <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/\">my previous post<\/a>, I introduced a new set of Gawk scripts to extract a range of additional information from standard <em><a href=\"https:\/\/github.com\/digitalmethodsinitiative\/dmi-tcat\/wiki\">TCAT<\/a><\/em> datafiles, in order to enable their use for data exploration, analysis, and visualisation in <a href=\"http:\/\/www.tableau.com\/\">Tableau<\/a>. After running <a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/02\/tcat-process.zip\">the TCAT-Process scripts<\/a>, we now have the following datafiles:<\/p>\n<ul>\n<li><em>datafile<\/em>.csv \u2013 the original <em>TCAT<\/em> dataset (a full export of all tweets and metadata)<\/li>\n<li><em>datafile<\/em>-tweettypes.tab \u2013 a helper table which for each tweet lists the users mentioned, and the type of mention (\u2018@mention\u2019 or \u2018retweet\u2019) made for each user; for tweets not containing any mentions of other users, the type is set to \u2018original\u2019.<\/li>\n<li><em>datafile<\/em>-hashtags.tab \u2013 a helper table which for each tweet lists the hashtags it contains.<\/li>\n<li><em>datafile<\/em>-urlpaths.tab \u2013 a helper table which for each tweet lists the short URLs and resolved destination URLs it contains, as well as the domain and domain and first path component of the resolved URL.<\/li>\n<\/ul>\n<p>In combination, these datafiles enable a very wide variety of <em>Twitter<\/em> analytics in Tableau \u2013 replacing many of the processing steps that previously required a number of additional Gawk scripts, in fact.<\/p>\n<p>In this post, I\u2019m using a <em>TCAT<\/em> dataset on <a href=\"http:\/\/www.smh.com.au\/federal-politics\/the-pulse-live\/politics-live-tony-abbott-faces-liberal-leadership-spill-threat-20150209-139a1q.html\">the recent attempted leadership spill in the Australian federal Liberal Party<\/a> to demonstrate how to use these datafiles for <em>Twitter<\/em> analysis.<\/p>\n<h3>Linking the TCAT-Process Outputs in Tableau<\/h3>\n<p>The first step in using these datafiles is to relate them to each other in Tableau. So, after opening Tableau, I\u2019m first connecting to the original <em>Tableau<\/em> dataset. Second, I\u2019m dragging the <em>tweettypes.tab<\/em> datafile which the TCAT-Process script has created next to my main dataset. By clicking on the circles between them, I can now choose a Left Join, and Tableau already correctly suggests joining them on the ID field:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb.png\" alt=\"image\" width=\"644\" height=\"236\" border=\"0\" \/><\/a><\/p>\n<p>This means that for every tweet ID listed in the main <em>TCAT<\/em> dataset, Tableau now also reads in the datapoints from the <em>tweettypes.tab<\/em> datafile that relate to the same tweet ID. Note that \u2013 because tweets can mention multiple other users at the same time, and can thus be both @mentions and retweets at the same time \u2013 this means that the same tweet in the main dataset can now have multiple corresponding datapoints from the <em>tweettypes.tab<\/em> file.<\/p>\n<p>We\u2019re repeating the same process with the <em>hashtags.tab<\/em> and <em>urlpaths.tab<\/em> datafiles. Again, we\u2019re using a Left Join, and again this means that multiple datapoints from these datasets may relate to the same tweet (if it contains multiple hashtags or URLs). By clicking on the gears icon that appears when mousing over the datafile names, a number of other useful parameters can also be set:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image1.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb1.png\" alt=\"image\" width=\"644\" height=\"261\" border=\"0\" \/><\/a><\/p>\n<p>Most importantly, the character set should be set to UTF-8, and (to be on the safe side) the text qualifier should be set to None \u2013 but make sure it remains set to double quotes (&#8220;) for the main <em>TCAT<\/em> dataset.<\/p>\n<p>Once all of these parameters are set, we can select the Connection method (make sure you use Extract for larger datasets), and click on Go to Worksheet.<\/p>\n<h3>Working with TCAT Data in Tableau<\/h3>\n<p>Before we begin the analysis, some housekeeping and setup will be useful. Tableau tends to place the tweet ID fields in its list of Dimensions, but we\u2019ll be wanting to use a count of unique IDs in many of our analyses, so drag the ID field from each of the four data tables into the Measures list, and select Count (Distinct) as the default aggregation:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image2.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb2.png\" alt=\"image\" width=\"644\" height=\"160\" border=\"0\" \/><\/a><\/p>\n<p>Further, for some datasets there may also be a need to adjust the timestamp of tweets in order to account for different timezones. To do so, right-click on the Created At field, create a calculated field, and use Tableau\u2019s DATEADD function to adjust the time. My #libspill data is using AEST, for example, but as Canberra (where most of the spill action took place) is on daylight savings time and thus one hour ahead of AEST, I\u2019m creating a new AEDT time field which adds one hour to AEST:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image3.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb3.png\" alt=\"image\" width=\"284\" height=\"122\" border=\"0\" \/><\/a><\/p>\n<p>By dragging AEDT into Columns, and ID into Rows (where it will automatically become a count of unique IDs \u2013 CNTD(ID)) we can now chart the overall volume of tweets. Note that we\u2019re using CNTD(ID) rather than the non-unique CNT(ID) or simply Number of Records because by joining our helper tables with the main dataset we now have more individual records than unique tweet IDs: each second or third user mention, URL, or hashtag in a tweet adds to the total count of records, so that \u2013 in my example \u2013 the 435,000 unique tweets contained in the #libspill datasets have become almost 980,000 records in the dataset.<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image4.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb4.png\" alt=\"image\" width=\"644\" height=\"445\" border=\"0\" \/><\/a><\/p>\n<p>Such basic volumetrics were already possible by using only the original <em>TCAT<\/em> dataset, but what comes next is possible only because of our data preprocessing using the TCAT-Process scripts. By dragging Type or Hashtag onto the colour, for example, we\u2019re able to see what types of tweets or which hashtags were prominent at any one point (click to enlarge):<\/p>\n<p align=\"center\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image5.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb5.png\" alt=\"image\" width=\"304\" height=\"208\" border=\"0\" \/><\/a><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image6.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb6.png\" alt=\"image\" width=\"304\" height=\"209\" border=\"0\" \/><\/a><\/p>\n<p>Similarly, the preprocessing has extracted data on the major recipients of @mentions, and the major sources of URLs being shared. Here we can use the To User and Domain fields for colour coding, and for greater clarity I\u2019ve limited the graphs to the top 10 in each case, and filtered out any tweets which didn\u2019t mention another user or shared a URL, respectively:<\/p>\n<p align=\"center\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image7.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb7.png\" alt=\"image\" width=\"304\" height=\"209\" border=\"0\" \/><\/a><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image8.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb8.png\" alt=\"image\" width=\"304\" height=\"208\" border=\"0\" \/><\/a><\/p>\n<p>Such graphs can also be shown as a percentage of the total for each point in time. Note that in many cases the total will add up to more than 100%, as in my example below: my dataset was gathered largely based on the #libspill and #spill hashtags, but many of the tweets contained more than one hashtag.<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image9.png\"><img decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb9.png\" alt=\"image\" width=\"838\" border=\"0\" \/><\/a><\/p>\n<h3>Beyond Volumetrics<\/h3>\n<p>Many other combinations of datapoints are also possible \u2013 too many to step through here. However, some key analytical approaches beyond basic volumetrics are particularly obvious: for example, graphing From User Name or To User against CNTD(ID) and colouring by Type, we can see not only how many public tweets individual users sent or received, but also whether these were predominantly original tweets, @replies, or retweets, or a combination of them all \u2013 and by showing the two graphs alongside each other in a single Tableau dashboard, we can also create a useful comparative graph. Here, it\u2019s important to use our new field To User, not <em>TCAT<\/em>\u2019s and the <em>Twitter<\/em> API\u2019s To User Name, which is only very rarely set to any actually useful value.<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image10.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb10.png\" alt=\"image\" width=\"644\" height=\"392\" border=\"0\" \/><\/a><\/p>\n<p>(Note that Tableau\u2019s sort function seems to have some trouble sorting by CNTD(ID) \u2013 you may need to make some manual adjustments\u2026)<\/p>\n<p>It may also be useful to explore in which hashtag contexts specific users are @mentioned (for the purposes of illustration, here we\u2019ve removed the generic #libspill, #spill, and #auspol hashtags, and included only the top 10 remaining hashtags):<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image11.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb11.png\" alt=\"image\" width=\"644\" height=\"319\" border=\"0\" \/><\/a><\/p>\n<p>Finally, there\u2019s plenty we can do with the resolved URLs from the tweets, too. For example, here are the most widely shared domains (including the first component of the URL path, to gain a little more insight into what the content may be), and the types of tweets they\u2019ve been shared in. One domain sticks out as being linked to only in original tweets \u2013 a clear indication of spam:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image12.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb12.png\" alt=\"image\" width=\"644\" height=\"440\" border=\"0\" \/><\/a><\/p>\n<p>If we filter for retweets only, we can see which domains were the major information sources during the event. Beyond <em>Twitter<\/em> itself (which mainly shows up because of widely shared images in tweets), we see that the sites offering liveblogs on the day (<em>Sydney Morning Herald<\/em>, <em>The Guardian<\/em>, <em>ABC News<\/em>) are especially prominent:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image13.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb13.png\" alt=\"image\" width=\"644\" height=\"441\" border=\"0\" \/><\/a><\/p>\n<p>And filtering the Domain field for links to <em>Twitter<\/em> content only, we can identify the most widely shared images (disseminated mostly through retweets, as we can see), as well as a set of what we might assume to be spam links (since <em>Twitter<\/em> itself flags them as potentially unsafe, and they were disseminated largely through hundreds of original tweets):<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image14.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: block; float: none; margin-left: auto; margin-right: auto; padding-top: 0px; border: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/image_thumb14.png\" alt=\"image\" width=\"644\" height=\"449\" border=\"0\" \/><\/a><\/p>\n<p>Of course as always it is then possible to drill down further into the data, and identify exactly which images and articles were most salient for participants in the <em>Twitter<\/em> discussion \u2013 in Tableau, that\u2019s as easy as viewing the underlying data for any specific graph or datapoint.<\/p>\n<p>So much for a quick overview of the potential analytical approaches which the <em>TCAT<\/em> \u2013&gt; Gawk \u2013&gt; Tableau process makes possible. Perhaps a future version of <em>TCAT<\/em> might even enable a direct export of the three additional helper tables our approach has created, without a need for the additional processing work in Gawk \u2013 but for the moment, the TCAT-Process scripts package should help to further enhance existing <em>Twitter<\/em> data analytics approaches, I hope.<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>In my previous post, I introduced a new set of Gawk scripts to extract a range of additional information from standard TCAT datafiles, in order to enable their use for data exploration, analysis, and visualisation in Tableau. After running the TCAT-Process scripts, we now have the following datafiles: datafile.csv \u2013 the original TCAT dataset (a &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-2\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Using Gawk to Prepare TCAT Data for Tableau, Part 2&#8221;<\/span><\/a><\/p>\n<p><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":2,"featured_media":2933,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[175,176,8],"tags":[286,7,297,284,291],"class_list":["post-2945","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-capture","category-processing","category-twitter","tag-data-processing","tag-gawk","tag-methods","tag-tableau","tag-tcat","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2945","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=2945"}],"version-history":[{"count":3,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2945\/revisions"}],"predecessor-version":[{"id":2953,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2945\/revisions\/2953"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media\/2933"}],"wp:attachment":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media?parent=2945"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=2945"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=2945"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}