{"id":2912,"date":"2015-03-02T09:00:52","date_gmt":"2015-03-01T23:00:52","guid":{"rendered":"http:\/\/mappingonlinepublics.net\/?p=2912"},"modified":"2015-03-31T12:02:54","modified_gmt":"2015-03-31T02:02:54","slug":"using-gawk-to-prepare-tcat-data-for-tableau-part-1","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/","title":{"rendered":"Using Gawk to Prepare TCAT Data for Tableau, Part 1"},"content":{"rendered":"<p>Much of the research we\u2019ve presented on this site over the years has built on <em>yourTwapperkeeper<\/em>, our trusty old tool for gathering <em>Twitter<\/em> data. But <em>yTK<\/em> isn\u2019t the most modern of platforms any more, provides only a very limited user interface, and gathers only a fraction of the full metadata payload which the <em>Twitter<\/em> API delivers alongside the tweets themselves. More recently, therefore, we\u2019ve increasingly switched our data gathering efforts to the <em><a href=\"https:\/\/github.com\/digitalmethodsinitiative\/dmi-tcat\/wiki\">Twitter Capture and Analysis Toolkit (TCAT)<\/a><\/em>, developed by the Digital Methods Initiative at the University of Amsterdam.<\/p>\n<p><em>TCAT<\/em> has its own limitations \u2013 for one, it only utilises the <em>Twitter<\/em> streaming API, while <em>yTK<\/em> also uses the search API for backfilling tweets in case of server outages; incidentally, a side effect of this is that it\u2019s impossible to use URLs like <em>abc.net.au<\/em> as search terms in <em>TCAT<\/em>, since the streaming API doesn\u2019t match search terms on resolved URLs, while the search API does. (Our ATNIX project depends on this functionality, and thus continues to rely on <em>yTK<\/em>.) But overall, <em>TCAT<\/em> provides a considerably more advanced solution for gathering <em>Twitter<\/em> data these days, and so we\u2019ve had to develop new approaches to working with the datasets it generates.<\/p>\n<p>Additionally, data analytics tools have also moved on. Excel was never particularly well suited to any serious analysis, and in the past we\u2019ve done a great deal of data preprocessing using the command-line tool Gawk before loading the results into Excel for visualisation. These days, however, the software of choice is <a href=\"http:\/\/www.tableau.com\/\">Tableau<\/a>, whose functionality is several generations ahead of what Excel has to offer and which can do on the fly much of the analysis which we\u2019ve had to painstakingly prepare during the preprocessing stage in the past. Tableau is far from inexpensive (although it\u2019s free for current students, and educational pricing is available), but simply has very few credible alternatives at this point.<\/p>\n<h3>Connecting TCAT and Tableau, via Gawk<\/h3>\n<p>What\u2019s still needed, then, is a straightforward process for loading <em>TCAT<\/em> data into Tableau. <em>TCAT<\/em> already has a number of built-in export functions, but these aren\u2019t necessarily well-suited for our purposes, given the data format Tableau prefers. What\u2019s worse, <em>TCAT<\/em>\u2019s full data export format has changed over time, so the structure of its outputs isn\u2019t necessarily very standardised. In this post, then, I\u2019m outlining a standard method for preprocessing <em>TCAT<\/em> data specifically for Tableau, and below I\u2019m also sharing the scripts that implement this method.<\/p>\n<p>We\u2019re building here on the <em>TCAT<\/em> function which exports all tweets in a dataset, subject to the specific parameters set by the user (timeframe, text and user filters, etc.):<\/p>\n<p align=\"center\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/02\/image7.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-left: 0px; padding-right: 0px; display: inline; padding-top: 0px; border-width: 0px;\" title=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/02\/image_thumb7.png\" alt=\"image\" width=\"403\" height=\"86\" border=\"0\" \/><\/a><\/p>\n<p>This export results in a comma-separated values (CSV) file, which can already be imported into Tableau for analysis. However, what Tableau will struggle with is to extract some of the important information which isn\u2019t yet in the metadata: for example, which users are mentioned in a tweet, whether a tweet is a retweet or @reply, which hashtags are present in a tweet, and which URLs are being shared. The problem is that most of these elements could occur multiple times in the same tweet: for example,<\/p>\n<blockquote><p>Hey @friend, have you seen this? #ohno RT @news: #BREAKING: World ends. http:\/\/t.co\/abcdefg<\/p><\/blockquote>\n<p>is both an @reply (or @mention, of @friend) and a retweet (of @news), and contains the two hashtags #ohno and #BREAKING. In a simple CSV which contains only one row of data for each tweet, such multiple metadata points per tweet are very difficult to represent.<\/p>\n<p>This is, of course, what relational databases are useful for, and Tableau provides the functionality to link multiple data sources (even in CSV or TSV form) in much the same way as if they were tables in a database. The easiest way to do this is based on the <em>Twitter<\/em>-generated ID of each tweet, which is unique and stable. Assuming the example above has the ID 1, for example, we could generate three additional tables to document the @mentions, hashtags, and URLs contained in the tweet:<\/p>\n<table class=\"MsoNormalTable\" style=\"mso-yfti-tbllook: 1184; mso-padding-alt: 1.5pt 1.5pt 1.5pt 1.5pt; mso-cellspacing: 0cm;\" border=\"0\" width=\"531\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr style=\"mso-yfti-irow: 0; mso-yfti-firstrow: yes;\">\n<td style=\"background: #c0504d; mso-background-themecolor: accent2; padding: 1.5pt;\" valign=\"top\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">ID<\/span><\/span><\/b><\/p>\n<\/td>\n<td style=\"background: #c0504d; mso-background-themecolor: accent2; padding: 1.5pt;\" valign=\"top\" width=\"115\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">@mention type<\/span><\/span><\/b><\/p>\n<\/td>\n<td style=\"background: #c0504d; mso-background-themecolor: accent2; padding: 1.5pt;\" valign=\"top\" width=\"75\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">target user<\/span><\/span><\/b><\/p>\n<\/td>\n<td style=\"padding: 1.5pt;\" valign=\"top\" width=\"18\"><\/td>\n<td style=\"background: #4f81bd; mso-background-themecolor: accent1; padding: 1.5pt;\" valign=\"top\" width=\"28\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">ID<\/span><\/span><\/b><\/p>\n<\/td>\n<td style=\"background: #4f81bd; mso-background-themecolor: accent1; padding: 1.5pt;\" valign=\"top\" width=\"85\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">hashtag<\/span><\/span><\/b><\/p>\n<\/td>\n<td style=\"padding: 1.5pt;\" valign=\"top\" width=\"18\"><\/td>\n<td style=\"background: #f79646; mso-background-themecolor: accent6; padding: 1.5pt;\" valign=\"top\" width=\"28\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">ID<\/span><\/span><\/b><\/p>\n<\/td>\n<td style=\"background: #f79646; mso-background-themecolor: accent6; padding: 1.5pt;\" valign=\"top\" width=\"132\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">URL<\/span><\/span><\/b><\/p>\n<\/td>\n<\/tr>\n<tr style=\"mso-yfti-irow: 1;\">\n<td style=\"border-bottom: medium none; border-left: #c0504d 1pt solid; border-top: #c0504d 1pt solid; border-right: medium none; mso-border-top-themecolor: accent2; mso-border-left-themecolor: accent2; mso-border-top-alt: solid #c0504d .5pt; mso-border-left-alt: solid #c0504d .5pt; padding: 1.5pt;\" valign=\"top\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">1<\/span><\/p>\n<\/td>\n<td style=\"border-bottom: medium none; border-left: medium none; border-top: #c0504d 1pt solid; border-right: medium none; mso-border-top-themecolor: accent2; mso-border-top-alt: solid #c0504d .5pt; padding: 1.5pt;\" valign=\"top\" width=\"115\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">@reply<\/span><\/p>\n<\/td>\n<td style=\"border-bottom: medium none; border-left: medium none; border-top: #c0504d 1pt solid; border-right: #c0504d 1pt solid; mso-border-top-themecolor: accent2; mso-border-right-themecolor: accent2; mso-border-top-alt: solid #c0504d .5pt; mso-border-right-alt: solid #c0504d .5pt; padding: 1.5pt;\" valign=\"top\" width=\"75\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">friend<\/span><\/p>\n<\/td>\n<td style=\"padding: 1.5pt;\" valign=\"top\" width=\"18\"><\/td>\n<td style=\"mso-border-alt: solid #4f81bd .5pt; mso-border-themecolor: accent1; border: #4f81bd 1pt solid; padding: 1.5pt;\" valign=\"top\" width=\"28\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">1<\/span><\/p>\n<\/td>\n<td style=\"mso-border-alt: solid #4f81bd .5pt; mso-border-themecolor: accent1; border: #4f81bd 1pt solid; padding: 1.5pt;\" valign=\"top\" width=\"85\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">#ohno<\/span><\/p>\n<\/td>\n<td style=\"padding: 1.5pt;\" valign=\"top\" width=\"18\"><\/td>\n<td style=\"border-bottom: #f79646 1pt solid; border-left: #f79646 1pt solid; border-top: #f79646 1pt solid; border-right: medium none; mso-border-themecolor: accent6; mso-border-top-themecolor: accent6; mso-border-left-themecolor: accent6; mso-border-bottom-themecolor: accent6; mso-border-top-alt: solid #f79646 .5pt; mso-border-left-alt: solid #f79646 .5pt; mso-border-bottom-alt: solid #f79646 .5pt; padding: 1.5pt;\" valign=\"top\" width=\"28\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">1<\/span><\/p>\n<\/td>\n<td style=\"border: #f79646 1pt solid; padding: 1.5pt;\" valign=\"top\" width=\"132\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">http:\/\/t.co\/abcdefg<\/span><\/p>\n<\/td>\n<\/tr>\n<tr style=\"mso-yfti-irow: 2; mso-yfti-lastrow: yes;\">\n<td style=\"border-bottom: #c0504d 1pt solid; border-left: #c0504d 1pt solid; border-top: #c0504d 1pt solid; border-right: medium none; mso-border-themecolor: accent2; mso-border-top-themecolor: accent2; mso-border-left-themecolor: accent2; mso-border-bottom-themecolor: accent2; mso-border-top-alt: solid #c0504d .5pt; mso-border-left-alt: solid #c0504d .5pt; mso-border-bottom-alt: solid #c0504d .5pt; padding: 1.5pt;\" valign=\"top\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">1<\/span><\/p>\n<\/td>\n<td style=\"border-bottom: #c0504d 1pt solid; border-left: medium none; border-top: #c0504d 1pt solid; border-right: medium none; mso-border-top-themecolor: accent2; mso-border-bottom-themecolor: accent2; mso-border-top-alt: solid #c0504d .5pt; mso-border-bottom-alt: solid #c0504d .5pt; padding: 1.5pt;\" valign=\"top\" width=\"115\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">retweet<\/span><\/p>\n<\/td>\n<td style=\"border-bottom: #c0504d 1pt solid; border-left: medium none; border-top: #c0504d 1pt solid; border-right: #c0504d 1pt solid; mso-border-themecolor: accent2; mso-border-top-themecolor: accent2; mso-border-right-themecolor: accent2; mso-border-bottom-themecolor: accent2; mso-border-top-alt: solid #c0504d .5pt; mso-border-right-alt: solid #c0504d .5pt; mso-border-bottom-alt: solid #c0504d .5pt; padding: 1.5pt;\" valign=\"top\" width=\"75\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">news<\/span><\/p>\n<\/td>\n<td style=\"padding: 1.5pt;\" valign=\"top\" width=\"18\"><\/td>\n<td style=\"mso-border-alt: solid #4f81bd .5pt; mso-border-themecolor: accent1; border: #4f81bd 1pt solid; padding: 1.5pt;\" valign=\"top\" width=\"28\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">1<\/span><\/p>\n<\/td>\n<td style=\"mso-border-alt: solid #4f81bd .5pt; mso-border-themecolor: accent1; border: #4f81bd 1pt solid; padding: 1.5pt;\" valign=\"top\" width=\"85\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">#BREAKING<\/span><\/p>\n<\/td>\n<td style=\"padding: 1.5pt;\" valign=\"top\" width=\"18\"><\/td>\n<td style=\"padding: 1.5pt;\" valign=\"top\" width=\"28\"><\/td>\n<td style=\"padding: 1.5pt;\" valign=\"top\" width=\"132\"><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Note that the @mentions and hashtags table now each have two entries relating to the same tweet, because it\u2019s both an @reply and a retweet, and contains two different hashtags. Additionally, it would also be useful to resolve the <em>t.co<\/em> link to its eventual destination, and perhaps to specifically extract the URL\u2019s domain name (so later on we can examine which sites are especially important in a dataset):<\/p>\n<table class=\"MsoTableLightListAccent6\" style=\"border-collapse: collapse; mso-yfti-tbllook: 1184; mso-padding-alt: 0cm 5.4pt 0cm 5.4pt; mso-border-alt: solid #f79646 1.0pt; mso-border-themecolor: accent6;\" border=\"1\" width=\"448\" cellspacing=\"0\" cellpadding=\"0\">\n<tbody>\n<tr style=\"mso-yfti-irow: -1; mso-yfti-firstrow: yes;\">\n<td style=\"border-bottom: medium none; border-left: #f79646 1pt solid; background: #f79646; border-top: #f79646 1pt solid; border-right: medium none; mso-background-themecolor: accent6; mso-border-top-themecolor: accent6; mso-border-left-themecolor: accent6; padding: 0cm 5.4pt 0cm 5.4pt;\" valign=\"top\" width=\"27\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt; mso-yfti-cnfc: 5;\"><a name=\"OLE_LINK27\"><\/a><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">ID<\/span><\/span><\/b><\/p>\n<\/td>\n<td style=\"border-bottom: medium none; border-left: medium none; background: #f79646; border-top: #f79646 1pt solid; border-right: medium none; mso-background-themecolor: accent6; mso-border-top-themecolor: accent6; padding: 0cm 5.4pt 0cm 5.4pt;\" valign=\"top\" width=\"140\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt; mso-yfti-cnfc: 1;\"><span style=\"mso-bookmark: ole_link27;\"><b style=\"mso-bidi-font-weight: normal;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">Domain<\/span><\/span><\/b><\/span><\/p>\n<\/td>\n<td style=\"border-bottom: medium none; border-left: medium none; background: #f79646; border-top: #f79646 1pt solid; border-right: medium none; mso-background-themecolor: accent6; mso-border-top-themecolor: accent6; padding: 0cm 5.4pt 0cm 5.4pt;\" valign=\"top\" width=\"140\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt; mso-yfti-cnfc: 1;\"><span style=\"mso-bookmark: ole_link27;\"><b style=\"mso-bidi-font-weight: normal;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">Long URL<\/span><\/span><\/b><\/span><\/p>\n<\/td>\n<td style=\"border-bottom: medium none; border-left: medium none; background: #f79646; border-top: #f79646 1pt solid; border-right: #f79646 1pt solid; mso-background-themecolor: accent6; mso-border-top-themecolor: accent6; mso-border-right-themecolor: accent6; padding: 0cm 5.4pt 0cm 5.4pt;\" valign=\"top\" width=\"140\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt; mso-yfti-cnfc: 1;\"><span style=\"mso-bookmark: ole_link27;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au; mso-themecolor: background1;\"><span style=\"color: #ffffff;\">URL<\/span><\/span><\/b><\/span><\/p>\n<\/td>\n<\/tr>\n<tr style=\"mso-yfti-irow: 0; mso-yfti-lastrow: yes;\">\n<td style=\"border-bottom: #f79646 1pt solid; border-left: #f79646 1pt solid; border-top: #f79646 1pt solid; border-right: medium none; mso-border-themecolor: accent6; padding: 0cm 5.4pt 0cm 5.4pt;\" valign=\"top\" width=\"27\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt; mso-yfti-cnfc: 68;\"><span style=\"mso-bookmark: ole_link27;\"><b><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">1<\/span><\/b><\/span><\/p>\n<\/td>\n<td style=\"border-bottom: #f79646 1pt solid; border-left: medium none; border-top: #f79646 1pt solid; border-right: medium none; mso-border-top-themecolor: accent6; mso-border-bottom-themecolor: accent6; padding: 0cm 5.4pt 0cm 5.4pt;\" valign=\"top\" width=\"140\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt; mso-yfti-cnfc: 64;\"><span style=\"mso-bookmark: ole_link27;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">http:\/\/cnn.com<\/span><\/span><\/p>\n<\/td>\n<td style=\"border-bottom: #f79646 1pt solid; border-left: medium none; border-top: #f79646 1pt solid; border-right: medium none; mso-border-top-themecolor: accent6; mso-border-bottom-themecolor: accent6; padding: 0cm 5.4pt 0cm 5.4pt;\" valign=\"top\" width=\"140\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt; mso-yfti-cnfc: 64;\"><span style=\"mso-bookmark: ole_link27;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">http:\/\/cnn.com\/worldends<\/span><\/span><\/p>\n<\/td>\n<td style=\"border-bottom: #f79646 1pt solid; border-left: medium none; border-top: #f79646 1pt solid; border-right: #f79646 1pt solid; mso-border-themecolor: accent6; padding: 0cm 5.4pt 0cm 5.4pt;\" valign=\"top\" width=\"140\">\n<p class=\"MsoNormal\" style=\"line-height: normal; margin: 0cm 0cm 0pt; mso-yfti-cnfc: 64;\"><span style=\"mso-bookmark: ole_link27;\"><span style=\"mso-ascii-font-family: calibri; mso-fareast-font-family: 'Times New Roman'; mso-hansi-font-family: calibri; mso-bidi-font-family: 'Times New Roman'; mso-fareast-language: en-au;\">http:\/\/t.co\/abcdefg<\/span><\/span><\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>These tables can then be linked with the main <em>TCAT<\/em> datafile, which itself constitutes a large table of tweets and metadata.<\/p>\n<h3>Introducing the TCAT-Process Scripts<\/h3>\n<p><span style=\"color: #ff0000;\"><strong>Update:<\/strong> I&#8217;ve made further adjustments to these scripts to make them easier to use on a Mac. Please re-download the scripts package, and note the updated instructions below.<\/span><\/p>\n<p>All of this is relatively easy to do using our old friend Gawk, a programmable command-line tool which is designed to process and filter CSV\/TSV-format files. Below, I\u2019m sharing a suite of Gawk scripts that are designed to work through the following steps:<\/p>\n<ol>\n<li>Take the <em>TCAT<\/em> datafile, extract tweet IDs and texts.\u00a0\u00a0 \u2013&gt;\u00a0\u00a0 <em>datafile<\/em>-tweettext.tab<\/li>\n<li>Identify @mention types and recipients.\u00a0\u00a0 \u2013&gt;\u00a0\u00a0 <em>datafile<\/em>-tweettypes.tab<\/li>\n<li>Identify hashtags in tweets.\u00a0\u00a0 \u2013&gt;\u00a0\u00a0 <em>datafile<\/em>-hashtags.tab<\/li>\n<li>Extract and resolve URLs in tweets.\u00a0\u00a0 \u2013&gt;\u00a0\u00a0 <em>datafile<\/em>-urlpaths.tab<br \/>\n(This \u2013 optional \u2013 last step generates a number of temporary files for error checking, which can be deleted.)<\/li>\n<\/ol>\n<p>The Gawk scripts automate the entire process \u2013 once everything is installed, a single Gawk call will do all the processing. <a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2015\/03\/tcat-process.zip\">Here are the scripts, as a downloadable ZIP archive<\/a>, and here\u2019s how to install them and set everything up to go:<\/p>\n<ul>\n<li>Install Gawk and cURL.\n<ul>\n<li>Windows: <a href=\"HTTP:\/\/GNUWIN32.SOURCEFORGE.NET\/PACKAGES\/GAWK.HTM\">Gawk<\/a> (complete package \u2013 except sources); <a href=\"HTTP:\/\/CURL.HAXX.SE\/DOWNLOAD.HTML\">cURL<\/a> (Win32 or Win64 \u2013 Generic, with SSL). Add Gawk and cURL are in your Windows command path.<\/li>\n<li>Mac: install <a href=\"HTTP:\/\/WWW.MACPORTS.ORG\/\">Macports<\/a>, then run \u201csudo port install gawk\u201d and \u201csudo port install curl +ssl\u201d in a terminal.<\/li>\n<\/ul>\n<\/li>\n<li>Create a <em>Data<\/em> directory; inside the directory create a directory called <em>_scripts<\/em>, and one or more directories for your datasets.<\/li>\n<li>Copy all of the the TCAT-Process Gawk scripts into the <em>_scripts<\/em> directory.<\/li>\n<li><del>On a Mac, edit the tcat-process.awk script to change the path variable from \u201c..\\\\_scripts\\\\\u201d to \u201c..\/scripts\/\u201d.<\/del><\/li>\n<\/ul>\n<p><em>(Note: I\u2019m not a Mac user, so the exact steps for Mac setup may vary \u2013 let me know if you run into any trouble.)<\/em><\/p>\n<p>Once Gawk and cURL are installed, and the scripts are in place, download a dataset (using the full export function above) from your <em>TCAT<\/em> server, and save it into a dataset directory within your main <em>Data<\/em> directory. I\u2019ll be using a #libspill dataset for my usage example, so I\u2019m saving my <em>TCAT<\/em> file libspill.csv into D:\\Data\\libspill.<\/p>\n<p>To process the <em>TCAT<\/em> dataset, open a command shell in the dataset folder (in my example, D:\\Data\\libspill), and enter the following Gawk command:<\/p>\n<blockquote><p>gawk -f ..\\_scripts\\tcat-process.awk file=<em>datafile<\/em>.csv<\/p><\/blockquote>\n<p><span style=\"color: #ff0000;\">On a Mac, the file paths look slightly different, and you need to include an additional flag,\u00a0<em>mac=1<\/em>:<\/span><\/p>\n<blockquote><p><span style=\"color: #ff0000;\">gawk -f ..\/_scripts\/tcat-process.awk file=<em>datafile<\/em>.csv mac=1<\/span><\/p><\/blockquote>\n<p>Press return when prompted, and wait until processing is complete \u2013 depending on the size of the dataset, this can take some time, mostly because the three passes of resolving the URLs contained in tweets make this a very slow process. However, the script generates the tweet types and hashtags tables first, so it\u2019s possible already to work with the data in Tableau even if the URL resolving process has not yet completed.<\/p>\n<p>Alternatively, to skip the URL processing altogether, add \u201cnourls=1\u201d to the gawk call:<\/p>\n<blockquote><p>gawk -f ..\\_scripts\\tcat-process.awk file=<em>datafile<\/em>.csv nourls=1<\/p><\/blockquote>\n<p><span style=\"color: #ff0000;\">or (for Mac users):<\/span><\/p>\n<blockquote><p><span style=\"color: #ff0000;\">gawk -f ..\/_scripts\/tcat-process.awk file=<em>datafile<\/em>.csv nourls=1 mac=1<\/span><\/p><\/blockquote>\n<p>Ignoring the temporary output files from the URL resolving process, the end result should be four new files for use in Tableau (or three, if you\u2019ve skipped the URL processing), in addition to the original <em>TCAT<\/em> datafile:<\/p>\n<ul>\n<li><em>datafile<\/em>-tweettext.tab (this can also be deleted \u2013 we don\u2019t need it for Tableau)<\/li>\n<li><em>datafile<\/em>-tweettypes.tab<\/li>\n<li><em>datafile<\/em>-hashtags.tab<\/li>\n<li><em>datafile<\/em>-urlpaths.tab (unless nourls=1 was used)<\/li>\n<\/ul>\n<p>I\u2019ll describe how to link these datafiles in Tableau and use them for analysis <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-2\/\">in a separate post<\/a>.<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>Much of the research we\u2019ve presented on this site over the years has built on yourTwapperkeeper, our trusty old tool for gathering Twitter data. But yTK isn\u2019t the most modern of platforms any more, provides only a very limited user interface, and gathers only a fraction of the full metadata payload which the Twitter API &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Using Gawk to Prepare TCAT Data for Tableau, Part 1&#8221;<\/span><\/a><\/p>\n<p><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":2,"featured_media":2918,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[175,176,8],"tags":[286,7,297,284,291],"class_list":["post-2912","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-capture","category-processing","category-twitter","tag-data-processing","tag-gawk","tag-methods","tag-tableau","tag-tcat","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2912","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=2912"}],"version-history":[{"count":11,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2912\/revisions"}],"predecessor-version":[{"id":2965,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2912\/revisions\/2965"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media\/2918"}],"wp:attachment":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media?parent=2912"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=2912"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=2912"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}