Capture Processing Twitter — Snurb, 2 March 2015
Using Gawk to Prepare TCAT Data for Tableau, Part 1

Much of the research we’ve presented on this site over the years has built on yourTwapperkeeper, our trusty old tool for gathering Twitter data. But yTK isn’t the most modern of platforms any more, provides only a very limited user interface, and gathers only a fraction of the full metadata payload which the Twitter API delivers alongside the tweets themselves. More recently, therefore, we’ve increasingly switched our data gathering efforts to the Twitter Capture and Analysis Toolkit (TCAT), developed by the Digital Methods Initiative at the University of Amsterdam.

TCAT has its own limitations – for one, it only utilises the Twitter streaming API, while yTK also uses the search API for backfilling tweets in case of server outages; incidentally, a side effect of this is that it’s impossible to use URLs like abc.net.au as search terms in TCAT, since the streaming API doesn’t match search terms on resolved URLs, while the search API does. (Our ATNIX project depends on this functionality, and thus continues to rely on yTK.) But overall, TCAT provides a considerably more advanced solution for gathering Twitter data these days, and so we’ve had to develop new approaches to working with the datasets it generates.

Additionally, data analytics tools have also moved on. Excel was never particularly well suited to any serious analysis, and in the past we’ve done a great deal of data preprocessing using the command-line tool Gawk before loading the results into Excel for visualisation. These days, however, the software of choice is Tableau, whose functionality is several generations ahead of what Excel has to offer and which can do on the fly much of the analysis which we’ve had to painstakingly prepare during the preprocessing stage in the past. Tableau is far from inexpensive (although it’s free for current students, and educational pricing is available), but simply has very few credible alternatives at this point.

Connecting TCAT and Tableau, via Gawk

What’s still needed, then, is a straightforward process for loading TCAT data into Tableau. TCAT already has a number of built-in export functions, but these aren’t necessarily well-suited for our purposes, given the data format Tableau prefers. What’s worse, TCAT’s full data export format has changed over time, so the structure of its outputs isn’t necessarily very standardised. In this post, then, I’m outlining a standard method for preprocessing TCAT data specifically for Tableau, and below I’m also sharing the scripts that implement this method.

We’re building here on the TCAT function which exports all tweets in a dataset, subject to the specific parameters set by the user (timeframe, text and user filters, etc.):

[Screenshot: TCAT's export interface]

This export results in a comma-separated values (CSV) file, which can already be imported into Tableau for analysis. However, Tableau will struggle to extract some important information which isn’t yet available as explicit metadata fields: for example, which users are mentioned in a tweet, whether a tweet is a retweet or @reply, which hashtags it contains, and which URLs are being shared. The problem is that most of these elements can occur multiple times in the same tweet: for example,

Hey @friend, have you seen this? #ohno RT @news: #BREAKING: World ends. http://t.co/abcdefg

is both an @reply (or @mention, of @friend) and a retweet (of @news), and contains the two hashtags #ohno and #BREAKING. In a simple CSV which contains only one row of data for each tweet, such multiple metadata points per tweet are very difficult to represent.
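As a minimal sketch of that extraction logic (in Python rather than the Gawk the scripts actually use, and with a deliberately simplified classification rule: any user following “RT @” counts as a retweeted user, any other @mention as an @reply/@mention), the example tweet can be decomposed like this:

```python
import re

def classify_tweet(text):
    """Decompose a tweet into typed @mentions, hashtags, and URLs.

    A mention introduced by 'RT @user' is treated as a retweet;
    any other @mention counts as an @reply/@mention.
    """
    retweeted = set(re.findall(r'RT @(\w+)', text))
    mentions = [('retweet' if user in retweeted else '@reply', user)
                for user in re.findall(r'@(\w+)', text)]
    hashtags = re.findall(r'#\w+', text)
    urls = re.findall(r'https?://\S+', text)
    return mentions, hashtags, urls

tweet = ('Hey @friend, have you seen this? #ohno '
         'RT @news: #BREAKING: World ends. http://t.co/abcdefg')
print(classify_tweet(tweet))
```

Run on the example, this yields two typed mentions (an @reply of @friend, a retweet of @news), two hashtags, and one URL – exactly the multiplicity that a one-row-per-tweet CSV cannot represent.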

This is, of course, what relational databases are useful for, and Tableau provides the functionality to link multiple data sources (even in CSV or TSV form) in much the same way as if they were tables in a database. The easiest way to do this is based on the Twitter-generated ID of each tweet, which is unique and stable. Assuming the example above has the ID 1, for example, we could generate three additional tables to document the @mentions, hashtags, and URLs contained in the tweet:

@mentions table:

  ID   @mention type   target user
  1    @reply          friend
  1    retweet         news

hashtags table:

  ID   hashtag
  1    #ohno
  1    #BREAKING

URLs table:

  ID   URL
  1    http://t.co/abcdefg

Note that the @mentions and hashtags tables each now contain two entries relating to the same tweet, because it is both an @reply and a retweet, and contains two different hashtags. Additionally, it would also be useful to resolve the t.co link to its eventual destination, and perhaps to specifically extract the URL’s domain name (so later on we can examine which sites are especially important in a dataset):

  ID   Domain           Long URL                   URL
  1    http://cnn.com   http://cnn.com/worldends   http://t.co/abcdefg
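Once a short URL has been resolved to its destination, the Domain column can be derived from the resolved (long) URL; here is a minimal sketch using Python’s standard urllib.parse (the TCAT-Process scripts themselves do this with Gawk and cURL):

```python
from urllib.parse import urlparse

def domain_of(url):
    """Reduce a resolved URL to its scheme and host, as in the Domain column."""
    parts = urlparse(url)
    return f'{parts.scheme}://{parts.netloc}'

print(domain_of('http://cnn.com/worldends'))  # http://cnn.com
```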

These tables can then be linked with the main TCAT datafile, which itself constitutes a large table of tweets and metadata.
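Conceptually, what Tableau does when linking these files is a join on the tweet ID. A minimal Python sketch of that relationship, using the example rows above:

```python
# Main table: one row of text (and metadata) per tweet, keyed by tweet ID.
tweets = {1: 'Hey @friend, have you seen this? #ohno RT @news: '
             '#BREAKING: World ends. http://t.co/abcdefg'}

# Secondary table: potentially many rows per tweet, sharing the same ID.
hashtags = [(1, '#ohno'), (1, '#BREAKING')]

# Joining on the shared ID attaches each hashtag row to its tweet's text,
# without ever duplicating the tweet row itself in the main table.
joined = [(tweet_id, tag, tweets[tweet_id]) for tweet_id, tag in hashtags]
for tweet_id, tag, text in joined:
    print(tweet_id, tag)
```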

Introducing the TCAT-Process Scripts

Update: I’ve made further adjustments to these scripts to make them easier to use on a Mac. Please re-download the scripts package, and note the updated instructions below.

All of this is relatively easy to do using our old friend Gawk, a programmable command-line tool which is designed to process and filter CSV/TSV-format files. Below, I’m sharing a suite of Gawk scripts that are designed to work through the following steps:

  1. Take the TCAT datafile, extract tweet IDs and texts.   –>   datafile-tweettext.tab
  2. Identify @mention types and recipients.   –>   datafile-tweettypes.tab
  3. Identify hashtags in tweets.   –>   datafile-hashtags.tab
  4. Extract and resolve URLs in tweets.   –>   datafile-urlpaths.tab
    (This – optional – last step generates a number of temporary files for error checking, which can be deleted.)
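Each of these steps writes a tab-separated .tab file. As a minimal sketch of that output format (the column names here are illustrative, not necessarily those the scripts emit), Python’s csv module can produce the same tab-delimited layout:

```python
import csv
import io

# Write a hashtags table in the tab-separated layout of the .tab files:
# a header row, then one row per (tweet ID, hashtag) pair.
rows = [('id', 'hashtag'), (1, '#ohno'), (1, '#BREAKING')]
buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t', lineterminator='\n')
writer.writerows(rows)
print(buf.getvalue())
```

Tab-separated output is a deliberate choice here: tweet texts routinely contain commas, so tabs make the files far easier for downstream tools to parse reliably.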

The Gawk scripts automate the entire process – once everything is installed, a single Gawk call will do all the processing. Here are the scripts, as a downloadable ZIP archive, and here’s how to install them and set everything up to go:

  • Install Gawk and cURL.
  • Windows: Gawk (complete package – except sources); cURL (Win32 or Win64 – Generic, with SSL). Ensure that Gawk and cURL are in your Windows command path.
  • Mac: install MacPorts, then run “sudo port install gawk” and “sudo port install curl +ssl” in a terminal.
  • Create a Data directory; inside the directory create a directory called _scripts, and one or more directories for your datasets.
  • Copy all of the TCAT-Process Gawk scripts into the _scripts directory.
  • On a Mac, edit the tcat-process.awk script to change the path variable from “..\\_scripts\\” to “../_scripts/”.

(Note: I’m not a Mac user, so the exact steps for Mac setup may vary – let me know if you run into any trouble.)

Once Gawk and cURL are installed, and the scripts are in place, download a dataset (using the full export function above) from your TCAT server, and save it into a dataset directory within your main Data directory. I’ll be using a #libspill dataset for my usage example, so I’m saving my TCAT file libspill.csv into D:\Data\libspill.

To process the TCAT dataset, open a command shell in the dataset folder (in my example, D:\Data\libspill), and enter the following Gawk command:

gawk -f ..\_scripts\tcat-process.awk file=datafile.csv

On a Mac, the file paths look slightly different, and you need to include an additional flag, mac=1:

gawk -f ../_scripts/tcat-process.awk file=datafile.csv mac=1

Press return when prompted, and wait until processing is complete – depending on the size of the dataset, this can take some time, mostly because the three passes required to resolve the URLs contained in tweets make this a very slow process. However, the script generates the tweet types and hashtags tables first, so it’s already possible to work with the data in Tableau even if the URL resolving process has not yet completed.

Alternatively, to skip the URL processing altogether, add “nourls=1” to the gawk call:

gawk -f ..\_scripts\tcat-process.awk file=datafile.csv nourls=1

or (for Mac users):

gawk -f ../_scripts/tcat-process.awk file=datafile.csv nourls=1 mac=1

Ignoring the temporary output files from the URL resolving process, the end result should be four new files for use in Tableau (or three, if you’ve skipped the URL processing), in addition to the original TCAT datafile:

  • datafile-tweettext.tab (this can also be deleted – we don’t need it for Tableau)
  • datafile-tweettypes.tab
  • datafile-hashtags.tab
  • datafile-urlpaths.tab (unless nourls=1 was used)

I’ll describe how to link these datafiles in Tableau and use them for analysis in a separate post.

About the Author

Dr Axel Bruns leads the QUT Social Media Research Group. He is an ARC Future Fellow and Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia. Bruns is the author of Blogs, Wikipedia, Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative Online News Production (2005), and a co-editor of Twitter and Society, A Companion to New Media Dynamics and Uses of Blogs (2006). He is a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation. His research Website is at snurb.info, and he tweets as @snurb_dot_info.

Related Articles

Share

(5) Readers' Comments

  1. Thanks for writing this blog post and suggesting new features.

    Note that TCAT can also use the search API, but currently requires one to do so via the command line. We will implement a user interface for this in the coming weeks. This will also allow one to ‘backfill’ query bins and track URLs.

    As for using TCAT with Tableau, you could also simply connect Tableau to your TCAT database and then join the MySQL tables on tweet_id. This way you will not need to export CSV files and transform them with GAWK.

  2. Thanks Erik – great to see the search API functionality coming along. Look forward to seeing it in the Web interface!

    Yes, connecting to MySQL is also possible, but we often need to export and delete live archives from the database in order to keep database size manageable – and especially when collaborating with others, giving people access to the live database is often not an option. It would be better to have some direct export functionality in the TCAT system for the hashtags, mentions, and URLs tables.

    Also, it’s very important to distinguish types of mentions, and the mentions table doesn’t currently do that, I think?

    Axel

  3. Pingback: Using Gawk to Prepare TCAT Data for Tableau, Part 2 | Mapping Online Publics

  4. Pingback: Metrics for Twitter Communities, Using TCAT and Tableau | Mapping Online Publics

  5. Pingback: Metrics for Analysing Twitter Communities, Using TCAT and Tableau | Mapping Online Publics