I’ve previously introduced my TCAT-Process package of helper scripts (written in Gawk), which take exports of Twitter data from the Twitter Capture and Analysis Toolkit (TCAT), developed by the Digital Methods Initiative at the University of Amsterdam, and convert them to a format that is best suited to using the data in the analytics software Tableau. This post is an update that provides the latest version of TCAT-Process, and outlines some of the alternative options that are now available through further developments in TCAT itself.
But beyond these updates, the major new addition to this setup that I’ll describe here is a process for uploading the data to Google BigQuery. BigQuery is a commercial but very affordable platform for working quickly with large datasets, and uploading the processed Twitter data to BigQuery can considerably speed up the analytics process – especially also because Tableau has the ability to connect natively to BigQuery databases. For large Twitter datasets – and when you’re dealing with major global events, they can easily grow to the millions of tweets and generate gigabytes of data –, this is a crucial advance which outsources a lot of the computationally intensive processing to Google’s servers, rather than doing it locally on a desktop machine.
None of this is entirely easy – partly also because recent changes to TCAT, although welcome where they’ve added new functionality, have at times also changed the underlying data formats. But hopefully the following outline of steps will help you get your data from TCAT into Tableau without too many problems.
Please note: if your version of TCAT has options to export hashtag and mention tables, if your dataset is small enough for these exports to work reliably, and if you’re not interested in resolving the URLs contained in your data, you won’t need to use TCAT-Process at all. In that case, skip the section below that discusses TCAT-Process, and load your TCAT exports directly into BigQuery, following the steps described below.
Downloading TCAT Data
The first step in the process is necessarily to export the data from TCAT. Recent versions of TCAT have added an export option in tab-separated format (as opposed to comma-separated), and this is by far the preferred format for our further work.
TCAT-Process is able to work with CSV files exported from TCAT, too, but if at all possible, choose TSV:
Then, select the desired export parameters (dataset, timeframe, etc.), and export all tweets from your selection:
If your version of TCAT is recent enough, you’ll also have options to export hashtag and mentions tables. Depending on your server setup, these may not work very well for very large datasets (of a million tweets or more), and TCAT will time out or produce empty files if you try them – so export hashtags and mentions if you can, but if this doesn’t work we’ll use TCAT-Process to create them in a later step.
Installing and Using TCAT-Process on the TCAT Data
(As noted, skip this section if you’ve managed to export the full, hashtag, and mentions tables from TCAT and are not interested in resolving the URLs in your data.)
First, download the latest version of TCAT-Process from here: tcat-process.zip. This contains a directory named _scripts that should be placed in your central Twitter data folder, alongside the folders that your various datasets are stored in.
If necessary, you also need to install Gawk and cURL, the tools which TCAT-Process relies on – see my previous post for more information on how to install these.
I’ve already outlined the inner workings of TCAT-Process in my previous post, and won’t repeat all of this here. However, there are a number of key additions to the processing options that are worth pointing out. Here is the full set of option switches that TCAT-Process now uses:
- mac: file paths and other aspects work slightly different on Macs, compared to PCs. Set mac=1 on a Mac, otherwise ignore.
- path: by default, TCAT-Process expects its helper scripts to be located in a _scripts folder that sits next to the current working directory in the next higher folder (addressed as ../_scripts/); if your scripts are located elsewhere, specify the path here. Make sure you include the trailing slash. On a PC, escape backslashes by adding a second backslash, e.g. path=D:\\Twitter\\_scripts\\.
- file: the name of your full export file. Set file=[filename.ext], e.g. file=Eurovision_2016-20160108-20160509——–fullExport–6fa087d64b.tsv.
- tcattype: the export format of your TCAT file. Set to either tcattype=csv or tcattype=tsv; defaults to csv.
- nopreprocess: if for some reason you’ve already preprocessed your export file, or if you’re working with Twitter data from a source other than TCAT, set preprocess=1; otherwise ignore. If you do set preprocess=1, TCAT-Process expects an additional file named [filename]-tweetexts.csv/tsv in its working directory, which contains a list of all tweets in the format id,text.
- nohashtypes: if you’ve successfully exported the hashtag and mention tables, you can keep TCAT-Process from extracting similar data from the full dataset. Set nohashtypes=1 in that case; otherwise ignore.
- nourls: by default, TCAT-Process will resolve all URLs in the dataset. This can be very time-consuming; set nourls=1 to skip this step, or otherwise ignore.
TCAT-Process will work best if you call the script from within the folder that contains your dataset. So, once you have downloaded your full dataset as filename.tsv, open a terminal window and make the dataset folder your current directory. To process the data file completely using TCAT-Process, use the following command:
gawk -f ..\_scripts\tcat-process.awk file=filename.tsv tcattype=tsv
If you’ve successfully downloaded the hashtag and mention tables that TCAT generates, you can keep TCAT-Process from generating those tables again from the main dataset:
gawk -f ..\_scripts\tcat-process.awk file=filename.tsv tcattype=tsv nohashtypes=1
(In each case, use ../_scripts/tcat-process.awk if you’re on a Mac. I’m assuming here that the _scripts folder sits within the directory that also contains the folder with your dataset.)
Press return when prompted, and wait until TCAT-Process completes. This may take some time depending on the size of your dataset – for datasets of more than 100MB, it could stretch to hours or days. URL processing takes up the vast majority of this time, so use nourls=1 if you’re happy to skip that part. Also, while the URLs are still processing you can already get started on uploading the full, hashtags, and mentions datasets to BigQuery.
Uploading Your Data to BigQuery
First, follow the instructions provided by Google to set up a BigQuery account. Note that BigQuery isn’t free – for very extensive datasets, you will eventually generate costs that would be charged to your credit card. That said, for the volume of Twitter data most of us will be working with, those costs are largely negligible, and will remain at a level of cents rather than dollars. See the pricing information provided by Google for more details, and use the History tab on the Billing page of the Google Cloud Platform to keep track of your current costs.
Once you are set up for Google BigQuery, we’ll first need to create a new project to hold your data. Go to the Google Cloud Console, and from the pull-down menu in the header select the Create a project… option. In the dialogue that pops up, choose a meaningful name for your project:
Next, use the Dashboard option to create a Cloud Storage bucket. A ‘bucket’ is a container for your uploaded data files. Give this a meaningful name, e.g. tcat_exports:
You should now see an empty file list for your new bucket:
To upload your data files to the Cloud Storage bucket, simply drag them onto the file list in your browser, and an upload progress indicator will appear. For each dataset, you should have up to four files to upload: your full data export from TCAT (filename …fullExport….tsv), the mentions and hashtag tables from TCAT if you have them (filenames …mentionExport….tsv and …hashtagExport….tsv), and/or the mentions, hashtags, and URL files you’ve created using TCAT-Process (filenames …-tweettypes.tsv, …-hashtags.tsv, and …-urlpaths.tsv). There is no need to upload any of the other files that TCAT-Process has created (and in fact you’re welcome to delete these at this point). Depending on the size of your datasets, and on the quality of your network connection, the upload may take some time – keep your browser window open while it progresses:
The uploaded files will appear in the bucket browser – and once they’re all there, we’re ready to import them into BigQuery. Use the Google Cloud Platform menu sidebar to go to Google BigQuery (the menu has many options, so you may want to search for ‘bigquery’):
The Twitter data project you’ve created should already be preselected. In the left sidebar, click on the pull-down arrow next to your project title, and create a new dataset. Give this a meaningful name that describes the contents of your data (for my example below, I’ll be using ‘eurovision’, since that’s what my TCAT dataset is about):
When the dataset appears under your project, mouse over it and click the + icon that appears to create a new data table:
In the Create Table dialogue that now appears, choose the following settings (start with the full export file, and repeat this for each subsequent file):
- Location: choose Google Cloud Storage, and enter the name of your data bucket and the name of the uploaded file, as in the example below (I have named my storage bucket tcat_exports).
- File format: select CSV (we’re usually working with TSVs, of course, but BigQuery doesn’t distinguish between the two).
- Table name: stick with the preselected dataset name (in the example below that’s eurovision), and enter an appropriate name for the new table (e.g. fullexport, hashtagexport, mentionexport, tweettypes, hashtags, urlpaths). Note that you can only use letters, numbers, and underscores in table names.
Next, we need to define an import schema for the new table. This schema describes the table you’re importing, and is therefore different for each table type. I’ve created a number of presets that describe each type – download and unpack the following file: bq-import-schemas.zip. Use a text editor to open the file that describes the table you’re importing:
- TCAT full export: TCAT BQ import schema (TCAT full export).txt
- TCAT hashtags export: TCAT BQ import schema (TCAT hashtags table).txt
- TCAT mentions export: TCAT BQ import schema (TCAT mentions table).txt
- TCAT-Process hashtags file: TCAT BQ import schema (tcat-process hashtags table).txt
- TCAT-Process tweettypes file: TCAT BQ import schema (tcat-process tweettypes table).txt
- TCAT-Process urlpaths file: TCAT BQ import schema (tcat-process urlpaths table).txt
Select the entire schema as shown in your text editor, and copy it to your clipboard. Back in BigQuery, below the Schema settings, click on Edit as Text and paste the schema into the textbox that appears:
Finally, in the Options settings choose Comma or Tab depending on the format of your datafiles, set Header rows to skip to 1, and tick Allow quoted newlines and Allow jagged rows:
When you’re all set, click Create Table and wait until the process completes. BigQuery will automatically switch to the Job History list, which shows the progress of your current data import job; click on the job to see more details. Once the job is complete, the icon to the left of the job will turn green.
If it turns red to indicate an error, check for what went wrong and click Repeat Load Job. In case there are any problems with your data files, you might want to adjust the Number of errors allowed setting in Options (say to a value of 10), or tick Ignore unknown values. (Such errors may be caused by particularly unusual characters in tweet texts: I’ve noticed that in its current version, the character combination \” is incorrectly encoded by TCAT as it exports the data, and confuses BigQuery as it parses the source file. But these errors should be very rare.)
If there is a persistent problem with importing your full export file from TCAT, you may be running an older version of TCAT which exports a different number of data fields, in a different order. In that case, try the import schema in TCAT BQ import schema (TCAT full export, 35 fields).txt, rather than the schema in TCAT BQ import schema (TCAT full export).txt. (To check which version you are running, export a small dataset from TCAT and open the export file in Excel or a text editor. The older TCAT version will export 35 fields, the later version exports 36 fields. (There is also a much older version which exported only 22 fields, but hopefully most people will have moved on from this by now.)
Repeat these import steps – with the appropriate import schema in each case – for each of your data files, so that in the end you have up to four BigQuery tables (e.g. fullexport, tweettypes, hashtags, urlpaths – or, if you’re using the original TCAT exports of mentions and hashtags, fullexport, mentionexport, hashtagexport, urlpaths).
Note that if your source data are split across multiple data files – for instance because you’ve exported your dataset from TCAT every week or every month – you can also import multiple data files of the same type into the same table: the new data will simply be appended to the table, and the order in which we import the data into the table doesn’t matter because we’re accessing the content through Tableau anyway. So, for instance, if you have TCAT …fullExport… files for April, May, and June, use the approach described above for each of these files, but direct them all to a single fullexport table. Then, load the corresponding tweettypes files for the April, May, and June exports all into the same tweettypes table on BigQuery – and do the same for hashtags and urlpaths. You could also repeat this process for every future month, and so gradually add new data to your dataset as you continue your data-gathering activities.
At the conclusion of this process, all of our data are now in BigQuery and ready for use – which means that it’s time to switch to Tableau and connect to our data from there. Alternatively, of course, you could also query your datasets by using BigQuery’s own SQL query tools, or any other software that connects to BigQuery
Accessing the Data through Tableau
Connecting to the data in Tableau is now very straightforward. Open Tableau, and select Google BigQuery from the Connect menu:
Now select your Project and Dataset – in my example, that’s TCAT Data and eurovision:
Drag the table that contains your full export data onto the Drag tables here canvas first – I’ve named it fullexport:
Now drop your other tables onto the same canvas. If you’ve processed everything using TCAT-Process, the results should look like this, and Tableau will already have guessed how you want to link these tables (it’s not quite correct in its guess, but we’ll address this in a second step):
For each of the linked tables, click on the overlapping circles and choose Left Join rather than Inner Join:
The end result should look as follows (though of course you may have chosen different table names, and there may not be a urlpaths table if you’ve chosen not to resolve URLs):
Alternatively, if you’re using the hashtagexport and mentionsexport tables from TCAT itself, Tableau will want you to select how these should be linked. Choose a Left Join, and link the tables on Id = Tweet Id in each case:
Then add the urlpaths table (if you’ve resolved the URLs), and make sure it’s connected with the full export table as a Left Join as well. The end result should look like this:
If you’re familiar with using Tableau to analyse the data in offline CSV and TSV files, you may be tempted to also create a data extract at this stage. This is not necessary when we’re working with BigQuery, because all of the data processing happens on Google’s servers. Creating an extract on the client side would in fact slow down your data processing considerably, especially when working with very large datasets – so stick with a Live connection:
When you’ve completed this setup, click Sheet 1 to go to a blank worksheet:
You’re now almost ready to analyse your data as you see fit. Two more setup steps will be useful, though: first, in analysing tweet volumes you’ll usually want to count the number of distinct tweet IDs, rather than simply the number of records in your dataset, as this provides the most accurate count. So, in the sidebar on the left, drag Id from Dimensions to Measures, where it will automatically be set to use Count (Distinct):
(To trigger this behaviour, our BigQuery import schema declared Id to be a string rather than a number, as you may have noticed earlier. While in principle, tweet IDs are numbers, of course, in our analysis we’ll never actually use them as such – what would be the point of calculating the sum or average of a bunch of IDs, after all?)
Second, the Time field contains the timestamp for each tweet, but is set to UTC by default (it’s calculated automatically by BigQuery from a numerical Unix timestamp). To adjust this to the local time most appropriate for your further analysis, create a calculated field that adds or subtracts a number of hours from Time. Hover your mouse pointer over the Time field, click the down arrow that appears, and select Create > Calculated Field…:
Give the field a name (for instance the code for the timezone you’re shifting to), and enter the following formula:
In my example, to shift UTC to Australian Eastern Standard Time (AEST, or UTC+10), this is the formula to use:
Click OK, and at the bottom of the Dimensions sidebar you should now find a new field with the name you’ve given it (e.g. AEST). Use this rather than Time itself for any further time-based analyses.
And with this final step, you’re ready to analyse your data, and benefit from the vastly improved analysis speeds that Google BigQuery offers, especially when you’re dealing with very large datasets. Some ideas for the range of analytical approaches you might want to pursue are outlined in my previous posts on using Tableau to analyse TCAT data – so perhaps start here or here?