{"id":3254,"date":"2016-04-15T15:17:00","date_gmt":"2016-04-15T05:17:00","guid":{"rendered":"http:\/\/mappingonlinepublics.net\/?p=3254"},"modified":"2016-04-15T15:27:53","modified_gmt":"2016-04-15T05:27:53","slug":"twitter-analytics-using-tcat-and-tableau-via-gawk-and-bigquery","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2016\/04\/15\/twitter-analytics-using-tcat-and-tableau-via-gawk-and-bigquery\/","title":{"rendered":"Twitter Analytics Using TCAT and Tableau, via Gawk and BigQuery"},"content":{"rendered":"<p>I\u2019ve previously introduced <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/\">my TCAT-Process package<\/a> of helper scripts (written in Gawk), which take exports of <em>Twitter<\/em> data from the <em><a href=\"https:\/\/github.com\/digitalmethodsinitiative\/dmi-tcat\/wiki\">Twitter Capture and Analysis Toolkit (TCAT)<\/a><\/em>, developed by the Digital Methods Initiative at the University of Amsterdam, and convert them to a format that is best suited to using the data in the analytics software Tableau. This post is an update that provides the latest version of TCAT-Process, and outlines some of the alternative options that are now available through further developments in <em>TCAT<\/em> itself.<\/p>\n<p>But beyond these updates, the major new addition to this setup that I\u2019ll describe here is a process for uploading the data to Google BigQuery. BigQuery is a commercial but very affordable platform for working quickly with large datasets, and uploading the processed <em>Twitter<\/em> data to BigQuery can considerably speed up the analytics process \u2013 especially also because Tableau has the ability to connect natively to BigQuery databases. For large <em>Twitter<\/em> datasets \u2013 and when you\u2019re dealing with major global events, they can easily grow to the millions of tweets and generate gigabytes of data \u2013, this is a crucial advance which outsources a lot of the computationally intensive processing to Google\u2019s servers, rather than doing it locally on a desktop machine.<\/p>\n<p>None of this is entirely easy \u2013 partly also because recent changes to <em>TCAT<\/em>, although welcome where they\u2019ve added new functionality, have at times also changed the underlying data formats. But hopefully the following outline of steps will help you get your data from <em>TCAT <\/em>into Tableau without too many problems.<\/p>\n<p><strong>Please note:<\/strong> if your version of <em>TCAT<\/em> has options to export hashtag and mention tables, if your dataset is small enough for these exports to work reliably, and if you\u2019re not interested in resolving the URLs contained in your data, you won\u2019t need to use TCAT-Process at all. In that case, skip the section below that discusses TCAT-Process, and load your <em>TCAT<\/em> exports directly into BigQuery, following the steps described below.<\/p>\n<h3>Downloading <em>TCAT<\/em> Data<\/h3>\n<p>The first step in the process is necessarily to export the data from <em>TCAT<\/em>. Recent versions of <em>TCAT<\/em> have added an export option in tab-separated format (as opposed to comma-separated), and this is by far the preferred format for our further work.<\/p>\n<p>TCAT-Process is able to work with CSV files exported from <em>TCAT<\/em>, too, but if at all possible, choose TSV:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image1.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-width: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb1.png\" width=\"321\" height=\"116\" border=\"0\" \/><\/a><\/p>\n<p>Then, select the desired export parameters (dataset, timeframe, etc.), and <em>export all tweets<\/em> from your selection:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image2.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-width: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb2.png\" width=\"361\" height=\"77\" border=\"0\" \/><\/a><\/p>\n<p>If your version of <em>TCAT<\/em> is recent enough, you\u2019ll also have options to export hashtag and mentions tables. Depending on your server setup, these may not work very well for very large datasets (of a million tweets or more), and <em>TCAT<\/em> will time out or produce empty files if you try them \u2013 so export hashtags and mentions if you can, but if this doesn\u2019t work we\u2019ll use TCAT-Process to create them in a later step.<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image3.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-width: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb3.png\" width=\"636\" height=\"136\" border=\"0\" \/><\/a><\/p>\n<h3>Installing and Using TCAT-Process on the <em>TCAT<\/em> Data<\/h3>\n<p><em>(As noted, skip this section if you\u2019ve managed to export the full, hashtag, and mentions tables from TCAT and are not interested in resolving the URLs in your data.)<\/em><\/p>\n<p>First, download the latest version of TCAT-Process from here: <a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/tcat-process.zip\">tcat-process.zip<\/a>. This contains a directory named <em>_scripts<\/em> that should be placed in your central <em>Twitter<\/em> data folder, alongside the folders that your various datasets are stored in.<\/p>\n<p>If necessary, you also need to install Gawk and cURL, the tools which TCAT-Process relies on \u2013 see <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/\">my previous post<\/a> for more information on how to install these.<\/p>\n<p>I\u2019ve already outlined the inner workings of TCAT-Process <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-1\/\">in my previous post<\/a>, and won\u2019t repeat all of this here. However, there are a number of key additions to the processing options that are worth pointing out. Here is the full set of option switches that TCAT-Process now uses:<\/p>\n<ul>\n<li><strong>mac:<\/strong> file paths and other aspects work slightly different on Macs, compared to PCs. Set <em>mac=1<\/em> on a Mac, otherwise ignore.<\/li>\n<li><strong>path:<\/strong> by default, TCAT-Process expects its helper scripts to be located in a _scripts folder that sits next to the current working directory in the next higher folder (addressed as ..\/_scripts\/); if your scripts are located elsewhere, specify the path here. Make sure you include the trailing slash. On a PC, escape backslashes by adding a second backslash, e.g. <em>path=D:\\\\Twitter\\\\_scripts\\\\<\/em>.<\/li>\n<li><strong>file:<\/strong> the name of your full export file. Set file=<em>[filename.ext]<\/em>, e.g. file=Eurovision_2016-20160108-20160509&#8212;&#8212;&#8211;fullExport&#8211;6fa087d64b.tsv.<\/li>\n<li><strong>tcattype:<\/strong> the export format of your <em>TCAT<\/em> file. Set to either <em>tcattype=csv<\/em> or <em>tcattype=tsv<\/em>; defaults to <em>csv<\/em>.<\/li>\n<li><strong>nopreprocess:<\/strong> if for some reason you\u2019ve already preprocessed your export file, or if you\u2019re working with <em>Twitter<\/em> data from a source other than <em>TCAT<\/em>, set <em>preprocess=1<\/em>; otherwise ignore. If you do set <em>preprocess=1<\/em>, TCAT-Process expects an additional file named <em>[filename]-tweetexts.csv\/tsv<\/em> in its working directory, which contains a list of all tweets in the format <em>id,text<\/em>.<\/li>\n<li><strong>nohashtypes:<\/strong> if you\u2019ve successfully exported the hashtag and mention tables, you can keep TCAT-Process from extracting similar data from the full dataset. Set <em>nohashtypes=1<\/em> in that case; otherwise ignore.<\/li>\n<li><strong>nourls:<\/strong> by default, TCAT-Process will resolve all URLs in the dataset. This can be very time-consuming; set <em>nourls=1<\/em> to skip this step, or otherwise ignore.<\/li>\n<\/ul>\n<p>TCAT-Process will work best if you call the script from within the folder that contains your dataset. So, once you have downloaded your full dataset as <em>filename.tsv<\/em>, open a terminal window and make the dataset folder your current directory. To process the data file completely using TCAT-Process, use the following command:<\/p>\n<blockquote><p>gawk -f ..\\_scripts\\tcat-process.awk file=<em>filename.tsv<\/em> tcattype=tsv<\/p><\/blockquote>\n<p>If you\u2019ve successfully downloaded the hashtag and mention tables that TCAT generates, you can keep TCAT-Process from generating those tables again from the main dataset:<\/p>\n<blockquote><p>gawk -f ..\\_scripts\\tcat-process.awk file=<em>filename.tsv<\/em> tcattype=tsv nohashtypes=1<\/p><\/blockquote>\n<p>(In each case, use ..\/_scripts\/tcat-process.awk if you\u2019re on a Mac. I\u2019m assuming here that the _scripts folder sits within the directory that also contains the folder with your dataset.)<\/p>\n<p>Press return when prompted, and wait until TCAT-Process completes. This may take some time depending on the size of your dataset \u2013 for datasets of more than 100MB, it could stretch to hours or days. URL processing takes up the vast majority of this time, so use <em>nourls=1<\/em> if you\u2019re happy to skip that part. Also, while the URLs are still processing you can already get started on uploading the full, hashtags, and mentions datasets to BigQuery.<\/p>\n<h3>Uploading Your Data to BigQuery<\/h3>\n<p>First, <a href=\"https:\/\/cloud.google.com\/bigquery\/sign-up\">follow the instructions provided by Google<\/a> to set up a BigQuery account. Note that BigQuery isn\u2019t free \u2013 for very extensive datasets, you will eventually generate costs that would be charged to your credit card. That said, for the volume of <em>Twitter<\/em> data most of us will be working with, those costs are largely negligible, and will remain at a level of cents rather than dollars. See <a href=\"https:\/\/cloud.google.com\/bigquery\/pricing\">the pricing information provided by Google<\/a> for more details, and use the History tab on <a href=\"https:\/\/console.cloud.google.com\/billing\/\">the Billing page of the Google Cloud Platform<\/a> to keep track of your current costs.<\/p>\n<p>Once you are set up for Google BigQuery, we\u2019ll first need to create a new project to hold your data. Go to the <a href=\"https:\/\/console.cloud.google.com\/\">Google Cloud Console<\/a>, and from the pull-down menu in the header select the <em>Create a project\u2026<\/em> option. In the dialogue that pops up, choose a meaningful name for your project:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image4.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb4.png\" width=\"454\" height=\"257\" border=\"0\" \/><\/a><\/p>\n<p>Next, use the Dashboard option to create a Cloud Storage bucket. A \u2018bucket\u2019 is a container for your uploaded data files. Give this a meaningful name, e.g. <em>tcat_exports<\/em>:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image5.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb5.png\" width=\"317\" height=\"171\" border=\"0\" \/><\/a><\/p>\n<p>You should now see an empty file list for your new bucket:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image6.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb6.png\" width=\"675\" height=\"129\" border=\"0\" \/><\/a><\/p>\n<p>To upload your data files to the Cloud Storage bucket, simply drag them onto the file list in your browser, and an upload progress indicator will appear. For each dataset, you should have up to four files to upload: your full data export from <em>TCAT<\/em> (filename <em>\u2026fullExport\u2026.tsv<\/em>), the mentions and hashtag tables from <em>TCAT<\/em> if you have them (filenames <em>\u2026mentionExport\u2026.tsv<\/em> and <em>\u2026hashtagExport\u2026.tsv<\/em>), and\/or the mentions, hashtags, and URL files you\u2019ve created using TCAT-Process (filenames <em>\u2026-tweettypes.tsv<\/em>, <em>\u2026-hashtags.tsv<\/em>, and <em>\u2026-urlpaths.tsv<\/em>). There is no need to upload any of the other files that TCAT-Process has created (and in fact you\u2019re welcome to delete these at this point). Depending on the size of your datasets, and on the quality of your network connection, the upload may take some time \u2013 keep your browser window open while it progresses:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image7.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb7.png\" width=\"475\" height=\"194\" border=\"0\" \/><\/a><\/p>\n<p>The uploaded files will appear in the bucket browser \u2013 and once they\u2019re all there, we\u2019re ready to import them into BigQuery. Use the Google Cloud Platform menu sidebar to go to <a href=\"https:\/\/bigquery.cloud.google.com\/\">Google BigQuery<\/a> (the menu has many options, so you may want to search for \u2018bigquery\u2019):<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image8.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb8.png\" width=\"291\" height=\"144\" border=\"0\" \/><\/a><\/p>\n<p>The <em>Twitter<\/em> data project you\u2019ve created should already be preselected. In the left sidebar, click on the pull-down arrow next to your project title, and create a new dataset. Give this a meaningful name that describes the contents of your data (for my example below, I\u2019ll be using \u2018eurovision\u2019, since that\u2019s what my <em>TCAT<\/em> dataset is about):<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image9.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb9.png\" width=\"474\" height=\"131\" border=\"0\" \/><\/a><\/p>\n<p>When the dataset appears under your project, mouse over it and click the <em>+<\/em> icon that appears to create a new data table:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image10.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb10.png\" width=\"456\" height=\"107\" border=\"0\" \/><\/a><\/p>\n<p>In the <em>Create Table<\/em> dialogue that now appears, choose the following settings (start with the full export file, and repeat this for each subsequent file):<\/p>\n<ul>\n<li>Location: choose <em>Google Cloud Storage<\/em>, and enter the name of your data bucket and the name of the uploaded file, as in the example below (I have named my storage bucket <em>tcat_exports<\/em>).<\/li>\n<li>File format: select <em>CSV<\/em> (we\u2019re usually working with TSVs, of course, but BigQuery doesn\u2019t distinguish between the two).<\/li>\n<li>Table name: stick with the preselected dataset name (in the example below that\u2019s <em>eurovision<\/em>), and enter an appropriate name for the new table (e.g. <em>fullexport<\/em>, <em>hashtagexport<\/em>, <em>mentionexport<\/em>, <em>tweettypes<\/em>, <em>hashtags<\/em>, <em>urlpaths<\/em>). Note that you can only use letters, numbers, and underscores in table names.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image11.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb11.png\" width=\"652\" height=\"246\" border=\"0\" \/><\/a><\/p>\n<p>Next, we need to define an import schema for the new table. This schema describes the table you\u2019re importing, and is therefore different for each table type. I\u2019ve created a number of presets that describe each type \u2013 download and unpack the following file: <a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/bq-import-schemas.zip\">bq-import-schemas.zip<\/a>. Use a text editor to open the file that describes the table you\u2019re importing:<\/p>\n<ul>\n<li><em>TCAT<\/em> full export: <em>TCAT BQ import schema (TCAT full export).txt<\/em><\/li>\n<li><em>TCAT<\/em> hashtags export: <em>TCAT BQ import schema (TCAT hashtags table).txt<\/em><\/li>\n<li><em>TCAT<\/em> mentions export: <em>TCAT BQ import schema (TCAT mentions table).txt<\/em><\/li>\n<li>TCAT-Process hashtags file: <em>TCAT BQ import schema (tcat-process hashtags table).txt<\/em><\/li>\n<li>TCAT-Process tweettypes file: <em>TCAT BQ import schema (tcat-process tweettypes table).txt<\/em><\/li>\n<li>TCAT-Process urlpaths file: <em>TCAT BQ import schema (tcat-process urlpaths table).txt<\/em><\/li>\n<\/ul>\n<p>Select the entire schema as shown in your text editor, and copy it to your clipboard. Back in BigQuery, below the Schema settings, click on <em>Edit as Text<\/em> and paste the schema into the textbox that appears:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image12.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-width: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb12.png\" width=\"663\" height=\"110\" border=\"0\" \/><\/a><\/p>\n<p>Finally, in the Options settings choose <em>Comma<\/em> or <em>Tab<\/em> depending on the format of your datafiles, set <em>Header rows to skip<\/em> to 1, and tick <em>Allow quoted newlines<\/em> and <em>Allow jagged rows<\/em>:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image13.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-width: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb13.png\" width=\"516\" height=\"213\" border=\"0\" \/><\/a><\/p>\n<p>When you\u2019re all set, click <em>Create Table<\/em> and wait until the process completes. BigQuery will automatically switch to the Job History list, which shows the progress of your current data import job; click on the job to see more details. Once the job is complete, the icon to the left of the job will turn green.<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image14.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border-width: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb14.png\" width=\"642\" height=\"40\" border=\"0\" \/><\/a><\/p>\n<p>If it turns red to indicate an error, check for what went wrong and click <em>Repeat Load Job<\/em>. In case there are any problems with your data files, you might want to adjust the <em>Number of errors<\/em> allowed setting in Options (say to a value of 10), or tick <em>Ignore unknown values<\/em>. (Such errors may be caused by particularly unusual characters in tweet texts: I\u2019ve noticed that in its current version, the character combination \\&#8221; is incorrectly encoded by <em>TCAT<\/em> as it exports the data, and confuses BigQuery as it parses the source file. But these errors should be very rare.)<\/p>\n<p>If there is a persistent problem with importing your full export file from <em>TCAT<\/em>, you may be running an older version of <em>TCAT<\/em> which exports a different number of data fields, in a different order. In that case, try the import schema in <em>TCAT BQ import schema (TCAT full export, 35 fields).txt<\/em>, rather than the schema in <em>TCAT BQ import schema (TCAT full export).txt<\/em>. (To check which version you are running, export a small dataset from <em>TCAT<\/em> and open the export file in Excel or a text editor. The older <em>TCAT<\/em> version will export 35 fields, the later version exports 36 fields. (There is also a much older version which exported only 22 fields, but hopefully most people will have moved on from this by now.)<\/p>\n<p>Repeat these import steps \u2013 with the appropriate import schema in each case \u2013 for each of your data files, so that in the end you have up to four BigQuery tables (e.g. <em>fullexport<\/em>, <em>tweettypes<\/em>, <em>hashtags<\/em>, <em>urlpaths<\/em> \u2013 or, if you\u2019re using the original <em>TCAT<\/em> exports of mentions and hashtags, <em>fullexport<\/em>, <em>mentionexport<\/em>, <em>hashtagexport<\/em>, <em>urlpaths<\/em>).<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image15.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb15.png\" width=\"269\" height=\"168\" border=\"0\" \/><\/a><\/p>\n<p>Note that if your source data are split across multiple data files \u2013 for instance because you\u2019ve exported your dataset from <em>TCAT<\/em> every week or every month \u2013 you can also import multiple data files of the same type into the same table: the new data will simply be appended to the table, and the order in which we import the data into the table doesn\u2019t matter because we\u2019re accessing the content through Tableau anyway. So, for instance, if you have <em>TCAT<\/em>\u00a0<em>\u2026fullExport\u2026<\/em> files for April, May, and June, use the approach described above for each of these files, but direct them all to a single <em>fullexport<\/em> table. Then, load the corresponding <em>tweettypes<\/em> files for the April, May, and June exports all into the same <em>tweettypes<\/em> table on BigQuery \u2013 and do the same for <em>hashtags<\/em> and <em>urlpaths<\/em>. You could also repeat this process for every future month, and so gradually add new data to your dataset as you continue your data-gathering activities.<\/p>\n<p>At the conclusion of this process, all of our data are now in BigQuery and ready for use \u2013 which means that it\u2019s time to switch to Tableau and connect to our data from there. Alternatively, of course, you could also query your datasets by using BigQuery\u2019s own SQL query tools, or any other software that connects to BigQuery<\/p>\n<h3>Accessing the Data through Tableau<\/h3>\n<p>Connecting to the data in Tableau is now very straightforward. Open Tableau, and select <em>Google BigQuery<\/em> from the Connect menu:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image16.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb16.png\" width=\"536\" height=\"467\" border=\"0\" \/><\/a><\/p>\n<p>Now select your Project and Dataset \u2013 in my example, that\u2019s <em>TCAT Data<\/em> and <em>eurovision<\/em>:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image17.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb17.png\" width=\"303\" height=\"269\" border=\"0\" \/><\/a><\/p>\n<p>Drag the table that contains your full export data onto the <em>Drag tables here<\/em> canvas first \u2013 I\u2019ve named it <em>fullexport<\/em>:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image18.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb18.png\" width=\"300\" height=\"99\" border=\"0\" \/><\/a><\/p>\n<p>Now drop your other tables onto the same canvas. If you\u2019ve processed everything using TCAT-Process, the results should look like this, and Tableau will already have guessed how you want to link these tables (it\u2019s not quite correct in its guess, but we\u2019ll address this in a second step):<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image19.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb19.png\" width=\"547\" height=\"168\" border=\"0\" \/><\/a><\/p>\n<p>For each of the linked tables, click on the overlapping circles and choose <em>Left Join<\/em> rather than <em>Inner Join<\/em>:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image20.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb20.png\" width=\"539\" height=\"166\" border=\"0\" \/><\/a><\/p>\n<p>The end result should look as follows (though of course you may have chosen different table names, and there may not be a <em>urlpaths<\/em> table if you\u2019ve chosen not to resolve URLs):<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image21.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb21.png\" width=\"536\" height=\"166\" border=\"0\" \/><\/a><\/p>\n<p>Alternatively, if you\u2019re using the <em>hashtagexport<\/em> and <em>mentionsexport<\/em> tables from <em>TCAT<\/em> itself, Tableau will want you to select how these should be linked. Choose a <em>Left Join<\/em>, and link the tables on <em>Id = Tweet Id<\/em> in each case:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image22.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb22.png\" width=\"536\" height=\"174\" border=\"0\" \/><\/a><\/p>\n<p>Then add the <em>urlpaths<\/em> table (if you\u2019ve resolved the URLs), and make sure it\u2019s connected with the full export table as a <em>Left Join<\/em> as well. The end result should look like this:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image23.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb23.png\" width=\"539\" height=\"172\" border=\"0\" \/><\/a><\/p>\n<p>If you\u2019re familiar with using Tableau to analyse the data in offline CSV and TSV files, you may be tempted to also create a data extract at this stage. This is not necessary when we\u2019re working with BigQuery, because all of the data processing happens on Google\u2019s servers. Creating an extract on the client side would in fact slow down your data processing considerably, especially when working with very large datasets \u2013 so stick with a <em>Live<\/em> connection:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image24.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb24.png\" width=\"168\" height=\"54\" border=\"0\" \/><\/a><\/p>\n<p>When you\u2019ve completed this setup, click Sheet 1 to go to a blank worksheet:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image25.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; margin: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb25.png\" width=\"281\" height=\"112\" border=\"0\" \/><\/a><\/p>\n<p>You\u2019re now almost ready to analyse your data as you see fit. Two more setup steps will be useful, though: first, in analysing tweet volumes you\u2019ll usually want to count the number of distinct tweet IDs, rather than simply the number of records in your dataset, as this provides the most accurate count. So, in the sidebar on the left, drag <em>Id<\/em> from Dimensions to Measures, where it will automatically be set to use Count (Distinct):<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image26.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb26.png\" width=\"208\" height=\"169\" border=\"0\" \/><\/a><\/p>\n<p>(To trigger this behaviour, our BigQuery import schema declared Id to be a string rather than a number, as you may have noticed earlier. While in principle, tweet IDs <em>are<\/em> numbers, of course, in our analysis we\u2019ll never actually use them as such \u2013 what would be the point of calculating the sum or average of a bunch of IDs, after all?)<\/p>\n<p>Second, the <em>Time<\/em> field contains the timestamp for each tweet, but is set to UTC by default (it\u2019s calculated automatically by BigQuery from a numerical Unix timestamp). To adjust this to the local time most appropriate for your further analysis, create a calculated field that adds or subtracts a number of hours from Time. Hover your mouse pointer over the <em>Time<\/em> field, click the down arrow that appears, and select <em>Create &gt; Calculated Field\u2026<\/em>:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image27.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; margin: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb27.png\" width=\"521\" height=\"161\" border=\"0\" \/><\/a><\/p>\n<p>Give the field a name (for instance the code for the timezone you\u2019re shifting to), and enter the following formula:<\/p>\n<blockquote><p>DATEADD(&#8216;hour&#8217;,10,[Time])<\/p><\/blockquote>\n<p>In my example, to shift UTC to Australian Eastern Standard Time (AEST, or UTC+10), this is the formula to use:<\/p>\n<p><a href=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image28.png\"><img loading=\"lazy\" decoding=\"async\" style=\"background-image: none; padding-top: 0px; padding-left: 0px; display: inline; padding-right: 0px; border: 0px;\" title=\"image\" alt=\"image\" src=\"https:\/\/mappingonlinepublics.net\/dev\/wp-content\/uploads\/2016\/04\/image_thumb28.png\" width=\"243\" height=\"93\" border=\"0\" \/><\/a><\/p>\n<p>Click OK, and at the bottom of the Dimensions sidebar you should now find a new field with the name you\u2019ve given it (e.g. <em>AEST<\/em>). Use this rather than <em>Time<\/em> itself for any further time-based analyses.<\/p>\n<p>And with this final step, you\u2019re ready to analyse your data, and benefit from the vastly improved analysis speeds that Google BigQuery offers, especially when you\u2019re dealing with very large datasets. Some ideas for the range of analytical approaches you might want to pursue are outlined in my previous posts on using Tableau to analyse <em>TCAT<\/em> data \u2013 so perhaps start <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/02\/using-gawk-to-prepare-tcat-data-for-tableau-part-2\/\">here<\/a> or <a href=\"http:\/\/mappingonlinepublics.net\/2015\/03\/31\/metrics-for-analysing-twitter-communities-using-tcat-and-tableau\/\">here<\/a>?<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>I\u2019ve previously introduced my TCAT-Process package of helper scripts (written in Gawk), which take exports of Twitter data from the Twitter Capture and Analysis Toolkit (TCAT), developed by the Digital Methods Initiative at the University of Amsterdam, and convert them to a format that is best suited to using the data in the analytics software &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2016\/04\/15\/twitter-analytics-using-tcat-and-tableau-via-gawk-and-bigquery\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Twitter Analytics Using TCAT and Tableau, via Gawk and BigQuery&#8221;<\/span><\/a><\/p>\n<p><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":2,"featured_media":3243,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[176,8],"tags":[305,286,7,284,291,298],"class_list":["post-3254","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-processing","category-twitter","tag-bigquery","tag-data-processing","tag-gawk","tag-tableau","tag-tcat","tag-twitter","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/3254","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=3254"}],"version-history":[{"count":2,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/3254\/revisions"}],"predecessor-version":[{"id":3256,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/3254\/revisions\/3256"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media\/3243"}],"wp:attachment":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media?parent=3254"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=3254"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=3254"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}