Snurb, 22 June 2011

Well, getting stuck in Melbourne for a day and being unable to participate in day one of our ATN-DAAD workshop with Cornelius Puschmann and Katrin Weller from the University of Düsseldorf has at least enabled me to put the finishing touches on something I’ve been meaning to do for some time: to collect and share the various Gawk scripts for processing Twitter data collected by Twapperkeeper or our modified yourTwapperkeeper. A ZIP file of all our (half-way decent) scripts is now available on the Tools section of our site.

These scripts enable the processing of comma- or tab-separated value files containing tweets related to specific hashtags or keywords, as Twapperkeeper used to produce them, and as yourTwapperkeeper does once you’ve installed the modified export functions which I shared in a previous post.

There are too many scripts here to provide instructions for individually; I have already discussed some of them in previous posts over the past year or so, however, and all of them contain some instructions on how to use them in their headers. To see those instructions, simply open the scripts in a text editor of your choice. I have also included in the ZIP file a brief list of all the scripts made available here, with a short description of what each does. Eventually, I may write up some more detailed instructions, but certainly not now.

The latest version of this scripts package will always be available from the Tools section of this Website.

The Readme file included with the ZIP also provides some general instructions – here are the most important ones:

Installation

All scripts should be placed in a central directory which is easily accessible from the command line interface. The urlresolve.awk script requires the open source tool wget to be installed and in the command path; it also needs to be modified so that the ‘path’ variable points to the directory containing the scripts (relative or absolute paths are acceptable; relative paths must be relative to the location from which the scripts are intended to be executed). Paths must conform to standard PC, Mac, or Linux notation as appropriate; special characters (e.g. backslash) need to be escaped.
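For illustration, the ‘path’ assignment might look something like the following sketch — the variable name ‘path’ is taken from the Readme, but the surrounding code and the example directories are invented here, so check urlresolve.awk itself for the actual line:

```awk
BEGIN {
    # Windows-style path: backslashes must be escaped
    # path = "C:\\Twitter\\scripts\\"

    # Mac/Linux-style absolute path (example directory, adjust to your setup):
    path = "/home/user/twitter-scripts/"
}
```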

Additionally, of course, it is assumed that Gawk is installed and in the command path.

Usage

A brief overview of the scripts and their respective functions is provided in the Quick Guide file included in this archive.

Generally, scripts should be called as follows:

gawk -F , -f [script].awk [argument]="[parameters]" input.csv >output.csv
(for Twapperkeeper files in comma-separated value format)

or

gawk -F "\t" -f [script].awk [argument]="[parameters]" input.tsv >output.tsv
(for Twapperkeeper files in tab-separated value format)

Some scripts do not take any command line arguments; others may take several. All scripts are able to process both comma- and tab-separated value formats (CSV/TSV), and will usually return their results in the same format.
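As a concrete, self-contained illustration of this calling pattern — the file names, column number, and inline program below are invented for the example and are not part of the scripts package — an `[argument]="[parameters]"` assignment is simply placed in the file-argument position and becomes a variable inside the script:

```shell
# Create a tiny comma-separated sample file
printf 'user1,hello world\nuser2,another tweet\n' > input.csv

# Same calling pattern as above, with an inline program standing in for -f [script].awk;
# the assignment col=2 takes the place of [argument]="[parameters]"
awk -F , '{ print $col }' col=2 input.csv > output.csv

cat output.csv
```

The same pattern works unchanged with `gawk -f [script].awk` in place of the inline program.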

The exceptions to this rule are atextractfromtoonly.awk, preparegexfattimeintervals.awk, and gexfattimeintervals.awk: the first two output CSV only, while the third generates a GEXF file. This is necessary since the network visualisation tool Gephi only processes CSV or GEXF formats.

Known Issues

The Mac version of Gawk does not implement the ‘switch’ statement (sigh); atreplycount.awk and multicount.awk will therefore not work. Mac Gawk can be recompiled to include ‘switch’; search the Web for instructions on how to do so. A future revision of these scripts will include a workaround that replaces ‘switch’ with ‘if/else’ constructions.
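The planned workaround can be sketched as follows — the classification rule here is invented purely for illustration and is not the actual logic of atreplycount.awk or multicount.awk; it simply shows the ‘switch’-free construction, which also runs on awk builds that lack ‘switch’:

```shell
printf 'RT @user some text\n@user a reply\nplain tweet\n' | awk '{
    # portable equivalent of:
    #   switch ($1) { case "RT": ...; case /^@/: ...; default: ... }
    if ($1 == "RT")        print "retweet"
    else if ($1 ~ /^@/)    print "at-reply"
    else                   print "original"
}'
```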

If used with the ‘stats’ command line argument, atreplycount.awk will produce a division-by-zero error if any of the usernames specified with the ‘search’ command line argument did not tweet or receive any @replies/RTs in the dataset being processed. Remove these usernames from the command line argument if the problem occurs. A fix will be made available in a future revision of these scripts.
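The kind of guard such a fix might use can be sketched as follows — the variable names are invented for illustration and this is not code from the actual scripts:

```shell
awk 'BEGIN {
    # hypothetical per-user counters; sent = 0 models a user
    # who never tweeted in the dataset
    sent = 0; received = 7
    if (sent > 0)
        printf "%.2f\n", received / sent
    else
        print "n/a"    # avoid the division-by-zero error
}'
```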

Licence and Acknowledgement

All scripts are provided as is; while every effort has been made to ensure their reliability, no warranty of accuracy or reliability is given or implied.

If you publish research which was conducted using these scripts, please acknowledge this. The script package can be cited as follows (modify for other bibliographic referencing schemes as required):

Axel Bruns and Jean Burgess. "Gawk Scripts for Twitter Processing." v1.0. Mapping Online Publics, 22 June 2011. <http://mappingonlinepublics.net/resources/>.

About the Author

Dr Axel Bruns leads the QUT Social Media Research Group. He is an ARC Future Fellow and Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia. Bruns is the author of Blogs, Wikipedia, Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative Online News Production (2005), and a co-editor of Twitter and Society, A Companion to New Media Dynamics, and Uses of Blogs (2006). He is a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation. His research Website is at snurb.info, and he tweets as @snurb_dot_info.
