Our work with Twitter data builds on a number of tools, and many posts on the blog describe how we use them. These are our key tools:

Data Gathering

yourTwapperkeeper (yTK) – an open source platform that builds on the popular Twapperkeeper (TK) Web service. Both capture tweets containing particular hashtags or keywords.

We have further extended and modified the yourTwapperkeeper platform to ensure compatibility between TK and yTK datasets and to allow data export in comma- and tab-separated formats. These modifications are described here; the modified yTK PHP scripts are available here: (9.4 kB) – v1.0, released 20 June 2011
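
To illustrate what exporting the same data in both comma- and tab-separated form involves, here is a minimal Python sketch. It is not the yTK PHP export code: the function name `csv_to_tsv` and the sample columns `from_user` and `text` are assumptions made for this example only.

```python
import csv
import io


def csv_to_tsv(csv_text: str) -> str:
    """Re-serialise comma-separated text as tab-separated text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Tweets may contain tabs or line breaks, which confuse simple
        # TSV consumers, so collapse every whitespace run (tabs, newlines,
        # repeated spaces) to a single space before writing.
        writer.writerow([" ".join(field.split()) for field in row])
    return out.getvalue()


# Hypothetical two-column sample; real exports carry more columns.
sample = 'from_user,text\nalice,"Hello, world #example"\n'
print(csv_to_tsv(sample))
```

Parsing with a CSV reader, rather than replacing commas with tabs, keeps commas inside quoted tweet text intact.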

Data Processing

Gawk – an open source, multiplatform, programmable command-line tool for processing CSV/TSV documents; essential for manipulating the datasets produced by our gathering tools.

We have developed a number of Gawk scripts for processing Twitter datasets in Twapperkeeper format. Many of the individual scripts are discussed on the blog; the current collection can be downloaded here: (23.7 kB) – v1.0, released 22 June 2011
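
As a rough idea of the kind of processing such scripts perform, the following sketch counts tweets per sender and @mentions per recipient. It is written in Python rather than Gawk and is not taken from our script collection. The column names `from_user` and `text` are assumed here, so check them against the header row of your own export.

```python
import csv
import io
import re
from collections import Counter

# Simplified pattern for @mentions; real usernames have stricter rules.
MENTION = re.compile(r"@(\w+)")


def tweets_per_user(rows):
    """Count tweets per sender, ignoring case in usernames."""
    return Counter(row["from_user"].lower() for row in rows)


def mentions(rows):
    """Count how often each username is @mentioned in the tweet text."""
    counts = Counter()
    for row in rows:
        for name in MENTION.findall(row["text"]):
            counts[name.lower()] += 1
    return counts


# Hypothetical sample data with only the two assumed columns.
SAMPLE = (
    "from_user,text\n"
    "alice,@bob have you seen this? #example\n"
    'bob,"RT @alice: @bob have you seen this? #example"\n'
    "alice,Another tweet #example\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(tweets_per_user(rows).most_common())
print(mentions(rows).most_common())
```

For a real dataset, replace the `io.StringIO(SAMPLE)` source with an open file handle on the exported CSV.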

Textual Analysis

Leximancer – commercial, multiplatform textual analysis tool: extracts key concepts from large corpora of text, examines and visualises concept co-occurrence

WordStat – commercial, PC-only textual analysis tool, part of a larger text statistics package: similar to but more powerful than Leximancer; generates concept co-occurrence data that can be exported in standard formats for subsequent visualisation
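
To make "concept co-occurrence data" more concrete, here is a deliberately naive Python sketch. It counts how often two words appear in the same tweet and writes the result as an edge list. This is not how Leximancer or WordStat identify concepts; the stopword list, function names and sample texts are all invented for illustration. The `Source`, `Target`, `Weight` header follows the column naming that, as far as we know, Gephi's spreadsheet import expects for edge tables.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "in", "of", "and", "to", "is"}


def cooccurrence(texts):
    """Count unordered word pairs that occur together in the same text."""
    pairs = Counter()
    for text in texts:
        terms = sorted(
            {w for w in text.lower().split()
             if w.isalpha() and w not in STOPWORDS}
        )
        # Sorting first means each pair is always stored as (a, b), a < b.
        pairs.update(combinations(terms, 2))
    return pairs


def to_edge_list(pairs):
    """Serialise pair counts as CSV with Source, Target, Weight columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), weight in sorted(pairs.items()):
        writer.writerow([a, b, weight])
    return out.getvalue()


# Hypothetical sample texts.
texts = ["flood warning in brisbane", "brisbane flood update", "warning update"]
pairs = cooccurrence(texts)
print(to_edge_list(pairs))
```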


Visualisation

Gephi – open source, multiplatform network visualisation tool: wide range of visualisation options, extensible plugin system, exports maps as PDF or SVG

Wordle – simple word cloud visualisation tool

Seadragon – handy tool for embedding large-scale images on a Web page; handles images, PDFs, SVGs, even URLs for Web pages…