Capture Methods, Tools, Twitter — Snurb, 21 June 2011
Switching from Twapperkeeper to yourTwapperkeeper

As those of you who are regular followers of our research might have gleaned already, a few months ago we started using yourTwapperkeeper to gather our Twitter data. yourTwapperkeeper is the open source version of the software that runs Twapperkeeper.com, which was one of the best tools for gathering Twitter data on selected #hashtags and keywords; sadly, Twitter’s move to a significantly stricter interpretation of its terms and conditions has made using Twapperkeeper itself all but impossible. (I won’t go into the details of that discussion here – the key issue is that the public Twapperkeeper Website enabled researchers to share the datasets they’d gathered using the site, which Twitter took exception to.)

Happily, yourTwapperkeeper is a perfectly workable replacement for Twapperkeeper itself – but it requires researchers to run their own instance of the tool on their own Web servers, and it should not be used for the public sharing of datasets. yourTwapperkeeper is available from the project’s Website at Google Projects; as we’ve found, however, it requires a few additional modifications before it can serve as a straight replacement for Twapperkeeper itself. In this post, I’m outlining the changes we’ve made – and I’m including the added and revised PHP files which are required to make them.

Out of the box, yourTwapperkeeper works almost as well as Twapperkeeper itself – and in some respects, better: it enables users to automatically archive tweets containing selected keywords or #hashtags from the Twitter streaming API, and (as it runs on the user’s own server) it’s less likely to miss tweets by running up against capacity limits than the shared, public Twapperkeeper service. So far, so good. What it doesn’t provide, on the other hand, is the CSV and TSV export functionality which our existing Gawk scripts require; nor do its available export formats match those we’re used to from Twapperkeeper itself. Also, its built-in export functions have severe problems with very large datasets (>1m tweets).

Our revised and added PHP scripts (available for download here) take care of this. First, we’re replacing archive.php – mainly to add links to two new export options, for comma- and tab-separated value files (CSV and TSV). In the process, we’re also changing the way yTK displays the tweets it has captured: it will now never show more than 100 tweets from an archive on screen – enough to check that the content looks fine, but not so many that they slow down the move to a very large export (previously, if you wanted to export upwards of 100,000 tweets, for example, you had to choose that filter option and wait until they’d all been displayed on screen before you could export them).
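
To give a sense of the archive.php change, here’s an illustrative sketch only – the variable names and the WHERE clause below are schematic placeholders, not the actual yTK code:

    <?php
    // Sketch of the revised display/export logic in archive.php; $keyword,
    // $from and $to are schematic stand-ins for yTK's own filter variables.

    $max_display = 100;  // never show more than 100 tweets on screen

    // on-screen preview: the LIMIT keeps the page responsive even when the
    // archive itself holds hundreds of thousands of tweets
    $sql = sprintf("SELECT * FROM tweets WHERE keyword = '%s'
                    ORDER BY time ASC LIMIT %d",
                   mysql_real_escape_string($keyword), $max_display);

    // the two new export links simply pass the current filter settings on to
    // the export scripts, which fetch the full (unlimited) dataset themselves
    $query = htmlspecialchars(http_build_query(
        array('keyword' => $keyword, 'from' => $from, 'to' => $to)));
    echo '<a href="csv.php?' . $query . '">Export as CSV</a> | ';
    echo '<a href="tsv.php?' . $query . '">Export as TSV</a>';
    ?>

The key point is simply that the LIMIT applies only to the on-screen preview; the export scripts themselves always work with the full dataset.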

Further, we’re adding csv.php and tsv.php, the two scripts which take care of exporting data in those formats – using a modified export function which won’t choke on very large datasets, as some of the other export scripts do. As part of the export, they also perform some minor data cleaning: first, they remove any linebreaks and other problematic characters in the tweets themselves; then, for TSV exports, TABs are converted to spaces, while for CSV exports, commas are replaced with a lookalike character which appears virtually identical, but doesn’t break Gawk or Excel when we try to process the resulting file. We’re also dropping the redundant ‘archivesource’ column from the export, to ensure that the column format matches that of Twapperkeeper exports. And, importantly (especially if you’re dealing with multilingual datasets), our exports now keep the original tweets’ UTF-8 character set intact!
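
In outline, the export logic looks something like the following – a simplified sketch rather than the scripts themselves: the WHERE clause is schematic, timeframe filtering and error handling are omitted, and the fullwidth comma ‘，’ (U+FF0C) is my stand-in example for the comma lookalike:

    <?php
    // Simplified sketch of the csv.php / tsv.php export approach; $keyword
    // is a schematic placeholder, and error handling is omitted for brevity.

    header('Content-Type: text/plain; charset=utf-8');  // keep UTF-8 intact

    function clean_field($text, $sep) {
        // strip linebreaks and other problematic whitespace within tweets
        $text = str_replace(array("\r\n", "\r", "\n"), ' ', $text);
        if ($sep == "\t") {
            $text = str_replace("\t", ' ', $text);  // TSV: TABs become spaces
        } else {
            // CSV: swap commas for a near-identical lookalike so the column
            // count survives Gawk and Excel (the fullwidth comma U+FF0C is
            // an assumption here, not necessarily what our scripts use)
            $text = str_replace(',', '，', $text);
        }
        return $text;
    }

    $sep = ',';  // use "\t" instead in tsv.php

    // an unbuffered query streams rows one at a time, so exports well beyond
    // a million tweets won't exhaust PHP's memory limit
    $result = mysql_unbuffered_query(sprintf(
        "SELECT * FROM tweets WHERE keyword = '%s' ORDER BY time ASC",
        mysql_real_escape_string($keyword)));

    while ($row = mysql_fetch_assoc($result)) {
        // drop the redundant column so the output matches Twapperkeeper's
        unset($row['archivesource']);
        $fields = array();
        foreach ($row as $field) {
            $fields[] = clean_field($field, $sep);
        }
        echo implode($sep, $fields) . "\n";
        flush();  // push each row out immediately rather than buffering it all
    }
    ?>

The combination of an unbuffered query and row-by-row flushing is what keeps memory use flat regardless of archive size.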

To modify your own installation of yTK, just copy those three scripts into the main directory (replacing the built-in archive.php); when you next go to your yTK site and look at one of your keyword archives, you’ll see the new export functions. Exporting an archive is simple: select what you want to export (timeframe, number of tweets, etc. – and you’ll usually want to make sure tweets are listed in ascending chronological order, too!), then right-click on the CSV or TSV export link and choose ‘Save As…’. CSV exports can be saved as filename.csv, TSV exports as filename.tsv – or (preferably – we’ll get to that) just save either as filename.txt.

If you’re using Gawk scripts to process your data, you’re now ready to work with those files (using the “gawk -F ','” or “gawk -F '\t'” options to specify CSV or TSV input as required, of course). Loading the data into Excel – while preserving the character set – is a little trickier. If your data doesn’t contain unusual characters, or if you’re not worried about losing them, you can just load the CSV or TSV files directly. Otherwise, follow these steps:

  1. Make sure your data file has a .txt ending; rename it if necessary.
  2. Open Excel.
  3. Choose ‘Open…’, and find the .txt file you want to open (you may need to tell Excel to display files of all types in the open dialogue).
  4. Excel will now present an import dialogue. Choose the following options (they may be named slightly differently in different Excel versions):
    1. Data type: “Delimited”.
    2. File origin: “65001: Unicode (UTF-8)”. Go to the next page.
    3. Delimiters: choose “Tab” or “Comma” as appropriate.
    4. Text qualifier: “{none}”. Go to the next page, then click “Finish”.

That should do it: your data should now have been imported into Excel with the character set preserved – if there were any Chinese, Japanese, or Arabic characters in the data, for example, they should still be readable in Excel. (As you save your imported file, though, make sure you save in an Excel format (.xls / .xlsx), not in CSV or TSV – otherwise those characters may still get lost during the save.)

Finally, a word on making sure your yTK installation keeps running: normally this shouldn’t be a problem, but if your server goes down for any reason (for example, for regularly scheduled backups or other maintenance), yTK will not come back up automatically – resulting in gaps in the data. In the first place, this means being vigilant and checking regularly that the yTK processes are still running, of course – but there’s also a simple way to start yTK automatically when the server restarts, by running a modified version of the startarchiving.php script after each reboot. I won’t go into details here, as the specifics depend on your server infrastructure – but my summary of how to do this is available from the yTK discussion group.
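
Purely as an illustration of the general idea (the details depend entirely on your setup – the PHP binary and installation paths below are placeholders, and startarchiving.php needs to be the modified, command-line-ready version), a cron @reboot entry on a typical Linux server might look like this:

    # crontab entry: relaunch the yTK archiving process after every reboot
    # (binary and installation paths are placeholders for your own setup)
    @reboot /usr/bin/php /path/to/yourtwapperkeeper/startarchiving.php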

About the Author

Dr Axel Bruns leads the QUT Social Media Research Group. He is an ARC Future Fellow and Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia. Bruns is the author of Blogs, Wikipedia, Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative Online News Production (2005), and a co-editor of Twitter and Society, A Companion to New Media Dynamics and Uses of Blogs (2006). He is a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation. His research Website is at snurb.info, and he tweets as @snurb_dot_info.


(5) Readers' Comments

  1. thanks for sharing!

  2. Hi Gabriel,

    No worries. Keep an eye out for our Gawk scripts, too – coming later today…

    Axel

  3. Pingback: Mapping Online Publics » Blog Archive » Gawk Scripts for Processing Twitter Data, Vol. 1

  4. Pingback: Mapping Online Publics » Blog Archive » Taking Twitter Metrics to a New Level (Part 1)

  5. Pingback: Mapping Online Publics » Blog Archive » Twapperkeeper and Beyond: A Reminder