Methods | Politics | Processing | Twitter | Visualisation — Snurb, 28 July 2010

Continuing my series of posts on methods for doing quantitative research using Twitter data, this will be a fairly tentative post. I’m currently looking into ways to examine the terms and concepts used by tweeters as they discuss specific issues; we’ve done similar work looking at the content of blog-based debates in the past, using the (commercial) concept mapping software Leximancer, but I’ve never been fully satisfied with the information generated by Leximancer, and especially with its data visualisation functionality, so it’s time to look at the alternatives.

Ideally, I’d like to leave the visualisation aspects to the open source software Gephi, which I’ve already used for some useful network visualisations (more on that in another post), so what I’m really after is software that produces word and concept co-occurrence data for my source texts (in this case, a database of tweets on a specific subject), and pushes this out in a format that Gephi can understand (e.g. UCINet or Pajek, or even Gephi's own network data format). At the ICA conference in Singapore last month, I came across a (commercial, sadly) quantitative text analysis software called WordStat - part of a larger software package available from Provalis Research that includes various other statistical tools which are less relevant for me here - so that's where I'll start.

(I could probably develop a Gawk script that tabulates word co-occurrences in a database of tweets, based on the simple word frequency counting script I used for my last post here, but then I’d also have to develop the functions to remove any meaningless words - ‘and’, ‘or’, etc. - and that wheel has been invented elsewhere already. That said, I’d love to hear about open source alternatives to WordStat if anybody knows of any!)
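Just to sketch what such a script might look like - purely as a rough, illustrative outline rather than anything polished (it assumes one tweet's text per line on standard input, and its hard-coded stop word list is far too short for real use):

# cooccur.awk - rough sketch: tabulate word co-occurrences within tweets
# (expects one tweet's text per line on standard input)

BEGIN {
	# deliberately tiny, illustrative stop word list - a real one would be far longer
	split("the a an and or of to in is it for on at rt", list, " ")
	for (i in list) stop[list[i]] = 1
}

{
	# lowercase, strip punctuation, and split the tweet into words
	line = tolower($0)
	gsub(/[^a-z0-9#@ ]/, " ", line)
	n = split(line, w, /[ \t]+/)

	# collect the distinct non-stop words in this tweet
	delete seen
	m = 0
	for (i = 1; i <= n; i++) {
		if (w[i] != "" && !(w[i] in stop) && !(w[i] in seen)) {
			seen[w[i]] = 1
			words[++m] = w[i]
		}
	}

	# count every unordered pair of distinct words in the tweet
	for (i = 1; i < m; i++) {
		for (j = i + 1; j <= m; j++) {
			pair = (words[i] < words[j]) ? words[i] " " words[j] : words[j] " " words[i]
			count[pair]++
		}
	}
}

END {
	# print 'word1 word2 count' triples
	for (pair in count) print pair, count[pair]
}

Run as gawk -f cooccur.awk tweets.txt | sort -k3 -rn, this would list word pairs with the most frequent co-occurrences first - but as I say, the wheel exists already, so let's see what WordStat offers.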

So, let’s work through this problem. I’m starting, again, with a Twapperkeeper archive of tweets tagged with the ‘#debate’ hashtag, and from this I’ve selected a 24-hour slice covering Sunday 25 July 2010 – the day of the televised leaders’ debate in the 2010 Australian federal election (I’ve already posted a basic word frequency analysis for tweets during the 5 p.m. to midnight timeframe that day in another post). I’ve converted the Twapperkeeper comma-separated values file into an Excel file for import into Provalis’s QDA Miner software (which doesn’t deal too well with Twapperkeeper CSV files, it seems), and imported it by creating a new project from the existing Excel data file.

Update:

Turns out the best way to import Twapperkeeper data into QDA Miner / WordStat is in tab-separated values format (.tsv), rather than Excel or CSV. So, here’s a quick Gawk script that turns CSVs into TSVs:

      
# c2t.awk - convert comma-separated to tab-separated
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au
#
# Note: this is a naive conversion which doesn't handle quoted fields
# that themselves contain commas.

BEGIN {
	FS = ","	# split input records on commas
	OFS = "\t"	# join output fields with tabs
}

{
	# rebuild each record field by field, so that commas become tabs
	output = $1
	for (i = 2; i <= NF; i++) {
		output = output OFS $i
	}
	print output
}
        

(End update.)
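(To run the script: assuming Gawk is installed, and with debate.csv standing in as an example filename for whichever archive you’ve downloaded, the conversion is simply gawk -f c2t.awk debate.csv > debate.tsv.)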

For the purposes of this first exploration, I don't really need to do anything much in QDA Miner itself - it's simply a means to prepare my data for WordStat, which cannot be run separately from it. So, for now, in QDA Miner I'm simply choosing the 'Analyse > Content Analysis' menu option, and selecting the 'text' field which holds the content of the tweets themselves. After clicking through the dialogue window that follows, WordStat eventually opens.

The first step for dealing with my tweet data is to remove some common but meaningless terms (standard English words such as 'and' and 'or', etc.), to condense all plurals to the singular ('boat' and 'boats' are treated as equal), to remove contracted forms (e.g. 'I've'), and so on. Thankfully, WordStat has a few handy functions for this - on the Dictionaries tab, I've switched on English exclusions and English lemmatisation to get rid of most of these words. (There's also the 'Porter Stemmer', which goes even further by reducing words to their stem - for example, 'debate', 'debates', and 'debating' all become 'debat' - but that's probably overkill for our purposes.) Next, on the Options tab, I've selected 'Disk' as my working space (WordStat seems to have a habit of running out of memory with larger datasets...), and on the 'Speller / Thesaurus' subtab I've also deselected 'Ignore words in uppercase' so that 'shouted' tweets remain in the data (other than that, WordStat is case-insensitive).

Clicking on the Frequencies tab finally starts the analysis - which may take a while, depending on the dataset. The first step is to change the order of the resulting keyword list to 'Frequency', to see a first result. Here, I'm removing a few further terms - 'debate' (tweets were selected because of the hashtag '#debate', so of course the word is present in 100% of all cases), 'ausvotes', and 'RT', to begin with - by right-clicking on them and choosing 'to exclusion list'. Selecting the first 30-odd terms and using the in-built chart function produces a handy little graph:

[Figure: chart of the 30-odd most frequent keywords]

Keywords are nice - but multi-word phrases are even nicer. So, on to the 'Phrase Finder' tab: this lists the most frequently occurring combinations of keywords (no chart function this time, but selecting multiple rows and right-clicking on them allows us to copy the table, at least):

PHRASE                            FREQUENCY
FAIR DINKUM                             667
TONY ABBOTT                             556
JULIA GILLARD                           406
MOVE FORWARD                            273
MASTERCHEF CALL IT MASTERBATE           191
BOB BROWN                               185
CLIMATE CHANGE                          182
INTERNET FILTER                         181
BIGGEST WANKER                          181
BIGGEST WANKER WIN                      178
MASTERBATE AND MAY THE BIGGEST          177
STOP THE BOAT                           165
AUSTRALIANS LET A WELL WOMAN            128
WOMAN AND AN ENGLISH MAN                128
SCARED OF IMMIGRANT                     128
PARENTAL LEAVE                          109
PRIME MINISTER                           99
TONY ABBOT                               98
MALE WORM                                94
BOAT PEOPLE                              94
KEVIN RUDD                               88
EAR LOBE                                 87
ANSWER THE QUESTION                      86
FEMALE WORM                              84
PAID PARENTAL LEAVE                      83
ASYLUM SEEKER                            81
BLUE WORM                                77
POSITION ON THE INTERNET FILTER          74
LEADER TO CLARIFY THEIR POSITION         73
BLAH BLAH                                73


   
Obviously, there's some overlap here - a number of these phrases all relate to the same much-retweeted tweet 'RT @GinaMilicia: Should just combine the #debate with #masterchef call it masterbate and may the biggest wanker win #ausvotes' - but other than that, this gives us a pretty good impression of Twitter's views on the night: it's 'fair dinkum', well ahead of everything else, including 'moving forward'...

In fact, WordStat also provides a tool for checking potential overlaps with other phrases - 'fair dinkum', for example, appeared 32 times as 'stop saying fair dinkum', 14 times as 'fair dinkum Tony' (or in the combination of both, 'stop saying fair dinkum Tony'), and 9 times in 'fair dinkum count' (as in, 'Anyone have a fair dinkum count going?').
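As an aside, and in the same rough Gawk spirit as the sketch earlier in this post: WordStat's Phrase Finder obviously does much more than this (it handles longer phrases, and it respects the exclusion and lemmatisation settings), but a crude count of adjacent two-word phrases could be sketched as follows, again assuming one tweet's text per line on standard input:

# bigrams.awk - rough sketch: count adjacent two-word phrases per tweet
{
	# lowercase, strip punctuation, and split into words
	line = tolower($0)
	gsub(/[^a-z0-9#@ ]/, " ", line)
	n = split(line, w, /[ \t]+/)

	# count each adjacent pair of words
	for (i = 1; i < n; i++) {
		if (w[i] != "" && w[i+1] != "") {
			count[w[i] " " w[i+1]]++
		}
	}
}
END {
	for (phrase in count) print count[phrase], phrase
}

Piped through sort -rn, that would produce a rough approximation of a two-word phrase list like the one above.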

And the phrase finder also gives us the opportunity to create a dendrogram - a hierarchy of relations between the phrases. This is important for later - from what I can see, it seems that you must create a phrase dendrogram before you can create a word dendrogram, and it is only from the word dendrogram, finally, that you're able to export the word co-occurrence network data for plotting in Gephi. (But perhaps I'm missing an easier path here - any advice welcome!)

Anyway - for what it's worth, here's a dendrogram of the most common phrases, grouping them by co-occurrence. Not very exciting, but there it is. Note that the pre-processing settings for the phrase dendrogram are also silently adopted for the word frequencies tab - so if you choose to create a phrase dendrogram only for those words which appear at least 300 times in the dataset, then next time you go to the word frequencies tab, it, too, is filtered to show only those words appearing at least 300 times! (These settings can be changed again in the options tab.)

[Figure: dendrogram of the most common phrases, grouped by co-occurrence]

Going back to the word frequencies tab, the dendrogram option there is now also enabled. After adjusting the bottom frequency cut-off as necessary in the options tab, I've now created a dendrogram for the words themselves. The most interesting observation here, perhaps, is how few keywords really address any policy issues...

[Figure: dendrogram of the most frequent keywords]

But those dendrograms themselves are really only a means to an end. It's from the 'Statistics' tab in the dendrogram window that we're finally able to export our keyword co-occurrence data for visualisation in Gephi. The co-occurrence matrix which the tab shows by default can be exported in a range of formats - and I'm choosing Pajek here since I know that Gephi can import it.
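For reference, the Pajek .net format is a very simple plain text format: a numbered list of vertices, followed by a list of weighted edges. The keywords and weights below are made up purely for illustration, not taken from the actual export:

*Vertices 4
1 "abbott"
2 "gillard"
3 "worm"
4 "boat"
*Edges
1 2 347
1 3 120
2 3 115
1 4 61

Each edge line names the two vertex numbers it connects and the weight of that connection (here, the number of co-occurrences), which is what Gephi reads in when it imports the file.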

So, in Gephi, we'll open the exported Pajek .net file (choosing 'undirected' as the network type - we're mapping keyword co-occurrences here, so there's no directionality), start the standard 'Force Atlas' layout algorithm, and change a few of the layout settings until we have a reasonable visualisation of the network of co-occurrences. The result? Well, it looks something like this:

[Figure: Gephi visualisation of the keyword co-occurrence network]
Again, perhaps not a great deal of really exciting news here - the two leaders' names are obviously central, as is the worm. As the only policy-related term, 'boat(s)' is relatively prominent (but 'people' can have various meanings, so don't overestimate its appearance); other terms - like 'climate' and 'change' - remain fairly unimportant.

Note, by the way, that in this graph, the node size for each keyword is determined by its degree (the total number of co-occurrences it has with other keywords, counting repeated co-occurrences), while the distribution graph (the first graph in this post) simply shows each word's frequency (how often it appears). This is why in the network graph, a keyword node like 'boat(s)' appears larger than both 'fair' and 'dinkum': 'boat' simply occurred together with more other terms.

(To put it simply: if someone had simply tweeted 'Abbott Abbott Abbott' all the way through the debate, the keyword 'Abbott' would have racked up a massive frequency count - but its degree count from those tweets would still be zero, since it's only occurring together with itself, and not with other keywords. Here, we could hypothesise that 'fair' and 'dinkum' occur frequently together, but with a smaller range of other terms than 'boat(s)'.)
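To make the distinction concrete in terms of the earlier Gawk sketch (which prints 'word1 word2 count' triples), that weighted degree would simply be the sum of the counts across all pairs a keyword appears in - something like:

# degree.awk - rough sketch: weighted degree per keyword,
# reading 'word1 word2 count' triples as produced by cooccur.awk
{
	degree[$1] += $3
	degree[$2] += $3
}
END {
	for (word in degree) print degree[word], word
}

The simple frequency count, by contrast, just records how often each word appears, regardless of what it appears with.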

Phew! OK then - so much for now. Looks like we have a more or less workable way to generate co-occurrence data and export it for visualisation. Still not convinced about WordStat, though - there must be easier ways?

About the Author

Dr Axel Bruns leads the QUT Social Media Research Group. He is an ARC Future Fellow and Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia. Bruns is the author of Blogs, Wikipedia, Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative Online News Production (2005), and a co-editor of Twitter and Society, A Companion to New Media Dynamics and Uses of Blogs (2006). He is a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation. His research Website is at snurb.info, and he tweets as @snurb_dot_info.


(4) Readers' Comments

  1. Pingback: Mapping Online Publics » Blog Archive » Visualising topic-based conversation networks: the #masterchef edition

  2. Pingback: Mapping Online Publics » Blog Archive » Twitter’s Response to Gillard (and Abbott) on Q&A

  3. Interesting post, will have to have a proper read when I get some time.

    re alternatives — have you thought about using the ‘tm’ package for the open source R? I’ve used both WordStat and R (the benefit of the former being that it comes in a nice package, while the latter is completely customisable).

    tm: http://cran.r-project.org/web/packages/tm/index.html
    R: http://www.r-project.org/

  4. Pingback: Mapping Online Publics » Blog Archive » Twitter’s Reaction to #twitdef