Methods Processing Twitter — Snurb, 20 October 2010
Dynamic Networks in Gephi: From Twapperkeeper to GEXF

In between last week’s ECREA conference in Hamburg, where we presented some of our methodologies and early outcomes from the Mapping Online Publics project, and the AoIR conference in Gothenburg, where we’ll talk some more about tracking and mapping interaction in online social networks, I wanted to finally follow up on Jean’s teaser post of a dynamic animation of Twitter @reply activity from a couple of weeks ago. This animation of network activity over time has become possible with the release of the latest beta version of Gephi, the open source network visualisation software, which now includes support for time-based data – and on the flight over to Europe as well as in between conferences and workshops, I’ve made some first steps towards building the tools to prepare our Twitter data for such dynamic visualisations.

First, though, I need to stress that the video which we posted a little while ago was only a very preliminary attempt; in the meantime, and with considerable and speedy support from the Gephi team (thanks, guys!), we’ve managed to improve our methods significantly. In the following, I’ll explain what our current approach looks like; a little further down the track, we’ll also post another animation of the results.

So, to begin with, we’re once again starting with a dataset from Twapperkeeper, which tracks user activities around Twitter #hashtags. Within these data, the network which we can identify and visualise consists of the @replies between participating users (and there’s a choice to be made here about whether old-style retweets – i.e. “RT @username …” – should be included or not); I’ve posted previously on our methodology for extracting static @reply networks. What we need to do now is extract the @reply information from the Twapperkeeper data in such a way as to make it usable for dynamic visualisation in Gephi.

The first step in this is relatively simple, and builds on what we’ve done before. Using our trusty CSV processing tool Gawk, we’ll extract all @replies and their (Unix) timestamps – but in preparation for what we’ll be doing later, we’ll already compile all @replies from one specific user to another, and list the timestamps of those @replies. The Gawk script below, then, generates a CSV with lines that look something like this:

user1, user2, timestamp1;timestamp2;timestamp3;etc.

Note in this that @replies are a form of directed edge: user1 may tweet at user2, but that doesn’t mean that user2 necessarily replies. So, the resulting CSV file may or may not contain a corresponding line

user2, user1, timestamp4;timestamp5;etc.

Here’s the script, which also takes an optional parameter offset that is subtracted from all timestamps, just to reduce those very large numbers in the Unix timestamps to something a little more manageable. If no offset parameter is given, the script simply takes the timestamp of the first tweet in the dataset as the value for offset (in other words, the first tweet now happens at zero seconds).

UPDATE: The first version of this script didn’t make any attempt to combine edges between users occurring with different capitalisation (for example, username1 -> username2 should be identical to UserName1 -> UserName2, since Twitter is case-insensitive). This led to false duplicates in Gephi visualisations (username1 and UserName1 would appear as separate nodes). The revised version below unifies usernames, preferring whichever variant appears first in a sorted list (usually the capitalised version, e.g. UserName1 over username1).

      
# preparegexfattimeintervals.awk - Extract @replies for network visualisation
#
# this script takes a CSV archive of tweets, and reworks it into network data for visualisation
# data includes the timestamp of the tweet - multiple tweets between two users are concatenated with a semicolon (starttime1;starttime2;starttime3;...)
#
# expected data format:
# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
#
# output format:
# source,target,starttimes
#
# script takes an optional offset argument, which is subtracted from the timestamp
# if offset is not set on the command line, the first tweet's timestamp is used instead
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au

BEGIN {
	getline 									# populate the FILENAME variable

	print "source,target,starttimes"

	## first, build list of unique usernames in standard capitalisation style
	## (to avoid duplicates in different styles, e.g. @UserName vs. @username)

	getline < FILENAME								# skip header row

	i=1
	while(getline < FILENAME) {							# start first parse of input file
		nodes[i] = $3								# add any active tweeters
		i++

		if(starttime == 0) starttime = $13
		endtime = $13

		a=0
		do {									# add all @reply recipients
			match(substr($1, a),/@([A-Za-z0-9_]+)?/,atArray)
			a=a+atArray[1, "start"]+atArray[1, "length"]
			if (atArray[1] != 0) {
				nodes[i] = atArray[1]
				i++
			}
		} while(atArray[1, "start"] != 0)
	}
	close(FILENAME)

	n = asort(nodes, sorted)							# sort and remove duplicates
	j = 1

	for (i=2; i<=n; i++) {
		if(tolower(sorted[i]) == tolower(sorted[j])) { 
			delete sorted[i]
		} else { 
			j=i 
		}
	}

	n = asort(sorted,sortednodes)						# sort remaining list

}

/@([A-Za-z0-9_]+)/ {									# main loop: match any line including @reply
	if(!offset) offset = $13							# if not provided as script argument, set offset to first occurring tweet timecode

	a=0
	do {										# repeat for each @reply in the line (i.e. execute twice for '@user1 @user2 tweet content')
		match(substr($1, a),/@([A-Za-z0-9_]+)?/,atArray)
		a=a+atArray[1, "start"]+atArray[1, "length"]

		if (atArray[1] != 0) {
			for(i in sortednodes) {					# match source and target users against sorted master list of participating users in standard capitalisation
				if(tolower(sortednodes[i]) == tolower($3)) { source = sortednodes[i] }
				if(tolower(sortednodes[i]) == tolower(atArray[1])) { target = sortednodes[i] }
			}

			if(!edge[source "," target]) {
				edge[source "," target] = $13 - offset		# create new edge[source,target] array entry, or
			} else {
				edge[source "," target] = edge[source "," target] ";" $13 - offset       # add to existing entry
			}
		}

	} while(atArray[1, "start"] != 0)

}

END {
	for(i in edge) {
		print i "," edge[i]
	}
}

        

So, this preparegexfattimeintervals.awk script, which should be called with the command line

gawk -F , -f preparegexfattimeintervals.awk offset=offset twapperkeeper.csv >networkedges.csv

simply lists all edges, and the (potentially multiple) timestamps at which they're created.

If we wanted simply to create a static network visualisation over the entire period covered by the dataset, we'd be nearly done now - we could simply count the number of timestamps attached to each edge, and treat this as the edge weight: a connection between two users which was made only once (i.e. an edge with only one timestamp) would get a weight of 1, while a repeated connection between two users (i.e. multiple @replies from one user to another, at different times) would get a higher weighting and signify a more intensive, closer association between these two users.
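
For that static case, a tiny Gawk sketch along the following lines would do the counting. This is just an illustration rather than part of our toolkit (the filename weightededges.awk is made up), and it assumes the networkedges.csv output of the script above:

# weightededges.awk - illustrative sketch only: count the timestamps attached to each edge,
# and use that count as a static edge weight
# usage: gawk -F , -f weightededges.awk networkedges.csv >weightededges.csv

BEGIN { print "source,target,weight" }

NR > 1 { print $1 "," $2 "," split($3, times, ";") }		# split() returns the number of timestamps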

However, that's not quite good enough for a dynamic visualisation. Here, we need to define when an edge (i.e. a connection from one user to another) comes into existence, and when it disappears again. The former is easy: we know exactly the timestamp at which each @reply was made, and the point of the previous script was precisely to create - for each pair of users between whom an @reply has been exchanged - a list of the timestamps at which those connections were made.

Now, however, comes the tricky bit: we need to come up with an expiry time. After all, one user sending an @reply to another doesn't mean that those users remain connected forever - far from it: the target user may not see the @reply at all, may not care, may quickly forget again that they received it. However, repeated @replies from one user to another would make the originating user more and more visible (and perhaps memorable) to the recipient, and so we can assume a stronger connection - even more so, of course, if those @replies from user A to user B are reciprocated by user B.

By analogy, the principle is a little similar to the way old-fashioned cathode ray tube TV screens worked: a single electron hitting the screen doesn't create much more than a brief blip which fades away again quickly - but a steady ray of electrons hitting the same spot creates a more permanent image. In essence, what we've got to work out is how soon, on Twitter, the light fades away again; we're looking for the decay time of a tweet.

Now, very clearly, there is no easy answer to this, and all we can come up with are some approximate values. Indeed, we could make a credible argument that the decay time depends on the receiving user: someone who gets a lot of @replies within a short period of time might remember each individual one only for a very limited time, and might consider only those users as close connections who @reply very frequently; someone who rarely ever gets @replies might value them considerably more.

But let's not overpsychologise these questions, either. Let's simply assume, for the moment, that a connection made by a single @reply lasts for an hour (3600 seconds), for example, and see what outcomes that produces; we can try other values (say, half an hour, i.e. 1800 seconds) later.

If we apply one of those 'decay times' to our data, then, we're almost ready to start our dynamic visualisation. We now have a start and an end time for each of our @reply edges between two users in the network. Take the example above:

user1, user2, timestamp1;timestamp2;timestamp3;etc.

This would mean that the directed edge from user1 to user2 should be visible in the network from timestamp1 to timestamp1 + 3600s, from timestamp2 to timestamp2 + 3600s, and from timestamp3 to timestamp3 + 3600s. In Gephi's internal timeline notation, this is relatively easy to express:

<[timestamp1, timestamp1 + 3600]; [timestamp2, timestamp2 + 3600]; [timestamp3, timestamp3 + 3600]>

However, what happens if there's an overlap between these periods: if timestamp2 is a time which sits within the time during which the tweet made at timestamp1 is still 'active' (that is, has not expired yet)?

Imagine, for example, that the first tweet is made at 0s, and therefore expires at 3600s, and the second tweet is made half an hour later, at 1800s, and thus expires at 5400s: this would mean that between 0s and 1800s one @reply tweet is active (the connection from user1 to user2 should have an edge weight of 1), between 1800s and 3600s two @reply tweets are active (the connection should have an edge weight of 2), and between 3600s and 5400s only the second tweet remains active (the edge weight drops down to 1 again). In reality, of course, things may be considerably more complex still, with multiple overlapping tweets active at the same time.

Here we're getting to the limit of what can be usefully expressed and imported into Gephi in CSV format. There is a way to express dynamically changing edge weights in Gephi notation - for the example above, it would be something like

<[0, 1800, 1.0]; [1800, 3600, 2.0]; [3600, 5400, 1.0]>

- but there doesn't seem to be a way to import such data into Gephi in CSV format, even if it is correctly formatted; all edge weights are reset to 1.

So, we need to switch to the XML-based GEXF format for network description now. GEXF allows us to define 'time slices' during which the weight of an edge can be set to a distinct value. What we need is a script that takes the CSV output of our previous script (a list of directed edges, with their specific start times), and - for any given 'decay time' - calculates the time slices and edge weights that apply.

To do so requires some pretty funky maths, as - depending on the decay time we're using - any number of tweets could be overlapping in various ways at any one point. I think the following Gawk script does the trick - but I'd be very grateful for independent verification and improvement suggestions. Use at your own risk!

      
# gexfattimeintervals.awk - Process CSV of network edges and start times for network visualisation
#
# this script takes a CSV of network edges with attached start times (multiples allowed), and converts it into a GEXF for visualisation
#
# expected data format:
# source,target,starttime[;starttime2[;starttime3[;...]]]
#
# output format:
# GEXF network file. Weights of multiple edges whose timeframes overlap are increased.
#
# script takes an optional decaytime argument, which sets the expiry time of each edge - edge starting at time t expires at t + decaytime
# if decaytime is not set on the command line, decaytime is set to 100
#
# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au


BEGIN {
	getline										# skip the header row of the input CSV
	print "<gexf xmlns=\"http://www.gexf.net/1.1draft\"\n       xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"\n       xsi:schemaLocation=\"http://www.gexf.net/1.1draft\n                             http://www.gexf.net/1.1draft/gexf.xsd\"\n      version=\"1.1\">\n  <graph mode=\"dynamic\" defaultedgetype=\"directed\">\n  	<attributes class=\"edge\" mode=\"dynamic\">\n	  <attribute id=\"weight\" title=\"Weight\" type=\"float\"/>\n	</attributes>\n	<edges>"

	if(!decaytime) decaytime = 100
}

{
	max = split($3,time,";")

	for(i = 1; i <= max; i++) {
		edgestart[i] = time[i]
		edgeend[i] = time[i] + decaytime
		edgeweight[i] = 1
	}

	if(max > 1) {
		for(i = 2; i <= max; i++) {
			for(j = 1; j <= max; j++) {
				if((edgestart[i] != edgeend[i]) && (edgestart[j] <= edgestart[i]) && (edgeend[j] > edgestart[i]) && (j != i)) {
					edgestart[max+1] = edgestart[i]
					edgeend[max+1] = edgeend[i] < edgeend[j] ? edgeend[i] : edgeend[j]
					edgeweight[max+1] = edgeweight[i] + edgeweight[j]

# two possibilities here: a) edge i starts inside the edge j timeframe, but finishes later; b) edge i timeframe is completely within edge j timeframe

					if(edgeend[j] > edgeend[i]) {
						edgeend[i] = edgeend[j]
						edgeweight[i] = edgeweight[j]
					}
					edgeend[j] = edgestart[max+1]
					edgestart[i] = edgeend[max+1]
					max++
				}
			}
		}
	}
	
	interval = weight = slice = ""
	start = edgestart[1]
	end = edgestart[1]

	for(i = 1; i <= max; i++) {
		if(edgestart[i] != edgeend[i]) {
			if(edgestart[i] < start) start = edgestart[i]
			if(edgeend[i] > end) end = edgeend[i]

			slice = slice "				<slice start=\"" edgestart[i] "\" end=\"" edgeend[i] "\" />\n"
			weight = weight "				<attvalue for=\"weight\" value=\"" edgeweight[i] "\" start=\"" edgestart[i] "\" end=\"" edgeend[i] "\"/>\n"

		}
	}

	print "		<edge source=\"" $1 "\" target=\"" $2 "\" start=\"" start+0 "\" end=\"" end "\" weight=\"0\">"
	print "			<attvalues>"
	print weight
	print "			</attvalues>"
	print "			<slices>"
	print slice
	print "			</slices>"
	print "		</edge>"
}

END {

	print "    </edges>\n  </graph>\n</gexf>"

}

        

This script, too, takes an optional argument, decaytime, which sets the period during which edges in the network are understood to be active. If the decaytime argument is not provided, decaytime is arbitrarily set to 100. So, in our example above, where tweets are thought to expire after one hour (3600s), you'd call up the script using the following command line:

gawk -F , -f gexfattimeintervals.awk decaytime=3600 networkedges.csv >replynetwork.gexf

Incidentally, the script only generates GEXF data for the network edges; it does not define the nodes (i.e. the Twitter users in the network) in the resulting GEXF file. Happily, Gephi is flexible enough to automatically create any unknown nodes which it finds in the data it imports, which allows us to keep the script manageable.
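
If you did want to declare the nodes explicitly all the same - to attach labels, for instance - GEXF lets you add a <nodes> block just before the <edges> element; a minimal hand-written example (the usernames here are placeholders) might look something like this:

		<nodes>
			<node id="user1" label="user1" />
			<node id="user2" label="user2" />
		</nodes>

The node ids need to match the source and target values used in the edges - in our case, the Twitter usernames.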

The key content of the GEXF file will look something like this now:

      
		<edge source="user1" target="user2" start="0" end="5400" weight="0">
			<attvalues>
				<attvalue for="weight" value="1" start="0" end="1800"/>
				<attvalue for="weight" value="2" start="1800" end="3600"/>
				<attvalue for="weight" value="1" start="3600" end="5400"/>
			</attvalues>
			<slices>
				<slice start="0" end="1800" />
				<slice start="1800" end="3600" />
				<slice start="3600" end="5400" />
			</slices>
		</edge>

        

In other words, this both defines the time slices during which the edge is visible, and the edge weight it has during those times. This, then, we can now import into Gephi, which on import will alert us that it's created all those nodes missing from the GEXF file (if the 'create missing nodes' import option was ticked).

If - like us - you're dealing with a network consisting of a large number of nodes, the first step before getting to any dynamic visualisation might actually be thinning out the network, though - applying a global filter that hides any nodes with an indegree below a specific number, then selecting all of the nodes which have remained visible and copying them to a new workspace (select > right-click > copy to new workspace; then switch to the new workspace). Note that Gephi can occasionally get a little confused about its data at this stage - so it's a good idea to export (not save) the filtered network dataset as a new GEXF file at this point, close the current project in Gephi, and reload the filtered GEXF from scratch.

From here, all we need to do is switch on the timeline filter (the big 'play' button at the bottom of the Gephi screen), and use the sliders to zoom in on specific timeframes in the overall network dataset. In terms of visualising these data, then, there are two major options here:

  1. To run a network visualisation algorithm over the entire dataset until the network settles down, and then switch it off, before selecting specific timeframes: this maps the overall, longer-term network structure, and then (when different timeframes are selected) shows which sections of the network are most active at any one point. In essence, this gives us a sense of the longer-term interconnections between users, which (depending on the overall period covered by the entire dataset) may be more or less permanent, and indicates which clusters in the network converse most actively at any given time.

      
  2. To run a network visualisation algorithm and keep it switched on while changing the timeframe selection sliders: this shows how individual nodes (users) join and leave the @reply network. In essence, this shows us only the current conversational network, without taking into account any more permanent structures of interconnection.


So, there's our methodology for dynamic network visualisations as it stands right now. As we find time to process our existing Twapperkeeper data, we'll post up some more animations of these dynamic @reply networks (and perhaps do the same for our necessarily more slowly moving blog interlinkage data as well) - but perhaps this methods post might also enable some of you to do some dynamic network mapping of your own.

As always, any feedback would be much appreciated!

About the Author

Dr Axel Bruns leads the QUT Social Media Research Group. He is an ARC Future Fellow and Professor in the Creative Industries Faculty at Queensland University of Technology in Brisbane, Australia. Bruns is the author of Blogs, Wikipedia, Second Life and Beyond: From Production to Produsage (2008) and Gatewatching: Collaborative Online News Production (2005), and a co-editor of Twitter and Society, A Companion to New Media Dynamics and Uses of Blogs (2006). He is a Chief Investigator in the ARC Centre of Excellence for Creative Industries and Innovation. His research Website is at snurb.info, and he tweets as @snurb_dot_info.

Readers' Comments (24)

  1. Great, thanks for posting this. I’ll work on an implementation based on R (instead of Awk) to extend my script for static networks (see http://blog.ynada.com). R’s igraph extension allows saving directly to formats like GraphML, but since it doesn’t support GEXF at this point I might have to write the graph file manually. Still, it should be a lot easier to implement using your code.

  2. Pingback: Mapping Online Publics » Blog Archive » Visualising Twitter Dynamics in Gephi, Part 2

  3. Pingback: Dynamic Twitter graphs with R and Gephi (clip and code)

  4. Thanks for this post! It worked until I ran the gexfattimeintervals.awk script. I get the following error after running the command

    gawk -F , -f gexfattimeintervals.awk decaytime=3600 networkedges.csv >replynetwork.gexf

    gawk: -F
    gawk: ^ invalid char ‘?’ in expression

    any ideas?

  5. Yep – looks like you’re using an em-dash rather than just a regular hyphen there…

    Axel

  6. Oh! Now I see :) It is always the smallest errors that are so easily overlooked.

    Thanks! Having the most fun here experimenting with this.

  7. I have been trying to do this, and got it to work with the YourTwapperKeeper tool. But now I want to do this with my own archive of tweets, scraped via the Twitter API, because this gives me more flexibility in terms of scraping people’s timelines. I’ve created a MySQL database with the exact same structure (text,to_user_id,from_user,id,…. and so forth).

    I have exported these tweets (via phpMyAdmin) to a comma-separated CSV file, but when processing it with the Gawk script I got an empty ‘networkedges.csv’. I have also tried a different method, exporting the tweets from the database to an Excel file and then saving the file in Microsoft Excel (2008 for Mac) as a comma-separated CSV file. But when processing that CSV file with Gawk, it again resulted in an empty networkedges.csv.

    Any ideas on how best to create a Gawk-compatible CSV file without using TwapperKeeper? What settings should I use when creating CSV files in Microsoft Excel?

    Best wishes,
    Ryanne

    • Hi Ryanne,

      this is difficult to diagnose without knowing the exact structure of your original data – but if the columns are correct (we need the text, from_user, and timestamp columns only), this should work. They need to be in the correct place in the CSV, too – text in column one, from_user in column three, timestamp in column 13, if I remember correctly!

      If you’re using a Mac, you have the added problem that Macs use non-standard carriage return codes which Gawk has trouble comprehending. There’s a way to export CSVs in MS-DOS format from Mac Excel, I think – can’t remember the exact settings, though.

      Hope that helps!

      Axel

  8. Pingback: Twitter dynamics visualisation in Gephi | rlturenhout

  9. Hi Axel,

    Thanks for your comment, it pointed me in the right direction. Gawk is picky about commas and foreign-language characters! True, it is difficult to help without knowing the full specs and steps; I just couldn’t find an e-mail address quickly. I really want to do research with this, but I’ll just stick with yourtwapperkeeper if it takes too much time to get it to work without it.

    I got it to work with the Gawk scripts (it took lots of steps to get to this!), only now Gephi won’t load the replynetwork.gexf file :S On the surface there is nothing wrong with the GEXF file – no empty sources or targets or anything I can detect. I’ll post a message on the Gephi forum about this. It makes no sense, and Gephi does not give any error messages. The files that I used are at my website URL (zip file).

    I’ll document all steps of course and write a blog post on how to do this without Twapperkeeper.

    Best wishes,
    Ryanne

    • Hi Ryanne,

      I am working on a graduate research project and trying to figure out the problem, since I am having exactly the same thing over here! When I run the command it works fine, but it outputs an empty networkedges.csv.

      Can you please share your steps to resolve this problem with me, too? It will definitely help so much and I’ll owe you one.

      Thanks,

      • It turned out that I somehow managed to solve this step and generated a GEXF file out of my output.csv. This time, however, the whole structure of GEXF has apparently changed. Axel, I know you might be dealing with other parts of your research, but this script also needs some revising, given the disparity between its file format and the format used by the latest version of Gephi. Instead of start/end XML attributes and slices, they seem to have introduced a new element called spell. Here is the link if you want to check: http://gexf.net/format/

        It’s also noted by the Gephi community outreach here in this thread: http://forum.gephi.org/viewtopic.php?t=1419

        Yusuf.

          • After spending a couple of hours, I have to take back my previous comment. I’m operating on the GEXF file – chopping, trimming, and sizing it down – and suddenly Gephi starts importing the file. However, the intact file just won’t get through the import dialog, which throws an error when I try to import the original form of the file – saying, “the left endpoint of the interval must be less than the right endpoint”. Grrrr…

          • Yusuf,

            glad you got it worked out without me needing to re-code anything…

            Out of interest: is your original Twitter dataset in ascending chronological order (as expected by the Gawk scripts)? If not, there’s your problem, as they say.

            Axel

            • Bingo! It’s been almost a week that I’ve been trying to make this work! I started from scratch knowing almost nothing about social network visualizations, and here we go. Up to this point, I can definitely tell that I owe your work a lot. Thank you very much, and I’ll now continue reading your posts on Visualizing Twitter Dynamics. At the end of this research, I will also definitely include your name in my acknowledgments. Thanks!

          • No worries, and thanks – glad it’s useful for you…

            Axel

  10. Never mind, my original CSV is still corrupt. Excel breaks the rows when exporting to CSV on either a Windows machine or a Mac, meaning that I suddenly have 6267 rows instead of 6230 rows, with some tweets spread out over several rows. I’m going to write a PHP script to do this work for me now (MySQL -> CSV); I should have done that in the first place, probably much less work.

  11. Pingback: The New Media and Digital Culture twittersphere | rlturenhout

  12. How do you plan to replace Twapperkeeper?

  13. Thank you Axel.

  14. Pingback: Mapping Online Publics » Blog Archive » Twapperkeeper and Beyond: A Reminder

  15. You guys did a lot of work, thanks! I’ve started doing some work on data visualisations of microblogs. Your posts have helped me a lot.

  16. Thank you very much for your information, Axel ;))

    Greetings from Barcelona !