In the following posts I’m finally keeping my promise to explore in earnest the use of Gephi’s dynamic timeline feature for visualising Twitter-based discussions as they unfolded in real time. A few months ago, Jean posted a first glimpse of our then still very experimental data on Twitter dynamics, with a string of caveats attached – and I followed up on this a little while later with some background on the Gawk scripts we’re using to generate timeline data in GEXF format from our trusty Twapperkeeper archives (note that I’ve updated one of the scripts in that post, to make the process case-insensitive). Building on those posts, here I’ll outline the entire process and show some practical results (disclaimer: actual dynamic animations will follow in part two, tomorrow – first we’re focussing on laying the groundwork).
First, a quick overview: what we’re after is a process that provides us not only with a static map of all connections (i.e., @replies – including old-style ‘RT @user’ retweets) made between a specific group of users on Twitter during a given period of time, but a dynamic visualisation of how those connections unfolded over the course of that period: how specific users assume more or less central positions in the @reply network as time unfolds; how discussion activity waxes and wanes; how particular tweets stimulate further activity in the network (for example as users reply to them or retweet them).
Depending on your point of view and/or on the particular events around which the discussion unfolds, such visualisations may depict either how networks of communication are formed ad hoc on Twitter as acute events happen, or how already existing networks of interconnected users are communicatively active in response to new stimuli. (More on those ideas later.)
For the purposes of our discussion here, I’ll use our well-known dataset on Twitter’s coverage of the lead-up to the June 2010 Australian Labor Party leadership spill, which saw Prime Minister Kevin Rudd replaced by his then-deputy Julia Gillard. Any other dataset would work as well, but this one has the advantage of being well-researched already, neatly collected under the #spill hashtag, and condensed into a few hours on 23 June 2010: the bulk of the discussion around the initial #spill speculation took place between 7 p.m. and midnight on 23 June, AEST. (We collected these data from Twapperkeeper some months ago – and incidentally, because of how Twapperkeeper captures its data, our archive contains any tweets made with a hashtag starting with #spill, including the short-lived #spill2, coined to distinguish this event from the BP Gulf of Mexico oil spill, and #spillard, in honour of Julia Gillard.)
First Steps: How Long Do @replies Live?
So we start by converting those data into a format which will allow Gephi to generate a dynamic timeline visualisation of the network. This follows the process I’ve outlined in my previous post on dynamic visualisation, and uses the Gawk scripts I’ve posted (and updated) there:
- preparegexfattimeintervals.awk to generate basic network data with timestamped edges
- gexfattimeintervals.awk to convert these basic data into Gephi’s GEXF format for visualisation
(Both scripts are still somewhat experimental, and I’d love to hear your feedback on my implementation of some occasionally fairly tricky maths…)
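For orientation, the end product of this pipeline is a GEXF file in Gephi’s dynamic mode, where each edge carries explicit start and end times. A schematic fragment might look something like this (a hand-written sketch following the GEXF 1.1 dynamics conventions – the scripts’ actual output may differ in its details):

```xml
<gexf xmlns="http://www.gexf.net/1.1draft" version="1.1">
  <graph mode="dynamic" defaultedgetype="directed">
    <nodes>
      <node id="user1" label="user1"/>
      <node id="user2" label="user2"/>
    </nodes>
    <edges>
      <!-- an @reply sent at t = 0, expiring after a 1800s decay time -->
      <edge source="user1" target="user2" weight="1" start="0" end="1800"/>
    </edges>
  </graph>
</gexf>
```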
What these two scripts do, in essence, is first to filter through our Twapperkeeper archive to identify any @replies from one user within the #spill discussion to another, along with the exact time at which those @replies were made (in the universal Unix timestamp format). In a second step, they also set an arbitrary time at which the connection made by an @reply expires again.
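That first extraction step can be sketched as follows (a minimal Python reimplementation for illustration only – the actual scripts are written in Gawk, and the field layout here is hypothetical):

```python
import re

DECAY = 1800  # an @reply's 'lifetime' in seconds (30 minutes)

def extract_reply_edges(tweets, decay=DECAY):
    """For each tweet that @mentions another user, emit a directed edge
    (sender, receiver, start, end) which expires `decay` seconds after
    the tweet's Unix timestamp."""
    mention = re.compile(r'@(\w+)')
    edges = []
    for sender, text, ts in tweets:
        for receiver in mention.findall(text):
            if receiver.lower() != sender.lower():  # skip self-mentions
                edges.append((sender.lower(), receiver.lower(), ts, ts + decay))
    return edges

# hypothetical sample rows: (sender, tweet text, Unix timestamp)
sample = [
    ("user1", "RT @user2: Rudd to face a #spill?", 1277280000),
    ("user3", "@user1 looks like it #spill", 1277280600),
]
print(extract_reply_edges(sample))
```

Note that this treats old-style ‘RT @user’ retweets as @replies too, just as the post describes.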
If we didn’t set this time (or if the expiry time was very large), then the @reply network would end up being cumulative: any new @reply would simply add to the existing network, so that the network would steadily increase in density, and never lose any links once they’ve been made. At the very end of the overall timeframe, the network would then look exactly like a standard, static network visualisation, where we count the number of connections between two nodes (here, the number of @replies between them) to determine the numerical weight of the network edge connecting them. This may be useful in some cases, but not here: what we’re interested in is both when users join the discussion and become more active, and when their rate of contributions declines again and they cease to participate actively.
This necessarily introduces a particular interpretation of the meaning of an @reply, as a temporary connection which momentarily brings the sender closer to the receiver, increasing the sender’s visibility in the receiver’s timeline – and as the receiver receives further tweets (@replies or not) from other users, that visibility starts to decrease again until the sender is all but invisible once more. This model of interpreting @replies seems especially appropriate to a hashtag-based discussion like #spill, incidentally (as opposed to an @reply discussion within an established network of Twitter followers), since a topical, event-based #hashtag like #spill is used to bring together, more or less ad hoc, a group of users who are all interested in the same topic, but may not necessarily all already follow each other.
Strictly speaking, the decay time of an @reply should therefore also depend on a number of other parameters: for example, both the total number of tweets in a #hashtag discussion at any one point (the more activity, the more quickly any individual contribution disappears from view), and the total number of tweets in any one user’s timeline (which is dependent on the number of people the user follows – as again, the overall number of tweets the user receives will influence how quickly any one individual @reply is buried under subsequent messages). Put another way: in a slow-moving #hashtag discussion which is monitored by a user who follows few others, a received @reply can be expected to create a comparatively long-lasting connection; conversely, a user with a highly active overall timeline, who follows a fast-paced #hashtag discussion, is less likely to take much notice of any individual @reply and its sender unless further conversation follows.
From Decaying @replies to Dynamic Edges
In the absence of first-hand information about the overall volume of tweets received by each individual participant in #spill (or any other #hashtag), however, and in the interest of keeping things manageable, we can safely set an overall decay time for @replies, which is valid throughout the entire dataset we’re examining. This decay time should be appropriate to the data we’re examining – in the case of our #spill discussion on 23 June, which unfolds in the main over the course of five hours, a decay time of 30 minutes might provide a reasonable approximation, for example.
Given a decaytime value of 1800, then (since 30 minutes = 1800 seconds), what those Gawk scripts produce – and what took me some time to code – is network data in which each @reply represents a directed network edge from sender to receiver, existing between time x and x + 1800. Where multiple @replies from sender to receiver were made during the same 30-minute / 1800-second window, the weight of the edge increases for as long as the overlap lasts – for example, for a series of @replies from user1 to user2 at times t1 = 0, t2 = 1000, and t3 = 2000 (and with decaytime still set at 1800), the edge from user1 to user2 would have the following values:
- t = 0 to 1000: weight 1 (only @reply 1 is active)
- t = 1000 to 1800: weight 2 (@reply 2 becomes active)
- t = 1800 to 2000: weight 1 (@reply 1 expires, and only @reply 2 is active)
- t = 2000 to 2800: weight 2 (@reply 3 becomes active)
- t = 2800 to 3800: weight 1 (@reply 2 expires, and only @reply 3 remains)
- t = 3800 and on: weight 0 (all @replies have expired)
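The interval arithmetic above can be sketched as a simple sweep over @reply start and expiry events (again a hypothetical Python rendering of the overlap logic, not the Gawk original):

```python
def weight_timeline(reply_times, decay=1800):
    """Given the times of @replies from one sender to one receiver, return
    (start, end, weight) intervals for the resulting dynamic edge, where the
    weight at any instant is the number of @replies still 'alive' (i.e. sent
    less than `decay` seconds ago)."""
    # Each @reply contributes +1 at its timestamp, -1 at timestamp + decay.
    events = sorted([(t, +1) for t in reply_times] +
                    [(t + decay, -1) for t in reply_times])
    intervals, weight, prev = [], 0, None
    for t, delta in events:
        if prev is not None and t > prev and weight > 0:
            intervals.append((prev, t, weight))
        weight += delta
        prev = t
    return intervals

print(weight_timeline([0, 1000, 2000]))
# → [(0, 1000, 1), (1000, 1800, 2), (1800, 2000, 1), (2000, 2800, 2), (2800, 3800, 1)]
```

Running this on the example from the post (t1 = 0, t2 = 1000, t3 = 2000) reproduces the six-phase weight profile listed above, with weight 0 from t = 3800 onwards.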
Visualised in Gephi, then, we would expect to see an edge from user1 to user2 right from the start, which would get thicker and thinner as time passes and finally disappear after t = 3800.
What our scripts don’t do (at least for now), incidentally, is to also make the nodes themselves dynamic: any user who sends or receives an @reply at any one point during the entire period will be visible as a node in the network during any subset timeframe we might select (they may simply not have any incoming or outgoing edges during that timeframe). For the moment, that’s a deliberate limitation so that we can generate some of the visualisations which we’ll examine later – but of course some further development of our scripts would also allow for dynamically appearing and disappearing nodes…
Loading into Gephi
Having selected the subset of #spill discussion which took place between 7 p.m. and midnight on 23 June 2010, and processed it with our scripts (with decaytime set to 1800s, and a timestamp offset value so that t = 0 equals 7 p.m. exactly), we now have a dynamic GEXF file of @replies which we can import into Gephi. Measured in seconds, the maximum timeframe that this network can possibly cover is 19800s: five hours from 7 p.m. to midnight = 5 x 3600s = 18000s, plus a maximum of 1800s of decay time for any @replies made right on the stroke of midnight.
In many cases, the network represented by these data will still be too dense to be useful for visualisation: the #spill network for this period still contains a total of 5485 nodes connected by 12805 edges, for example. So, we’ll filter the overall network, dropping the least significant nodes; the best way to do so for our present purposes is to filter by the degree value (which represents indegree + outdegree, that is, the sum of incoming and outgoing @replies received and made by the user) – this preserves both users who are talked about a lot, but don’t themselves participate in the discussion (like the then still active account @KevinRuddPM, or the non-existent account @KevinRuddExPM), and users who sent a large number of @replies, but for whatever reason did not receive any responses which were also hashtagged as #spill (and there were quite a few of those…).
How much you should (or have to) filter here depends on both your available processing power (networks with more nodes take longer to visualise, obviously) and the desired level of readability for your resulting visualisation (larger networks may end up looking more interesting, but if you want to start adding labels, things can get messy very quickly). For our present purposes I’m going to filter out any nodes with a combined degree value below 10 – in other words, any nodes which didn’t send or receive a total of ten @replies, at least. This leaves me with the much more manageable number of 495 key participants.
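The same degree threshold can of course also be applied before the data ever reach Gephi. As a sketch (hypothetical Python, working from the edge list our scripts produce rather than inside Gephi itself):

```python
from collections import Counter

def filter_by_degree(edges, min_degree=10):
    """Keep only users whose combined degree (total @replies sent plus
    received, one count per endpoint of each edge) reaches min_degree.
    `edges` is a list of (sender, receiver) pairs, one per @reply."""
    degree = Counter()
    for sender, receiver in edges:
        degree[sender] += 1
        degree[receiver] += 1
    return {user for user, d in degree.items() if d >= min_degree}

# e.g. eleven @replies to a single account keep that node in the network
# even if it never tweets back (as with @KevinRuddPM in the post):
edges = [("user%d" % i, "kevinruddpm") for i in range(11)]
print(filter_by_degree(edges))
```

Because the threshold is on combined degree, this keeps both silent but much-addressed accounts and prolific senders who received no hashtagged responses, exactly as described above.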
Gephi (0.7beta at the time of writing) is still a little temperamental, so these are the steps I follow in filtering the data and turning it into a new project:
- Load the full GEXF file into Gephi (File menu > Open).
- Run the Average Degree statistics measure (Statistics tab > Network Overview).
- Filter for nodes with degree ≥ 10 (Filters tab > Attributes > Range > Degree; set range minimum to 10; click Filter).
- Select and copy entire visible graph (Graph tab > Rectangle selection tool > select graph; Right-click on graph and Copy to > New workspace).
- Switch to new workspace (click Workspace 0 in the bottom right-hand corner of the Gephi window, select Workspace 1).
- Export filtered data to new GEXF file (File menu > Export > Graph file; select GEXF format and save).
- Close and reopen Gephi, and load exported network data.
(Not pretty, but it works – and a click on the timeline ‘play’ button at the bottom of the main Gephi window confirms that the dynamic timeline data has remained intact, which doesn’t seem to happen reliably without these extra export/import steps.)
The Overall Network
Once we visualise this in Gephi, what we get looks roughly like this:
Node sizes in this map represent degree (i.e. indegree + outdegree – the total number of @replies sent and received), with edges between nodes showing that @replies were made from one to the other. In this, curved edges are directed connections (one node in the pair is doing much more of the @replying than the other – which may indicate frequent retweeting), while straight edges show a more balanced conversation between the two. (PDF here.)
I’ve also coloured these connections to indicate greater frequency of @replies (that is, the weight of these edges between nodes), but this may be a little misleading for this overall map: what Gephi appears to do here is to take the average weight of an edge during the time that there is an edge at all – so, a connection between two nodes which consists of a single @reply during the entire timeframe ends up with the same average weight as a connection between two nodes which exchanged a series of non-overlapping @replies (i.e. where each previous @reply had expired before the next one was made).
So, this visualisation of the network over the entire five hours probably underestimates the strength of individual connections in the network; if our goal was to visualise a static network, then a standard cumulative approach (where each @reply between two users increases the weight of the edge connecting them by a value of 1) would be better.
But this underestimation is an issue only if we’re looking at lengthy timeframes (where Gephi’s approach to taking the average edge weight produces only a relatively poor approximation of the ‘correct’ edge weight as we would normally calculate it); once we’re zooming in on shorter periods of time, Gephi’s averages will consist of a much smaller number of data points (that single @reply, and the edge which represents it, will either exist or not during a shorter time window), and the statistical error is much reduced.
(This averages error could also be addressed in part by explicitly setting edge weights to zero during times when connections are absent – right now, my scripts only generate edge data for the time that an edge is actually present. For example, for a single @reply from t = 1000s, my script would generate a line that says, essentially,
1000s to 2800s, weight 1
Gephi calculates the average weight of this edge as 1: weight 1 / 1 period = 1. If periods of absence were included, and the line said
0s to 1000s, weight 0; 1000s to 2800s, weight 1; 2800s to the end, weight 0
then Gephi’s average weight figure would look different – it would be calculated as (weight 0 + weight 1 + weight 0) / 3 periods ≈ 0.33. That’s still not taking into account the different relative lengths of the three time periods – 1000s, 1800s, and a potentially very long stretch from 2800s to the end of the overall timeframe – but it’s an improvement. I’ll take this up in a later revision of my scripts, unless the Gephi team provide a fix for my bug report first…)
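The arithmetic in this aside can be made concrete (a Python sketch of my reading of Gephi’s behaviour as described above – I haven’t inspected Gephi’s source, so the ‘naive’ average is an assumption; the duration-weighted version is what a fully corrected calculation might look like):

```python
def naive_average(spells):
    """Average edge weight over the listed spells, ignoring their durations
    (what Gephi appears to do, per the discussion above)."""
    return sum(w for _, _, w in spells) / len(spells)

def duration_weighted_average(spells):
    """Average weight weighted by how long each spell actually lasts."""
    total = sum(end - start for start, end, _ in spells)
    return sum((end - start) * w for start, end, w in spells) / total

# a single @reply at t = 1000s, decaytime 1800s, overall timeframe 19800s:
present_only = [(1000, 2800, 1)]                               # what the scripts emit now
with_zero = [(0, 1000, 0), (1000, 2800, 1), (2800, 19800, 0)]  # with explicit absences

print(naive_average(present_only))                      # → 1.0
print(round(naive_average(with_zero), 2))               # → 0.33
print(round(duration_weighted_average(with_zero), 2))   # → 0.09
```

The duration-weighted figure (1800s of weight 1 out of 19800s overall ≈ 0.09) shows how much further the simple three-period average still is from a truly time-proportional edge weight.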
Even taking into account this underestimation of edge weights for the overall graph, though, there seems to be a relatively even distribution of discussion across the network – and what appears to be the most pronounced edge – the red line from @noirbp to @612brisbane on the bottom right-hand side of the graph – turns out to be the work of what seems to have been a (now defunct) Twitter bot which selectively retweeted the #spill tweets from various news outlets (given the name @noirbp, it was probably set up in response to the BP oil spill in the Gulf of Mexico, though).
The absence of highly weighted edges probably indicates that discussion within the #spill network was relatively free-ranging, without separating out into lengthy conversations between specific groups of users; this may be a common feature of #hashtag conversations, which are able to bring together users who have nothing more in common than their interest in the same topic. Alternatively, or additionally, where longer conversations between individual users did follow, those tweets may no longer have included the #spill hashtag – a shift from participating in a public discussion (using #spill) to engaging in a more private (if still publicly visible) conversation on the sidelines (no longer using #spill), possibly via an intermediate stage where the conversation is already turning private, but is still performed to the wider audience by retaining the #spill hashtag. Once we move beyond examining #hashtag conversations only, we will be able to trace those patterns in more detail.
For this post, though, that’s quite enough. We now have the data in the right format for dynamic network visualisations – in the next post, we’ll take this baby for a spin to see what we can do with it.