{"id":27,"date":"2010-07-22T18:59:55","date_gmt":"2010-07-22T08:59:55","guid":{"rendered":"http:\/\/www.mappingonlinepublics.net\/dev\/2010\/07\/22\/creating-twitter-timelines-from-twapperkeeper-data\/"},"modified":"2012-04-10T13:20:13","modified_gmt":"2012-04-10T03:20:13","slug":"creating-twitter-timelines-from-twapperkeeper-data","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2010\/07\/22\/creating-twitter-timelines-from-twapperkeeper-data\/","title":{"rendered":"Creating Twitter Timelines from Twapperkeeper Data"},"content":{"rendered":"<p>This is the first in what will be an irregular series of methods posts outlining some of our approaches to working with datasets from various sources. Part of our work over the next few weeks will be to examine what happens in the Australian Twittersphere around the upcoming federal election, so I figured it would be a good idea to start with some of the basics of working with <em><a href=\"http:\/\/twapperkeeper.com\/\">Twapperkeeper<\/a><\/em> data. (Note that what I&#8217;ll outline here is a working solution, but not necessarily an elegant one &#8211; if anybody has a better suggestion, we&#8217;d love to hear it.)<\/p>\n<p><em>Twapperkeeper<\/em> is an online tool for capturing (public) tweets that contain specific #hashtags, keywords, or @usernames. The datasets it creates are delivered in a standard comma-separated value (CSV) format &#8211; including fields such as the tweet itself, the username of the poster, and a timestamp in various formats, as well as a few other bits of backend information.<\/p>\n<p>One of the most immediate points of interest in working with a <em>Twapperkeeper<\/em> dataset is often to get a sense of the tweet timeline: how does the volume of tweets change over time, for example in response to events occurring in the world? The datasets provide that information &#8211; but to create an accurate visualisation of the timeline needs some doing. In this post, I&#8217;m going to work through an example, using <em>Twapperkeeper<\/em> data collected by my colleague Jean Burgess during the recent Australian Labor Party leadership spill (centred around the #spill hashtag and a few related ones). <\/p>\n<p> <!--more-->  <\/p>\n<p>By default, the <em>Twapperkeeper<\/em> dataset includes two timestamp fields: <em>created_at<\/em> (human-readable, e.g. &#8220;Sat Jun 19 00:28:04 +0000 2010&#8221;, and <em>time<\/em> (in Unix format, e.g. &#8220;1276907284&#8221;). Both are problematic: the format for <em>created_at<\/em> can be inconsistent &#8211; sometimes the &#8216;+0000&#8217; GMT offset is placed before the year, sometimes after it -, and the Unix timestamp has a second-by-second resolution that may be too high if we&#8217;re dealing with lower volumes of tweets (e.g. only a few relevant tweets per minute, or per hour).<\/p>\n<p>So, to be able to visualise the data in a more meaningful way, we&#8217;ll need to condense the timestamps to larger units of time &#8211; minutes, hours, days. I&#8217;ve put together a quick <a href=\"http:\/\/www.gnu.org\/software\/gawk\/\">Gawk<\/a> script which helps with this (Gawk is a command-line tool that&#8217;s perfect for processing CSV files &#8211; and it&#8217;s available for Unix, <a href=\"http:\/\/gnuwin32.sourceforge.net\/packages\/gawk.htm\">Windows<\/a>, and Mac):<\/p>\n<\/p>\n<div class=\"wlWriterEditableSmartContent\" id=\"scid:887EC618-8FBE-49a5-A908-2339AF2EC720:22a1aaed-7954-4813-9639-5d7c060730af\" style=\"padding-right: 0px; display: inline; padding-left: 0px; float: none; padding-bottom: 0px; margin: 0px; padding-top: 0px\">         <code><\/p>\n<pre>\n      \n# explodetime.awk - Explode tweet time information\n#\n# this script takes a CSV archive of tweets, and adds various additional timestamp representations\n#\n# expected data format:\n# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time\n# (i.e. typical Twapperkeeper format)\n#\n# output format:\n# text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,time,timestamp,year,month,day,hour,minute\n#\n# tweet time is provided at various levels of specificity for time-based activity mapping:\n#\n# FIELD     - CONTENT\n# time      - full human-readable timestamp\n# timestamp - full Unix timestamp\n# condensed - Unix timestamp, condensed to 100-second intervals (i.e. last two digits changed to '00')\n# year      - YYYY only\n# month     - YYYY-Mon\n# day       - YYYY-Mon-DD\n# hour      - YYYY-Mon-DD HH\n# minute    - YYYY-Mon-DD HH:MM\n#\n# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\n\nBEGIN {\n\tprint \"text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,time,timestamp,condensed,year,month,day,hour,minute\"\n}\n\n{\n\tprint $0  \",\" substr($13,0,8) \"00,\" strftime(\"%Y\",$13) \",\" strftime(\"%Y-%b\",$13) \",\" strftime(\"%Y-%b-%d\",$13) \",\" strftime(\"%Y-%b-%d %H\",$13) \",\" strftime(\"%Y-%b-%d %H:%M\",$13)\n} \n        <\/pre>\n<p><\/code>\n      <\/div>\n<\/p>\n<p>This is called up as follows:<\/p>\n<p><strong>gawk -F , -f explodetime.awk <em>sourcefile.csv<\/em> &gt;<em>outputfile.csv<\/em><\/strong><\/p>\n<p>The resulting CSV output file contains a number of additional columns: &#8216;condensed&#8217;, in which the last couple of digits in the Unix timestamp (the seconds) are changed to zero, and a range of columns which provide a range of condensed human-readable timestamp variations &#8211; from the year only through to a &#8216;YYYY-Mon-DD HH:MM&#8217; format.<\/p>\n<p>For many purposes, the higher-level human-readable versions will be enough &#8211; provided there&#8217;s at least one relevant tweet in any given hour, for example, you could load this new CSV file into Excel and create a pivot table that plots the new &#8216;hour&#8217; field (on the X-axis) against a count of all tweets during that hour (on the Y-axis). For our #spill data, this would look as follows:<\/p>\n<p><a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/wp-content\/uploads\/2010\/07\/image.png\"><img decoding=\"async\" loading=\"lazy\" title=\"image\" style=\"border-top-width: 0px; display: inline; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px\" height=\"374\" alt=\"image\" src=\"http:\/\/www.mappingonlinepublics.net\/dev\/wp-content\/uploads\/2010\/07\/image_thumb.png\" width=\"604\" border=\"0\" \/><\/a> <\/p>\n<p>But there&#8217;s a problem. This method only works if there <em>are<\/em> data for each period &#8211; Excel isn&#8217;t smart enough to recognise that we&#8217;re dealing with time-based data here, and won&#8217;t automatically insert empty spaces where there&#8217;s a gap in the timeline. Take an excerpt from the pivot table for our #spill dataset, for example:<\/p>\n<table cellspacing=\"0\" cellpadding=\"0\" border=\"0\">\n<tbody>\n<tr>\n<td width=\"98\">2010-Jun-24 15<\/td>\n<td width=\"88\">245<\/td>\n<\/tr>\n<tr>\n<td>2010-Jun-24 16<\/td>\n<td>141<\/td>\n<\/tr>\n<tr>\n<td>2010-Jun-24 18<\/td>\n<td>256<\/td>\n<\/tr>\n<tr>\n<td>2010-Jun-24 19<\/td>\n<td>250<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>\n  <br \/>For some reason (perhaps <em>Twitter<\/em> or <em>Twapperkeeper<\/em> was down), no #spill tweets appear in the dataset for hour 17 of 24 June. But that gap doesn&#8217;t appear in the Excel graph above &#8211; instead, hour 18 follows on directly from hour 16: 5 p.m. never happened. Excel&#8217;s limittaions severely compromise our results: gaps don&#8217;t appear, and in fact the times listed on the X-axis aren&#8217;t even evenly spaced!<\/p>\n<p>To fix this, we could manually insert zeroes into our data (e.g. a new line &#8216;2010-Jun-24 17&#8217;, &#8216;0&#8217; in the table above), but that isn&#8217;t feasible for large datasets, of course. Instead, here&#8217;s a somewhat easier way to do it:<\/p>\n<ol>\n<li>Run the Gawk script above over the dataset to add those extra columns. <\/li>\n<li>Load into Excel, and create a pivot table: plot &#8216;condensed&#8217; (the Unix timestamp, rounded down to 100-second intervals) against the count of &#8216;timestamp&#8217; (pretty much any other field will do as well &#8211; we&#8217;re just counting, not summing). <\/li>\n<li>Now copy both columns of the pivot table into a new worksheet. <\/li>\n<li>Select both copied columns, and insert a &#8216;Scatter with only Markers&#8217; graph. <\/li>\n<\/ol>\n<p>\n  <br \/>The crucial benefit of using a scatter graph is that, contrary to other graph types, it <em>does<\/em> preserve gaps in the data. For that reason, I&#8217;m also using a scatter plot with markers for data points, rather than with interpolation lines &#8211; in our example above, an interpolation line would suggest that there were somewhere between 141 and 256 tweets in hour 17 of 24 June, but that&#8217;s clearly incorrect. Instead, in the graph below, you can just make out a gap in tweet activity after the second major spike &#8211; just as the data tell us there should be.<\/p>\n<p><a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/wp-content\/uploads\/2010\/07\/image1.png\"><img decoding=\"async\" loading=\"lazy\" title=\"image\" style=\"border-top-width: 0px; display: inline; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px\" height=\"320\" alt=\"image\" src=\"http:\/\/www.mappingonlinepublics.net\/dev\/wp-content\/uploads\/2010\/07\/image_thumb1.png\" width=\"530\" border=\"0\" \/><\/a> <\/p>\n<p>Unfortunately, there&#8217;s not much point labelling the unit markers on the horizontal axis, because we&#8217;ve had to use Unix timestamps for the data itself &#8211; so the unit markers would read something like &#8216;1276914303&#8217;. However, set an interval of 3600 (60 seconds x 60 minutes = 1 hour) for minor axis markers and 86400 (3600 seconds x 24 hours = 1 day) for major axis markers, and they&#8217;ll correspond to hours and days. To start and finish the X-axis at a given point in time, use <a href=\"http:\/\/www.4webhelp.net\/us\/timestamp.php\">a tool like this<\/a> to find the timestamp to corresponds to that time (don&#8217;t forget to set the appropriate GMT offset!), and use those timestamp values as minimum and maximum values for the X-axis. <\/p>\n<p>So, for our purposes, the three core days (23 to 26 June 2010) in the #spill tweet data look like this &#8211; note the now much more obvious gaps in the data during parts of day two:<\/p>\n<p><a href=\"http:\/\/www.mappingonlinepublics.net\/dev\/wp-content\/uploads\/2010\/07\/image3.png\"><img decoding=\"async\" loading=\"lazy\" title=\"image\" style=\"border-top-width: 0px; display: inline; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px\" height=\"389\" alt=\"image\" src=\"http:\/\/www.mappingonlinepublics.net\/dev\/wp-content\/uploads\/2010\/07\/image_thumb3.png\" width=\"644\" border=\"0\" \/><\/a> <\/p>\n<p>The purpose of this post has been to describe the methods for creating this graph; I&#8217;m not going to go into its meaning here, but will leave that for another post. It&#8217;s very notable how the early excitement spiked in the latter hours of day one (23 June), with a lull during the early morning hours (in Australia) of 24 June. There&#8217;s another time of peak activity during the middle of 24 June, before and after the actual spill vote, though at a lower level of intensity &#8211; perhaps people had moved over to follow the live broadcasts on radio and TV, or discussion had branched out from #spill to other tags?<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>This is the first in what will be an irregular series of methods posts outlining some of our approaches to working with datasets from various sources. Part of our work over the next few weeks will be to examine what happens in the Australian Twittersphere around the upcoming federal election, so I figured it would &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2010\/07\/22\/creating-twitter-timelines-from-twapperkeeper-data\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Creating Twitter Timelines from Twapperkeeper Data&#8221;<\/span><\/a><\/p>\n<p><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","footnotes":""},"categories":[173,176,8],"tags":[12,10,7,297,11,6,9,298],"class_list":["post-27","post","type-post","status-publish","format-standard","hentry","category-politics-2","category-processing","category-twitter","tag-spill","tag-australia","tag-gawk","tag-methods","tag-politics","tag-tools","tag-twapperkeeper","tag-twitter","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/27","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=27"}],"version-history":[{"count":0,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/27\/revisions"}],"wp:attachment":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media?parent=27"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=27"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=27"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}