{"id":2828,"date":"2014-09-05T15:33:05","date_gmt":"2014-09-05T05:33:05","guid":{"rendered":"http:\/\/mappingonlinepublics.net\/?p=2828"},"modified":"2014-09-05T15:55:37","modified_gmt":"2014-09-05T05:55:37","slug":"rebooting-atnix-using-mysql-and-tableau","status":"publish","type":"post","link":"https:\/\/mappingonlinepublics.net\/dev\/2014\/09\/05\/rebooting-atnix-using-mysql-and-tableau\/","title":{"rendered":"Rebooting ATNIX, using MySQL and Tableau"},"content":{"rendered":"<p>During 2012 and 2013 I published a more-or-less-weekly overview of the news sharing patterns in the Australian Twittersphere, the <a href=\"http:\/\/mappingonlinepublics.net\/2013\/08\/19\/atnix-australian-twitter-news-index-weeks-28-322013\/\">Australian Twitter News Index (ATNIX)<\/a>, which I also crossposted to <a href=\"https:\/\/theconversation.com\/columns\/axel-bruns-1433\">my column at The Conversation<\/a>. Technical issues and the call of other commitments have forced me to put ATNIX on the backburner for some time, but it\u2019s time now to update and restart the index. This post is about the methods used to generate ATNIX \u2013 and soon I\u2019ll also post a first new analysis of the data.<\/p>\n<p>First, a reminder: ATNIX builds on the fact that, given a domain name like <em>abc.net.au<\/em>, the <em>Twitter<\/em> search API will return all tweets which contain a link to that domain, even if the link has been shortened by <em>Twitter<\/em>\u2019s mandatory URL shortener <em>t.co<\/em> and\/or one of the other common URL shorteners (<em>bit.ly<\/em>, <em>ow.ly<\/em>, etc.). This makes it possible to use <em><a href=\"https:\/\/github.com\/540co\/yourTwapperKeeper\">yourTwapperkeeper<\/a><\/em> or other tracking tools to capture all of the tweets that link to one or more given domains, on an ongoing basis. As always with <em>Twitter<\/em>, there are limits to this approach, of course: first, strangely, the <em>Twitter<\/em> streaming API doesn\u2019t provide identical functionality: tracking a term like <em>abc.net.au<\/em> there does precisely nothing. This means that (for the time being) we\u2019ll continue to generate ATNIX using our trusty <em>yourTwapperkeeper<\/em> rather than more powerful tools like <em><a href=\"https:\/\/wiki.digitalmethods.net\/Dmi\/ToolDmiTcat\">DMI-TCAT<\/a><\/em>, since for tracking tweets the latter relies on the streaming API alone. Second, data gathering using the standard <em>Twitter<\/em> API is subject to <em>Twitter<\/em>\u2019s 1% rule: we\u2019ll miss tweets if the search results we should be getting add up to more than 1% of the total, global tweet volume at the time. This seems unlikely in our present context unless an article on an Australian news Website goes viral on a global basis, however.<\/p>\n<p>Those limitations aside, we\u2019re therefore able to track Australian news sharing patterns on <em>Twitter<\/em> by tracking a list of Australian domain names. Our current list includes some 35 domains of news and commentary sites ranging from <em>abc.net.au<\/em> to <em>watoday.com.au<\/em>, and from <em>crikey.com.au<\/em> to <em>thesaturdaypaper.com.au<\/em>. There are some problematic cases, though: since its move from <em>theconversation.edu.au<\/em> to a <em>.com<\/em> address, and its establishment of a UK and international contributor base, <em>The Conversation<\/em> has a much broader audience than just Australian users, so its numbers may be somewhat inflated by comparison with Australian-only opinion sites \u2013 and comparatively recent international entries into the Australian news environment such as <em>The Guardian<\/em> and <em>Daily Mail<\/em> continue to operate their Australian editions under <em>.com<\/em> and <em>.co.uk<\/em> domains, respectively, with no way to distinguish specifically Australian content on the basis of the URL alone.<\/p>\n<p>Given that <em>The Conversation<\/em> started in Australia and retains a very strong readership here, we\u2019ll continue to include it in ATNIX, then (but with strong <em>caveats<\/em>), but I\u2019m afraid <em>The Guardian<\/em> and <em>Daily Mail<\/em> won\u2019t be covered by ATNIX for the time being. In addition, we\u2019ll also filter the data to exclude some sections of Websites which cover more than news alone \u2013 for example, <em>abc.net.au<\/em>\u2019s TV guide, or <em>sbs.com.au<\/em>\u2019s (very popular) <em>Pop Asia<\/em> section. We\u2019re also using the full resolved URL to distinguish between <em>Yahoo!7 News<\/em> (<em>au.news.yahoo.com<\/em>) and <em>The West Australian<\/em>, which is also hosted by <em>yahoo.com<\/em> (at <em>au.news.yahoo.com\/thewest<\/em>). In the past, ATNIX has also used such URL paths to distinguish between news and opinion content on some of the leading news sites, but recent site redesigns at the ABC and on Fairfax sites have meant that news and opinion content can now no longer be distinguished in this way. So instead of this distinction we\u2019ll add a new table that focusses specifically on the opinion-only sites.<\/p>\n<h2>From CSV and <em>Excel<\/em> to MySQL and <em>Tableau<\/em><\/h2>\n<p>So far, so good \u2013 this method of gathering and processing our data hasn\u2019t really changed much since we started ATNIX. However, our previous method of storing and analysing these datasets (as comma- or tab-separated files, and in <em>Excel<\/em>) has become more and more unsustainable the longer we\u2019ve continued ATNIX, since the longitudinal analysis of these datasets rapidly causes problems for <em>Excel<\/em> with its antediluvian limit of a maximum of ~1 million data rows per spreadsheet. ATNIX often pulls in more tweets than that in a single month. The solution to this is the popular \u2018big data\u2019 analytics tool <em>Tableau<\/em>, which has no such limits to the size of its datasets, and is also able to connect to a range of database solutions rather than being limited to working with static data files.<\/p>\n<p>It\u2019s tempting to move immediately to a NoSQL solution for storing our ATNIX data, then, but for the moment (with some tens of millions of tweets in our current ATNIX dataset) a standard MySQL setup will still be sufficient (and once the data are in MySQL, transferring them to a more powerful cloud-based solution will be comparatively easy, anyway). So, the first step is to import our raw tab-separated data files (originally in <em>yourTwapperkeeper<\/em> format, and now with a few more columns added following the URL resolution process) into MySQL. Once we\u2019ve set up a table with the right field structure, this is comparatively easily done by using a LOAD DATA statement in MySQL \u2013 but there\u2019s a catch: as it turns out, MySQL takes a somewhat unusual approach to parsing input files for escape characters.<\/p>\n<p>We\u2019re using tab-separated files in order to ensure that commas and quotation marks in the raw tweet data don\u2019t cause any problems in identifying the start and end of fields and rows correctly. (My modifications to the original <em>yourTwapperkeeper<\/em> export functions, <a href=\"http:\/\/mappingonlinepublics.net\/2011\/06\/21\/switching-from-twapperkeeper-to-yourtwapperkeeper\/\">available here<\/a>, have already stripped out any stray tab or newline characters from the original tweets.) But when it imports data from a file, MySQL also pays attention to the \\ character, treating any backslash as an escape character rather than a literal backslash. This is especially problematic if the backslash is followed by a letter like \u2018n\u2019 or \u2018t\u2019 \u2013 in a tweet, \\n\/ would therefore be treated as a newline character followed by a slash, and break the data being imported. Similarly, a \\ at the very end of a tweet would escape the field-separating tab character which follows, so that all columns in the current row move left.<\/p>\n<p>This behaviour can be fixed with a simple Gawk helper script, though, which escapes all backslashes in our import file by doubling them:<\/p>\n<pre style=\"font-size: x-small;\"># escapebackslashes.awk - replaces any single backslash \\ with an escaped backslash \\\\ ahead of MySQL importing\r\n#\r\n# usage: gawk -F , -f escapebackslashes.awk original.csv &gt;escaped.csv\r\n#\r\n# Released under Creative Commons (BY, NC, SA) by Axel Bruns - a.bruns@qut.edu.au\r\n\r\n{\r\n\tgsub(\/\\\\\/, \"\\\\\\\\\")\r\n\tprint\r\n}<\/pre>\n<p>After importing the TSV file, all those double backslashes are single again.<\/p>\n<p>Loading the escaped datafile into MySQL is easy if a table containing the correct columns has been set up. The SQL query to do so will depend on the table structure, but will look something like this:<\/p>\n<pre style=\"font-size: x-small;\">LOAD DATA LOCAL INFILE '\u2026filename\u2026'\r\n  IGNORE \r\n  INTO TABLE `\u2026table\u2026`\r\n  CHARACTER SET 'latin1'\r\n  FIELDS TERMINATED BY '\\t'\r\n  LINES TERMINATED BY '\\n' \r\n  IGNORE 1 LINES \r\n(\r\n  field1, field2, field3, \u2026\r\n)<\/pre>\n<p>Moving our entire ATNIX dataset since mid-2012 into a single MySQL database has taken me some time, but we now finally have a full set, and the capability to continue to update it with new data as data gathering for ATNIX continues. In this database, each row contains a single tweet, including the tweet text, all the metadata gathered by <em>yourTwapperkeeper<\/em>, and the additional fields for the resolved URL which our URL resolution process has added. With this, we\u2019re ready to explore the data using <em>Tableau<\/em>.<\/p>\n<h2>Introducing <em>Tableau<\/em><\/h2>\n<p>In addition to reading CSV, TSV, and <em>Excel<\/em> files, <em><a href=\"http:\/\/www.tableausoftware.com\/\">Tableau<\/a><\/em> (which is free to licence for students, incidentally) is able to plug directly into MySQL and other databases \u2013 all that\u2019s required are the server and login details. Where it surpasses <em>Excel<\/em> (other than in not being limited to just over one million rows per dataset) is in being able to abstract from the raw data on the fly: advancing well beyond what <em>Excel<\/em> has to offer, <em>Tableau<\/em> is all pivot tables, all the time. We\u2019re able to feed it the raw ATNIX data, tweet for tweet, and from this generate day-to-day, week-to-week, and month-to-month activity patterns, distributions of attention across the various news sites, and rankings of the most active news sharers directly, without the tedious steps of data processing, extraction, and aggregation which would have been required in <em>Excel<\/em>; we\u2019re even able to define standard analytics dashboards which can be re-used time and again. And because we\u2019re still working directly with the raw data in doing so, we nonetheless retain the ability to drill down immediately to the individual tweets which are responsible for specific phenomena showing up in the analysis.<\/p>\n<p>The purpose of this post has been to document the processing steps we\u2019re taking with ATNIX, from the raw <em>yourTwapperkeeper<\/em> data to our analysis using <em>Tableau<\/em>. As we\u2019re completing the transition to this setup, I\u2019ll also soon start to post some more or less regular ATNIX analysis updates again. Down the track, I\u2019m also hoping to compare the Australian patterns with similar data we\u2019ve been gathering for Germany, Norway, and Sweden for some time now \u2013 and to examine how newssharing on <em>Twitter<\/em> in Australia compares with news engagement patterns on <em>Facebook<\/em>, and more general patterns of access to Australian news sites. More to come!<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>During 2012 and 2013 I published a more-or-less-weekly overview of the news sharing patterns in the Australian Twittersphere, the Australian Twitter News Index (ATNIX), which I also crossposted to my column at The Conversation. Technical issues and the call of other commitments have forced me to put ATNIX on the backburner for some time, but &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/mappingonlinepublics.net\/dev\/2014\/09\/05\/rebooting-atnix-using-mysql-and-tableau\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Rebooting ATNIX, using MySQL and Tableau&#8221;<\/span><\/a><\/p>\n<p><!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":2,"featured_media":2830,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[175,176,8],"tags":[208,286,283,285,284,298],"class_list":["post-2828","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-capture","category-processing","category-twitter","tag-atnix","tag-data-processing","tag-excel","tag-mysql","tag-tableau","tag-twitter","entry"],"_links":{"self":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2828","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/comments?post=2828"}],"version-history":[{"count":2,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2828\/revisions"}],"predecessor-version":[{"id":2831,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/posts\/2828\/revisions\/2831"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media\/2830"}],"wp:attachment":[{"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/media?parent=2828"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/categories?post=2828"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mappingonlinepublics.net\/dev\/wp-json\/wp\/v2\/tags?post=2828"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}