Currently, at QUT Social Media HQ, we’re in the process of developing the new version of our Twitter capture software, led by CCI Data Scientist Troy Sadkowsky. During development, we’ve extracted a few interesting datasets, and this blog post is going to examine one of those: a set of one million Twitter IDs. This set was gathered by registering a new Twitter account on 19 March, and then capturing the user profiles of the 1 million Twitter IDs that immediately preceded it, with the data being collected several days after the account creation. As it happened, these IDs had creation dates covering a range of just over 8 hours and, by the time we collected the data, corresponded to 422,794 individual accounts. The discrepancy between the number of IDs and accounts requires further exploration; while a number of them could be closed accounts, it seems unlikely that Twitter closed almost 600,000 newly opened accounts within a few days. Thus, we are left to wonder whether some IDs are never allocated, whether IDs are allocated at the start of the registration process and never activated, or whether something else entirely is going on. Regardless, 422,794 accounts in just over 8 hours represents a rate of roughly 833 new accounts per minute. There were some other interesting findings, so on we go.
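For anyone wanting to check the headline figures, here is a minimal sketch of the arithmetic in Python. The function names are my own, invented for illustration. Note that the quoted 833 accounts per minute only works out if the window is about 8.5 hours rather than exactly 8, which is an inference from the numbers and not a measured value.

```python
# Figures quoted in the post.
TOTAL_IDS = 1_000_000
ACCOUNTS_FOUND = 422_794
QUOTED_RATE_PER_MIN = 833


def missing_ids(total_ids, accounts):
    """Number of IDs for which no account was found."""
    return total_ids - accounts


def rate_per_minute(accounts, hours):
    """Average number of accounts created per minute over the window."""
    return accounts / (hours * 60)


def implied_hours(accounts, rate_per_min):
    """Window length (in hours) implied by a given per-minute rate."""
    return accounts / rate_per_min / 60


print(missing_ids(TOTAL_IDS, ACCOUNTS_FOUND))                        # 577206
print(round(rate_per_minute(ACCOUNTS_FOUND, 8)))                     # 881, assuming exactly 8 hours
print(round(implied_hours(ACCOUNTS_FOUND, QUOTED_RATE_PER_MIN), 2))  # 8.46
```

So "almost 600,000" is 577,206 unaccounted-for IDs, and a window of exactly 8 hours would give closer to 881 accounts per minute.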
Firstly, I should mention that the above diagram, and all the others in this blog post, are from Tableau, which we are beginning to use for our analysis in place of Excel. The above graph has Twitter ID on the vertical axis, and Time Created on the horizontal, covering the full range of 1 million IDs and just over 8 hours. As you can see, account IDs are being allocated in a more or less linear fashion (implying that old, deleted account IDs are not recycled), but there appears to be a slight disconnect, in that IDs are being allocated in two different ranges at the same time. In fact, as you can see by zooming in, there are actually three.
This graph zooms in on a smaller period of time, between approximately 6:55 and 7:30pm UTC on 18 March. By zooming in, we can see that there are three approximately parallel lines, with the bottom one being out of sync with the top two by almost 2000 IDs, or about 11.5 minutes. One idea for the cause of this is that Twitter has three separate registration engines allocating IDs, with each engine being allocated a range of IDs periodically. However, we are currently unable to verify this; it could also be that there is some caching process before new accounts are added to the database.
It is also worth noting that, of these 1 million IDs, there are 1762 accounts for which the API returns profile information but no username. One current theory is that these may be deleted and/or banned accounts whose username has been freed for re-use, while Twitter keeps the account ID active for internal recordkeeping. Again, further work needs to be conducted to confirm this. Given that there were a few days between the accounts being created and the data being collected, 1762 would seem a more reasonable number than 600,000 for banned accounts.
Where are they?
One advantage of Tableau is that it allows us to produce ‘easy’ visualisations of where in the world Twitter users are. There are a couple of different ways of doing this, and they all rely on users volunteering correct information. Of the 422,794 new users, 7,461 had geo-location enabled. A map of these users can be seen below, and this provides a relatively precise measure of the location of these users. What is interesting here, particularly in reference to the diagram that follows, is that both Russia and Canada have virtually no users with geo-location, yet both have a quite substantial number of overall users. By contrast, geo-located users are more concentrated in the United States and Europe, and Mexico and South America are also strongly represented.
The second technique we used was to map the users’ approximate location according to the timezone set in their user profile. As it turned out, this was a relatively tedious process of mapping Twitter timezones to countries, since Twitter labels them with a mixture of time zone names (‘Eastern time’, ‘Pacific Time’), cities (‘Melbourne’) and countries (‘Greenland’). In case anyone repeats that same exercise in the future, I have made a spreadsheet of the conversion available here, which you should be able to import into Tableau in CSV format. There are a few caveats with this data. Countries such as The Netherlands and Morocco seem to be over-represented, which we believe is caused by them being the first available location for popular timezones; for example, Amsterdam is the first listed alphabetically for Central European Time, which includes populous countries such as France and Germany. This data also shows large numbers of registrations for the United States and Brazil. It is also worth mentioning that the time span here, approximately 4pm – 1am UTC, would be afternoon and evening in Europe, and noon-9pm in the US, while being late night and early morning in Australia, which may explain the low number of Australians in the dataset.
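To give a sense of what the timezone-to-country conversion involves, here is a minimal sketch in Python. The lookup table is a hypothetical five-row excerpt built from the examples mentioned above, not the actual spreadsheet, and the function name and the placeholder labels for missing or unmapped timezones are my own assumptions.

```python
from collections import Counter

# Hypothetical excerpt of a timezone-label -> country lookup table.
# The real conversion spreadsheet has many more rows; these labels are
# only the examples mentioned in the post.
TIMEZONE_TO_COUNTRY = {
    "Eastern Time": "United States",
    "Pacific Time": "United States",
    "Melbourne": "Australia",
    "Greenland": "Greenland",
    # Amsterdam stands in for all of Central European Time, which is why
    # The Netherlands ends up over-represented.
    "Amsterdam": "Netherlands",
}


def count_by_country(profile_timezones, mapping=TIMEZONE_TO_COUNTRY):
    """Count users per country from the timezone label in each profile."""
    counts = Counter()
    for tz in profile_timezones:
        if not tz:
            # Profile has no timezone set at all.
            counts["(no timezone set)"] += 1
        else:
            # Labels missing from the table are kept visible, not dropped.
            counts[mapping.get(tz, "(unmapped)")] += 1
    return counts


sample = ["Amsterdam", "Eastern Time", "Pacific Time", None, "Atlantis"]
print(count_by_country(sample))
```

Keeping the missing and unmapped labels as their own categories makes it easier to see how much of the dataset the map actually covers.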
What do they do?
The three charts below show the number of statuses, followers, and following respectively for these users a few days after their accounts were created. These more or less stand alone. However, it is worth noting that for the chart showing total followers I removed 5 data points for the visualisation; these appeared to be accounts of celebrities, and had 75k, 32.5k, 24.9k, 24.5k and 17.5k followers.
Status Count vs. Date of Account Creation:
Followers Count vs. Date of Account Creation (note previous caveat):
‘Following’ Count vs. Date of Account Creation:
So, that’s who’s joining Twitter. Now to think about the 25.2 million new accounts that may have been created by the time this post goes live.