Social media research is on the rise but researchers are increasingly at the mercy of the changing limits and access policies of social media platforms. API and third party access to platforms can be unreliable and costly. Sam Kinsley outlines the limitations and stumbling blocks when researchers gather social media data. Should researchers be using data sources (however potentially interesting/valuable) that restrict the capability of reproducing our research results?
Many of the research articles and blogs concerning conducting research with social media data, and in particular with Twitter data, offer overviews of their methods for harvesting data through an API. An Application Programming Interface is a set of software components that allow third parties to connect to a given application or system and utilise its capacities using their own code. Most of these research accounts tend to make this process seem rather straightforward. Researchers can either write a programme themselves or utilise one of several tools that provide a WYSIWYG interface for connecting to the social networking platform, such as yourTwapperKeeper or COSMOS, or use a service such as ScraperWiki (to which I will return). However, what is little commented upon are the restrictions put on access to data through many of the social networking platform APIs, in particular Twitter. The aim of this blog post is to address some of the issues around access to data and what we are permitted to do with it.
Restrictions to ‘free’ access to Twitter data
The restrictions imposed on access to data and their possible uses have a direct effect upon the kinds of questions one can ask of the data, and indeed the kind of research we can conduct. What are these restrictions? In the case of Twitter, there are two particular API access points of interest:
- The streaming API, and
- the search API.
These both come with particular kinds of restrictions, which have the potential to affect the amount of data one can access. The streaming API effectively filters the full stream of all of the tweets being posted at any given time (named the ‘firehose’) down to 1% of the total (colloquially referred to as the ‘spritzer’) and the sampling method is not explained to users. As there are over 500,000,000 tweets per day, with an average in 2013 of 5,700 per second, 1% remains rather a lot of data. Nevertheless, as a sample it may be seen as problematic. For example, researchers have compared the 1% and firehose streams to statistically investigate how proportionate the ‘spritzer’ representation is of the full data set. Morstatter et al. (2013) suggest that for large datasets, or big issues that generate lots of traffic, the 1% is apparently fairly ‘faithful’ to the full stream, with a common set of top keywords and hashtags. However, for smaller datasets the spritzer appears to be a less faithful representation of all activity – this would mean researchers using the API would possibly need to be selective on the issues they study. Further, they suggest there is a ‘blackboxed’ bias in the 1% ‘spritzer’ API stream which diverges from random 1% samples they took from the ‘firehose’.
The search API is slightly more complicated. The data available is typically limited to the last week of activity, although for some search terms it may be slightly longer (this seems to vary). Access is governed by the number of requests to the API any given user can make in a set period (15 minutes). A user with an ‘access token’ can make 180 calls per 15 minutes fetching approximately 100 tweets per call. A user can utilise more than one access token but in their documentation Twitter allude to a limit on application-only authentication (without access tokens) of 450 calls per 15 minutes, so it might be reasonable to assume this is an absolute limit (I don’t have any experimental results to prove or disprove this).
As a thought experiment, if we assume that limit then the total amount of data accessible is 450 calls x 100 tweets per call, per four 15-minute periods (1 hour) = 180,000 tweets fetched per hour (in which period, on 2013 averages, 20,520,000 new tweets are added). Taken the other way around, if we assume that we can use lots of access tokens and we wanted to be opportunistic and harvest all tweets related to a phenomenon that occurred in the last three days with approximately 40,000,000 tweets in the corpus – we would need to collect all of those tweets in three days, as the oldest data is already three days old, and so we would need eight access tokens simultaneously gathering tweets, without any replication of data being harvested between them, for three solid days. There are two big assumptions here: first, we can use eight access tokens to harvest data at the maximum rate for 24 hours per day, without restriction; second, those accounts can be used so that only ‘fresh’ data is gathered, without replication across the eight.
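The arithmetic in this thought experiment can be checked with a minimal sketch. It uses only the figures quoted above (100 tweets per call, 180 or 450 calls per 15-minute window, 5,700 new tweets per second); the function names are hypothetical and the sketch makes no calls to Twitter itself.

```python
import math

TWEETS_PER_CALL = 100   # approximate tweets returned per search call (figure from the text)
WINDOWS_PER_HOUR = 4    # four 15-minute rate-limit windows per hour


def tweets_per_hour(calls_per_window: int) -> int:
    """Upper bound on tweets fetched per hour at a given call limit."""
    return calls_per_window * TWEETS_PER_CALL * WINDOWS_PER_HOUR


def tokens_needed(corpus_size: int, hours_available: int,
                  calls_per_window: int = 180) -> int:
    """Access tokens required to fetch a corpus in the time available,
    assuming no rate-limit hitches and no duplicated data between tokens."""
    per_token = tweets_per_hour(calls_per_window) * hours_available
    return math.ceil(corpus_size / per_token)


app_only_rate = tweets_per_hour(450)      # assumed absolute limit: 180,000 per hour
new_tweets_per_hour = 5_700 * 3_600       # 2013 average: 20,520,000 per hour
tokens = tokens_needed(40_000_000, 72)    # three days = 72 hours

print(app_only_rate, new_tweets_per_hour, tokens)
```

On these assumptions a single token fetches 72,000 tweets per hour, or 5,184,000 in three days, which is why eight tokens are needed for a corpus of 40,000,000.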
In both forms of the API access to Twitter we may be forgiven for thinking there’s not much wrong, lots of data is available. However, when a researcher begins to ask questions that they would like to answer with that data particular kinds of problem can arise. By and large, to get to the maximum figures indicated for the API, above, one needs to implement a bespoke programme to ensure dedicated access in order to maximise the rate of data collection. Equally, using multiple ‘access tokens’ will, most likely, result in gathering some duplicate data, which will need to be filtered and refined.
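As an illustration of the filtering step, the sketch below merges the batches gathered by several access tokens and keeps one copy of each tweet. It assumes each tweet is held as a dictionary carrying a unique `id_str` field; the function name `merge_unique` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List


def merge_unique(batches: Iterable[List[dict]]) -> List[dict]:
    """Merge batches of tweets, keeping the first copy seen of each tweet ID."""
    seen: Dict[str, dict] = {}
    for batch in batches:
        for tweet in batch:
            # setdefault leaves an existing entry untouched, so duplicates are dropped
            seen.setdefault(tweet["id_str"], tweet)
    return list(seen.values())


# Two tokens whose harvests overlap on tweet "2" (made-up sample data)
token_a = [{"id_str": "1", "text": "first"}, {"id_str": "2", "text": "second"}]
token_b = [{"id_str": "2", "text": "second"}, {"id_str": "3", "text": "third"}]

merged = merge_unique([token_a, token_b])
print([tweet["id_str"] for tweet in merged])
```

Deduplicating on the ID rather than the text means retweets and identical messages from different accounts are kept as separate records.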
In practice, when gathering data through the service ScraperWiki we often encountered rate limiting, which we were powerless to affect. Even with yourTwapperKeeper, for example, one needs to have better than average IT skills in order to implement an effective data collection method (see Bruns & Liang for an overview of what might be needed). This can, of course, be addressed by working with colleagues with the appropriate skills and may lead to interesting cross-disciplinary collaborations. However, should you wish to search the historical archive of tweets (for example: searching for tweets concerning the UK riots in 2011) this is not possible through the API and you will have to pay a commercial reseller of Twitter data, or ‘certified partner’ in the jargon, to get those data. Therefore, in order to have a chance at gathering data, researchers using the API need to be opportunistic and set ‘scrapes’ of data running as close in time to the activities of interest as possible.
Equally, if one uses broad enough search terms it is entirely possible that the volume of tweets matching the criteria is such that they drop out of the free-to-access pool of data before your search can reach them. Therefore, API-based data gathering for research is best suited to opportunistic, highly specific searches (such as the UK badger cull), rather than topics that significantly trend (such as anything to do with an international celebrity).
At the beginning of the Contagion project we accessed the API through the easy-to-use third party online system ScraperWiki. With that system it was easy for us to set up ‘scrapes’ for tweets and search and order the data we retrieved, download it and analyse it in various ways. However, earlier this year, ScraperWiki had their access to the Twitter API revoked. The tools for searching and collecting Twitter data were stopped and never reactivated. We have therefore had to seek alternative means of accessing data.
A political economy of ‘big data’
Perhaps the more serious issue to which this situation of access to data alludes is the proprietary nature of access, and indeed the data itself. While (largely unlimited) use of Twitter as a service is free to any user that signs up, access to the data on the platform is not. Twitter is, of course, a business. Just like many other ‘social’ platforms the data Twitter receives from its users is valuable and can be packaged as a commodity. There is therefore a political economy to this kind of ‘big data’ and accordingly political economic issues for ‘big data’ research.
Access is a commodity
If a researcher relies on the free API access to a platform, with its attendant vagaries of how much data one can access and for how long, then that researcher is at the mercy of the changing limits and access policies of that API. On the other hand, if one pays for access to data, to avoid the uncertainty of access (how much data and for how long), then expect to pay handsomely. Both main ‘certified partners’ that sell access to Twitter data, Datasift and Gnip (recently bought by Twitter), render access a commodity – you not only pay for the data but also for the processing power/time it takes to extract it and the ‘enrichments’ they add, by resolving shortened URLs for you, attributing sentiment to a given tweet (positive, neutral, negative) and so on.
The costs charged by ‘resellers’ of data are not insignificant in terms of typical research budgets, with some charging through a subscription model – requiring customers to commit for a minimum of six months. Twitter themselves have advertised their own ‘data grant’ scheme, which came into operation this year, and offered a limited number of opportunities to access data through a competitive application process, not dissimilar to funding grant calls. Of the 1300 applicants only 6 (or 0.5%) were granted data (the numbers here come from this Fortune article).
Data are proprietary goods
The corollary to gaining access to proprietary data is that the license one agrees to abide by for access to Twitter data states that you cannot share that data. Therefore, investing in any form of data access (via the API or a ‘reseller’) through publicly funded research is problematic. For we are all asked to submit data attained in a publicly-funded project to data archives to allow other researchers to access and use it, which is prohibited by Twitter’s Terms of Service (1.4.1). As others have observed, it is possible to get around this by archiving only the unique ID code for each tweet and leaving it up to any future researchers to download the tweets using those IDs, thereby not breaching the Terms of Service. However, with the limits to the API outlined above, for a large corpus of tweets (> 1m, say) this might take a rather long time. A quick calculation suggests that, using the statuses/lookup API with one ‘access token’, it would take roughly 13 hours 53 minutes of solid use of the API (at 100 tweets per request, 180 requests per 15 minutes = 72,000 tweets per hour, and without any hitches) to download 1 million tweets. Not impossible then, but perhaps significantly inconvenient – and reliant upon the system of unique IDs remaining the same for the foreseeable future. Furthermore, such restrictions may be suggested to run counter to the requirements set on research data gathered using UK research council funds. The (UK) ESRC, who funded Contagion, have general principles in their Research Data Policy that suggest:
- Publicly-funded research data are a public good, produced in the public interest.
- Publicly-funded research data should be openly available to the maximum extent possible.
This asks difficult questions of us as researchers: Should we be using data sources (however potentially interesting/valuable) that restrict the capability of reproducing our research results? Should we be using public funds to pay for data that are restricted in such ways?
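To make the cost of the ID-only workaround concrete, here is a minimal sketch of how an archive of tweet IDs might be split into request-sized batches, and how long ‘rehydrating’ them would take at the limits quoted above (100 IDs per request, 180 requests per 15 minutes). The function names are hypothetical and no request is actually sent to Twitter.

```python
import math
from typing import Iterator, List

IDS_PER_REQUEST = 100       # tweets returned per lookup request (figure from the text)
REQUESTS_PER_WINDOW = 180   # requests allowed per rate-limit window (figure from the text)
WINDOW_MINUTES = 15


def id_batches(tweet_ids: List[str],
               size: int = IDS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield successive request-sized slices of an archived ID list."""
    for start in range(0, len(tweet_ids), size):
        yield tweet_ids[start:start + size]


def rehydration_minutes(n_ids: int) -> float:
    """Minutes of uninterrupted API use needed to look up n_ids tweets."""
    requests = math.ceil(n_ids / IDS_PER_REQUEST)
    return requests / REQUESTS_PER_WINDOW * WINDOW_MINUTES


minutes = rehydration_minutes(1_000_000)
hours, mins = divmod(round(minutes), 60)
print(f"{hours} hours {mins} minutes")
```

The estimate is a lower bound: it assumes one access token, no failed requests, and that every archived ID still resolves to a tweet.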
Not free, not easy
Some argue that conducting research using Twitter data has become something of a fad across academe, but in practice it proves neither to be easy (without non-trivial IT expertise and/or understanding of the policies of Twitter as a company), nor free: it requires investment in terms of hours of work (designing and/or operating systems to collect, store and analyse the data), it may require paid access (depending on what kind of sample of data you require), and it comes with usage restrictions.
This has led to the principal arenas of Twitter-based research occurring outside of the academy – a lot of data science, in fact, is conducted by commercial organisations. Whether or not this research is meaningful is open to interpretation. Nevertheless, it remains the case that, as others have suggested, an awful lot of (computationally-driven) social science is being done by ‘non-academic’ researchers, amongst whom there are significant numbers of people with advanced levels of relevant IT skills. However, I argue that one of the unfortunate effects of this shift in the locus of research is a lack of criticality.
One might convincingly argue, for example, that there is an awful lot of data visualisation for its own sake. It doesn’t necessarily argue anything; instead it describes an impressive amount of data in a visually appealing manner. Equally, there is a tendency in some technically-led social research to assume that the context of data, or even the hypotheses one might pose and use that data to address, are secondary to its formatting or scale. For example, in a conversation with a salesperson for a data provider I was advised that as a geographer I ought to study the picture sharing platform Instagram because that had the highest take-up of geo-located content. What that content represents, or what kinds of questions we can or might ask of it, is therefore of secondary importance to the fact that there is geo-location metadata.
This is not to suggest that valuable ‘theory building’ research cannot be conducted through forms of data mining. We might not know the questions we can ask of the kinds (and scales) of data we are being faced with without performing exploratory analyses. Nevertheless, if we want to be surprised by the data (which may include concluding it is not particularly interesting for various reasons), as others have suggested, we surely need to implement critical forms of inquiry.
The point of this blog post is that to study social media data, and in particular Twitter data, is to concern oneself with emerging economies of data and their attendant politics. Commercial social networking platforms are not easy and plentiful sources of research data; they require hard work. It is hard to gain access to the data (as non-technical and non-wealthy academic researchers), and hard critical epistemological reflection is required upon what can and cannot be asked and/or concluded given the specificities of each kind of dataset and data source we use. The means of access, the APIs and other elements necessary to access the data, are important interlocutors in the stories we tell with these data.
It remains possible to do particular kinds of research with the Twitter data one can access through the APIs, but we have to think pretty carefully about what kinds of questions we can and should ask of these data, and the system from which they are derived.
This article appeared at the LSE’s Impact of Social Science blog, and originally at the Contagion research project blog. Contagion is a social science project funded by the UK Economic and Social Research Council.
Note: This article gives the views of the author, and not the position of USApp – American Politics and Policy, nor of the London School of Economics.
Sam Kinsley – University of Exeter
Sam Kinsley is a Lecturer in Human Geography at the University of Exeter and a Co-Investigator on the ESRC Transformative Social Science-funded project ‘Contagion’. His teaching, research and associated writing examine the cultural politics, material experience and spatial imaginations of technology.
This is a well researched post. I do think the 30 days of free Twitter data collection, combined with the collaborative “peer network” features of DiscoverText, alleviate some of these issues. You can certainly generate Tweet ID lists for rehydration and replication purposes, but linked peers on DiscoverText can legally collaborate on a dataset as long as it remains on the system. We have cleared this activity with Twitter. DiscoverText makes collaboration and replication a concrete possibility for academics studying Twitter. The additional tools for random sampling, filtering, searching, measurement of human coder reliability, machine-learning, and automated duplicate detection make DiscoverText unique in the academic research space.