This article originally appeared in the March issue of Research Matters magazine, published quarterly by the Social Research Association (SRA). Back issues are made available to non-members three months after publication.
The social sciences have seen increasing interest in the use of social media data in research. Researchers may, for example, be attracted by large sample sizes; the ability to reach rarer populations; the quantity and richness of data; or the speed and accessibility of data that some platforms may offer. However, as with any research methodology, the use of social media data has its challenges. For example, social media data are not created for research purposes: people may not produce content on the topic a researcher is interested in, or key analysis variables may be missing. There are also concerns about the representativeness of social media data: are social media users representative of a wider population? Are the data biased towards the most vocal? And how do you ensure a sample only includes the target population (for example adults in the UK and not Russian ‘bots’)?
Along with researchers from Cardiff University and the University of Essex, I have been looking at the feasibility of linking survey and social media (specifically Twitter) data to enhance both data sources and address some of these challenges. As with other forms of data linkage, survey data can benefit from additional data covering areas not included in the original questionnaire, perhaps because of their complexity, because they were not initially a topic of interest, or due to space limitations. At the same time, social media data can benefit from the structure and direction of survey data. For example, if we are confident in the quality of our survey sample, we can be confident that our social media data are also from our target population. We can also begin to analyse how the social media data vary between different groups of interest – such as different age or income groups – assuming that information is collected in the survey. The linked data also provide the opportunity to improve existing approaches – social media researchers can use a linked dataset to help validate classification algorithms against ‘ground truth’ survey data, while longitudinal survey researchers can use the data for inter-wave measurement or non-response adjustments.
Feasibility and practicalities
So far, we have collected consent to link data for two different nationally representative surveys: Wave 10 of the UK Household Longitudinal Study Innovation Panel (UKHLS IP) and the July 2017 wave of the NatCen Panel, which respectively use sequential mixed-mode web/face-to-face and web/telephone fieldwork designs. In both cases, all participants with a Twitter account were given information in the survey about the data we would like to collect, why we wanted it, and what we planned to do with it. They were then asked for their Twitter handle and for their consent for us to collect their Twitter data and link it to their survey data. With informed consent and a Twitter handle, a participant’s publicly available Twitter data can then be linked to their survey data.
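The linkage step described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual pipeline: the field names (`respondent_id`, `consented`, `handle`), the example records and the helper functions are all hypothetical. The key point is that tweets are attached only for respondents who both consented and supplied a handle.

```python
# Hypothetical survey records: one row per respondent (all values invented).
survey = [
    {"respondent_id": 1, "age_group": "18-34", "consented": True, "handle": "@Example_One"},
    {"respondent_id": 2, "age_group": "35-54", "consented": False, "handle": None},
    {"respondent_id": 3, "age_group": "55+", "consented": False, "handle": None},
]

# Hypothetical public tweets, keyed by the handle they were collected under.
tweets = [
    {"handle": "example_one", "text": "First tweet"},
    {"handle": "example_one", "text": "Second tweet"},
    {"handle": "someone_else", "text": "Not in the survey"},
]


def normalise_handle(handle):
    """Strip whitespace and a leading '@', and lower-case the handle."""
    return handle.strip().lstrip("@").lower()


def link_records(survey_rows, tweet_rows):
    """Attach tweets to survey rows, but only where consent and a handle exist."""
    tweets_by_handle = {}
    for tweet in tweet_rows:
        key = normalise_handle(tweet["handle"])
        tweets_by_handle.setdefault(key, []).append(tweet["text"])

    linked = []
    for row in survey_rows:
        if not (row["consented"] and row["handle"]):
            continue  # no consent or no handle: this respondent is never linked
        key = normalise_handle(row["handle"])
        linked.append({
            "respondent_id": row["respondent_id"],
            "age_group": row["age_group"],
            "tweets": tweets_by_handle.get(key, []),
        })
    return linked


linked = link_records(survey, tweets)
```

Handles are normalised before matching because participants may type them with or without the ‘@’ and in mixed case. Tweets from accounts that are not in the consenting sample are simply never attached to a survey record.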
Collecting informed consent is a key methodological challenge. All studies requiring consent for data linkage must strike a balance between providing enough information for consent to be ‘informed’ and not overwhelming participants. We have found that consent rates have been relatively low (compared to, for example, consent to link government administrative records). Only 27% of Twitter users on the NatCen Panel and 31% on the UKHLS IP agreed to their data being collected and linked. This can limit statistical power and risks introducing non-response bias into the sample, although our initial analysis suggests that few sociodemographic characteristics consistently affected consent outcomes. However, other studies have achieved consent rates as high as 90%, suggesting that low consent rates can be addressed.
Another practical issue is how to analyse the data securely. In their raw form, Twitter data (and any survey data they are linked to) are identifiable even when pseudonymised: Tweet text can be searched online to identify an individual, and anonymising the text would undermine its utility. Instead, we recommend focusing on the systematic processing of the data, using data reduction, controlled access, and data deletion to keep them secure.
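As a rough illustration of what data reduction and deletion might look like in practice, the Python sketch below replaces the handle with a keyed hash and reduces raw Tweets to derived counts that contain no Tweet text. Everything here is an assumption made for the example rather than the project's own procedure: the secret key, the `pseudonymise` and `reduce_tweets` helpers, the keyword list and the example Tweets are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored separately from the
# data and made available only to approved users (controlled access).
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymise(handle, key=SECRET_KEY):
    """Return a keyed hash of the handle, which cannot be recomputed without the key."""
    normalised = handle.strip().lstrip("@").lower().encode("utf-8")
    return hmac.new(key, normalised, hashlib.sha256).hexdigest()[:16]


def reduce_tweets(handle, tweet_texts, keywords):
    """Reduce raw Tweets to derived variables that contain no Tweet text."""
    lowered = [text.lower() for text in tweet_texts]
    return {
        "pseudonym": pseudonymise(handle),
        "n_tweets": len(lowered),
        "n_keyword_tweets": sum(
            any(keyword in text for keyword in keywords) for text in lowered
        ),
    }


# Invented example Tweets for one consenting participant.
raw_tweets = ["Voting in the election today", "Nice weather", "Election results are in"]
derived = reduce_tweets("@Example_One", raw_tweets, keywords=["election"])

# Data deletion, illustrated only: this drops the in-memory reference. A real
# workflow would also delete the stored raw files once they are no longer needed.
del raw_tweets
```

Because the derived record holds only counts and a pseudonym, it cannot be searched online in the way raw Tweet text can. The trade-off is that any analysis needing the text itself has to be done before the reduction step, inside the controlled environment.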
Our research so far has demonstrated the feasibility of linking survey and Twitter data, overcoming some challenges but also finding new ones. Looking ahead, we will continue to explore the value of this approach to social researchers as part of our ESRC-funded project, ‘Understanding [Offline/Online] Society: Linking Surveys with Twitter Data’. We aim to explore the issues and concerns people might have about these data being linked for research purposes, as well as investigating how datasets can be securely archived and made available to the research community.
Although we have conducted some simple analysis of election data to demonstrate the approach, few studies have so far applied it in a substantive context. As part of this project, we will apply the methodology in a study of public attitudes to minority ethnic groups, both to demonstrate its potential and to ‘live test’ the theoretical approaches we have developed.
Looking further ahead, we want to move ‘beyond Twitter’. Twitter’s open API (application programming interface), relative popularity and ‘public broadcast’ nature make it well suited for analysis, but it forms only part of many people’s digital lives. Including other social media platforms, apps, websites and devices would give a more holistic perspective. In addition, we shouldn’t assume that Twitter will continue to exist in its current (or any) form in the future. As such, the work we are doing should be viewed as part of broader methodological research combining survey data with publicly available, identifiable data.
Al Baghal, T., Sloan, L., Jessop, C., Williams, M., and Burnap, P. (2019). ‘Linking Twitter and Survey Data: The Impact of Survey Mode and Demographics on Consent Rates Across Three UK Studies’. Social Science Computer Review.
Wojcik, S., Hughes, A., and Cohn, S. (2019). ‘Sizing Up Twitter Users’. Pew Research Center.
Sloan, L., Jessop, C., Al Baghal, T., and Williams, M. (2019). ‘Linking Survey and Twitter Data: Ethics, Consent, Anonymity, Archiving and Sharing’. Journal of Empirical Research on Human Research Ethics.