Posted on 12 December 2018 by Guy Goodwin, Chief Executive
The themes explored at the Cathie Marsh Memorial Lecture 2018.
Today’s a great day to be a statistician and a dreamer.
As someone who believes good quality evidence can improve people’s lives, I am a fan of the Census and our national social surveys, am excited that we’re making more use of administrative data, and believe that our work with unstructured, so-called “big” data can be transformative.
I love the idea that we’re going to be linking more data and doing more modelling and that statistician will be the sexiest job in the next decade … that’s only just over a year away (Hal Varian, Google). It has also been suggested that “Data is the new oil” (Clive Humby et al), although I still need a bit more convincing.
Generally, it’s difficult not to be optimistic and excited about the future for statistics. But am I right to be? Or am I more like Scott Fitzgerald’s Gatsby stretching out towards the green light at the end of Daisy’s dock?
I was delighted, against this backdrop, to be asked to provide some thoughts on where surveys might fit into the brave new world ahead, if at all, as part of the Cathie Marsh Memorial Lecture at the Royal Statistical Society. It was fitting to be doing this at our community’s annual tribute to Cathie who promoted the survey method (see “The Survey Method, the Contribution of Surveys to Sociological Explanation”, Catherine Marsh, 1982).
It was also fitting to be doing this a week after the publication of NHS Digital’s extraordinary showcase for the social survey - the national survey of the mental health of children and young people. It shows the scale and prevalence of mental health issues in England, analysable by a formidable range of pre-specified, tailored factors (such as e-cigarette use, bullying, drugs, social media use, exclusion from school) and picked up by the full range of interested parties including policy makers.
The data collected show that a quarter of those children with mental health issues had contact with a mental health specialist, almost a half had informal support of some type and a significant minority had no support.
The survey itself represents a challenge to those occasional siren voices in my head that say that … just perhaps … some time in the future … we may be able to “bottle up the complexity of our human lives” and pre- or post-populate social surveys using administrative or big data in the same way as we would like to for our business and school surveys, seeing them as a source of last resort.
Indeed, I recall with rather fond memories, during student days at the London School of Economics, one of my colleagues comparing surveys to Heineken beer … “they reach the parts of our lives that other sources can’t reach”.
Definition of a survey
Cathie Marsh gave both a long and a short definition of the survey in her book “The Survey Method”, and the condensed version:
“… an inquiry which involves the collection of systematic data across a sample of cases and the statistical analysis of the results”
is perfectly fine for the purposes of this paper. I am also going to restrict my comments to high quality social surveys, the rather expensive type.
Where are we today?
The reality currently is that:
- Surveys are increasingly being perceived as unfashionable, cost being the key driver but not the only one. Ideology is an issue too, as well as concerns around the public’s willingness to co-operate voluntarily;
- They are still being widely used by customers.
For example, last month, the Guardian informed us that “UK millennials’ costs are among EU’s highest - But pay lags behind”.
The Guardian conscientiously lists numerous caveats with the analysis: the data are not of Office for National Statistics’ quality, they’re not Eurostat data, they are simply data from Revolut’s customer database so they may not be representative or generalisable but “importantly, the data is not survey evidence but real averages”. That’s ok then. Let’s compare 1.3 million UK customers with 100 thousand or so Swiss ones and generalise the findings anyway.
I note that surveys are a particular challenge to those who have, what I think of as, a “Gigantism” view of the world, those who feel that the data revolution is going to sweep everything else from the past or present away in its path, who start from an exclusion mentality rather than an integrated sources model of the future. The latter I continue to believe has more potential.
It’s also worth logging the on-going and rather remarkable bond established over the years between the social survey researcher and the policy department in the UK which is largely based around the level of control we have with the survey vehicle so that it can be tailored to the questions the customer wants answered.
The Survey in the Future
As for the future, certainly over the next decade, I would argue that for our surveys and longitudinal studies to continue to thrive in an inevitably more congested data space, we will see:
- Different types of surveys, collected in new, more innovative ways, including a change of data collection mode (to web/mobile first) and new uses of technology (such as “wearables”);
- A repositioning and reclaiming of the survey for social research. There is a reason why Moser and Kalton called their book “Survey Methods in Social Investigation” not “Survey Methods to collect basic characteristics on the population in between censuses because our Census is only once every 10 years and the quality of our administrative data is poor”. For example, the reason why we decided to collect data on religion on the Labour Force Survey in the UK was nothing to do with a desire to analyse the labour market by religion; it was simply to get basic counts from our largest household survey in between Censuses. As administrative and big data become of better quality, you would hope that you might increasingly get basic population data from them, such as on ethnicity, religion, language and sexual identity; currently, it remains a challenge;
- The linking of survey and longitudinal study data with administrative data records to increase their utility and value. In my view, the value of such linkages is currently under-estimated; linking different sources to maximise combined strengths has significant potential.
The Role of the Survey
For those in our community who are more cynical about the future of the social survey, I would challenge you to evidence it. I suspect that is quite difficult at the current time, although it might become easier as big data develops.
Take any of our large complex national surveys and put the variables collected into three categories:
1. Those we can pre- or post-populate now;
2. Those we are likely to be able to in the future;
3. Those we may struggle with for some time to come.
I think we can agree that relatively little fits into the first category yet and there is still much to do to change that. The bigger challenge though may be the variables that are possibly in the third category and their value to survey commissioners.
Take health as an example and the sorts of questions we currently answer from our national surveys:
- What proportion of the population is obese?
- Measurement of disease prevalence in the population (diagnosed & undiagnosed);
- Is the increase in the prevalence of diabetes a ‘real’ change or are we just diagnosing more?
- What proportion of the population suffers from depression?
- Measuring the impact of the smoking ban on children’s exposure to passive smoke?
- Who is most likely to have undiagnosed hypertension and how can health services target screening?
- Understanding the social determinants of health using rich demographic data;
- How does loneliness impact on health?
- The association between behavioural & risk factors and health outcomes;
- International comparisons of health condition prevalence between countries with different health care systems.
And these are just some questions about the what, who, how … and not “Why?”.
It’s lovely to lead an organisation, the National Centre for Social Research (NatCen), that cares so much about the “Why” because of its relationship to policy interventions and impact.
Surveys are good, of course, around the “why?” but we perhaps don’t probe these questions as much as we could or should. Nor does the survey have a monopoly here. It is plausible to find out “why?” through qualitative approaches and, in some cases, through big data.
Take an example: I am overweight but not obese. My GP records will confirm this, at my 50th birthday check-up, and some encrypted version of that information could already have found its way onto our national statistical databases. But the administrative data won’t tell you why and it is crucial if you’re interested in policy interventions. It’s also not difficult to find out from asking a small number of probing questions. Certainly, one reason results from my dislike of throwing food away and, as someone who also lives on their own, it’s not difficult to see how I accumulate potential food waste. At least, this is currently how I justify eating a cold Yorkshire pudding and half a toffee cheesecake from the night before for breakfast (waste not, want not!).
Our lives are getting ever more complex. This is perhaps nowhere better exemplified than by the number, types and nature of our family formations and relationships as we move away from monitoring marital events to a wider range of concepts including cohabitation, divorce, re-partnering, living apart together, civil partnerships and so on.
Surveys can really help with complexity … because they have that undoubted advantage of you being in control of what you collect and being able to tailor the design to the end uses, rather than data resulting as a by-product of an administrative process. Linkage of sources, including administrative data to survey sources, can then bring substantial added value.
Advantages and disadvantages of the best surveys
The advantages of the survey method of data collection are well rehearsed in the literature and I don’t feel qualified enough to add to those. But I note that they must not be underestimated because they are often not evident in other sources.
For example, data from all sources will typically have some level of uncertainty and bias attached to them, but you will not normally be able to produce an effective statistical measure of the level of that uncertainty or adjust for bias, as you can do for the best surveys.
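To make the point about measurable uncertainty concrete, here is a minimal sketch of the textbook calculation for a survey proportion. It assumes a simple random sample and uses made-up figures (250 "yes" answers out of 1,000 respondents); real national surveys use more complex designs and would need design-based variance estimates. The function name `proportion_ci` is purely illustrative.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Estimate a proportion and an approximate 95% confidence interval.

    Assumes simple random sampling (an assumption made for this sketch);
    z=1.96 is the normal quantile for a 95% interval.
    """
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the proportion
    return p, p - z * se, p + z * se

# Hypothetical figures: 250 of 1,000 respondents answer "yes".
p, low, high = proportion_ci(250, 1000)
print(f"estimate={p:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The interval narrows as the sample grows, but only a known sampling design lets you make this kind of statement at all, which is the advantage being claimed for the best surveys.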
Some of the advantages and disadvantages are as follows:
- Simple, Control over design & handle complexity
- Ensure consistency over time (or measure discontinuities)
- Sampling - frame, representative, don’t need to ask everyone
- Quality & measure uncertainty/ give confidence levels in results
- Adjust for non-response bias
- Hard to reach groups
- Small area/sub-group estimates
- Cost and timeliness
- Response and Non-response bias/Response rates
- Hard to reach groups
While the disadvantages of surveys are undoubtedly real, those highlighted above are worthy of an enhanced level of debate in the future and some scrutiny - for example, the relationship between marginal increases in response rates and non-response bias. It is likely that the current survey approaches are not yet optimal in terms of quality and cost.
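As an illustration of what adjusting for non-response bias can look like, the sketch below applies a simple post-stratification adjustment. It is one common approach among several, not a description of any particular survey's method, and all the numbers, group labels and function names are hypothetical: two age groups are equally sized in the population, but the older group responds far more often.

```python
def unweighted_rate(groups):
    """Share answering 'yes' among all respondents, ignoring who responded."""
    yes = sum(g["yes"] for g in groups.values())
    n = sum(g["n"] for g in groups.values())
    return yes / n

def poststratified_rate(groups):
    """Weight each group's observed rate by its known population share."""
    return sum(g["pop_share"] * g["yes"] / g["n"] for g in groups.values())

# Hypothetical respondents: each group is half the population,
# but the under-40s make up only 20 of the 100 respondents.
groups = {
    "under_40": {"n": 20, "yes": 10, "pop_share": 0.5},
    "40_plus": {"n": 80, "yes": 16, "pop_share": 0.5},
}

raw = unweighted_rate(groups)           # dominated by the over-represented group
adjusted = poststratified_rate(groups)  # rebalanced to population shares
print(f"unweighted={raw:.2f}, post-stratified={adjusted:.2f}")
```

The adjustment only removes bias to the extent that respondents and non-respondents within each group are alike, which is why the link between response rates and non-response bias deserves the scrutiny suggested above.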
Timeliness is another frequently quoted disadvantage that is worthy of attention, especially around the current obsession with real-time data and a presumption that social survey data that are a couple of years old are out of date. In practice, people are not share prices and society often changes slowly over many years, indeed decades. For example, we have seen a remarkable change in my lifetime in the public’s attitudes to same sex relationships according to NatCen’s British Social Attitudes Survey … but it is still a change over three decades.
It goes without saying that we need real-time data when they need to be used in real-time, such as:
- Real-time monitoring of services (for example, a patient being treated for a heart condition);
- Trying to track terrorism activity on the web and stop it or to track child sex offenders using online chat rooms;
- Enhancing your winning chances at gambling activities.
Of course, administrative and big data may well additionally help (on top of the usual management information) in identifying, say, new epidemics at an early stage or short-term turning points and trends in series.
Technology will enable surveys to be turned around much more quickly in the future, indeed within three to six months in most cases (so the constraints will normally be expense and expertise, rather than technology). For most purposes, those turnaround times are not problematic for the policy maker or social commentator. It’s difficult to argue that issues such as demographic growth, inequalities, obesity, mental health, poverty and ageing are moving fast or unpredictably in real time.
Big Data & Administrative Data
We should recognise that we still have much work to do to create the basis for an integrated sources model. Issues around provenance, governance, frameworks, quality, methods, people and skills, data ethics and privacy persist.
Even though our technology may have advanced, the issues with administrative data are often more about politics and policies, the law, their scope and quality.
The truth is that our administrative data research centres didn’t work as we wanted them to and there are few labour market statisticians punching the air in delight as the new universal credit system is being rolled out. Indeed, twenty years on from the White Paper “Building Trust in Statistics”, which I helped write, how many of us have even now got access to the microdata for key administrative datasets for research purposes? Access may come of course, through the recent Digital Economy Act, but we are just starting on a new part of our journey.
The excellent paper by Professor David Hand, “Statistical Challenges of administrative and transaction data”, presented to the RSS in November 2017, identifies 15 specific challenges.
These are not insurmountable problems, necessarily, but they are considerable challenges all the same.
My own experience overseeing ONS’s crime statistics evidenced the scale of these challenges. Even one police force area putting out a new guidance note can bring about a discontinuity in the data. Sorry for the joke - but how do you police that?
But we have good reason to be hopeful …
There are wonderful opportunities for statisticians and data scientists looking forward. Technology will allow us to do amazing things with large volumes of data to improve people’s lives and the linkage of different sources will help advance this agenda. We should not lose sight of that.
I’d like to pay tribute to the many people in our community working with surveys and longitudinal studies for the incredible insights they already give into life in the UK. I’m confident surveys and longitudinal studies are going to remain a key part of our infrastructure for some time to come and we should welcome that, indeed embrace it.
It’s not a question of “either … or” but a question of “both … and more” when it comes to administrative & big data and surveys.
Having said that, I note that we can still do with surveys much of what administrative data can do in relation to population characteristics at the national level (but not usually at small area level). The reverse is not yet true and therefore the survey may well retain a degree of primacy with our policy customers in the short term. So, is there a future ahead for surveys? Yes, of course.
There is a risk that data saturation in the future may both devalue quality and lead to its reduction over time. I sense that whether the customer remains king or queen in the future (our data being demand or supply driven) will increasingly become an issue as we use administrative data more. To what extent will our customers have to self-serve from what’s available on our integrated databases, rather than specify what they need?
Whether our brave new statistical world fully delivers what we want – a coherent but highly relevant integrated statistical system – is still to be proven.
We should note that there is much work to do before our excitement can be turned into reality.