By Professor Felix Ritchie and Anthea Springbett
UWE Bristol has recently been commissioned by the Office for National Statistics (ONS) to develop a course in ‘output checking’ for research data centres. This is where researchers working on confidential data have their statistical outputs checked before publication, to ensure that they don’t break the law by inadvertently releasing information about individuals; for example, without proper checks a table of earnings in a small village could reveal the income of the highest earner. This checking process is called ‘statistical disclosure control’, or SDC.
Output checking is a well-established field, and there are experienced trainers and automatic tools to help those producing statistics. Why then does ONS need a new course? The reason is that new forms of data, new ways of working, and new types of users have all created a need for a different kind of output checking.
SDC training is largely focused on the tables produced by national statistical institutes (NSIs) such as ONS. NSI outputs have particular demands: similar tables are produced year after year, multiple tables are produced from the same data so consistency across tables is important, and NSIs publish a lot of information about their tables, including sampling methods.
Research outputs are quite different. Researchers aim to find new and interesting ways to extract meaning from data. Researchers choose data based on the hypotheses they want to explore and sub-samples of the population they are interested in, including or excluding data according to their own criteria. Finally, and most importantly, researchers don’t tend to produce detailed tables of the type NSIs generate; they are interested in multivariate analysis, non-linear models, heat maps, survival functions… For researchers, tables are often just used to describe the data before they get on to the interesting stuff. As a result, the forty-odd years of SDC designed for NSIs is of limited practical use in this environment.
For fifteen years, we have been developing an approach designed specifically for the research environment; we call it ‘output SDC’ (OSDC) to emphasise that this is a general approach to outputs, not just tables and not just for NSIs. There are two strands to this approach, one statistical and one operational.
The statistical strand comes from the ‘evidence-based, default-open, risk-managed, user-centred’ approach that we apply across our work in confidential data management. The way that researchers use data, and the confidentiality risks that they generate, require the output checker to be familiar with a wide range of statistics. We address this by classifying outputs into types with higher or lower inherent risk; this allows the checker to spend more time on the more ‘risky’ outputs. For these ‘risky’ outputs, context is everything. The traditional approach has been to apply simple yes/no rules (are there enough observations? are there any outliers?), but this can be a very blunt instrument. Our approach emphasises the use of evidence in decision-making, which places more of a burden on the output checker but increases the range of allowable outputs.
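To make the ‘simple yes/no rules’ concrete, here is a minimal Python sketch of the two questions mentioned above: are there enough observations, and does one outlier dominate? The function names and the thresholds (10 observations, a 50% share) are hypothetical values chosen for illustration; they are not the rules used by ONS or any particular research data centre.

```python
# Illustrative sketch of traditional yes/no output checks.
# The thresholds below are assumptions for this example, not official rules.
MIN_OBSERVATIONS = 10   # hypothetical minimum number of contributors
MAX_TOP_SHARE = 0.5     # hypothetical maximum share for the largest contributor


def passes_threshold_rule(values, min_obs=MIN_OBSERVATIONS):
    """Are there enough observations behind this statistic?"""
    return len(values) >= min_obs


def passes_dominance_rule(values, max_share=MAX_TOP_SHARE):
    """Is the total free of a single dominating contributor (an outlier)?"""
    total = sum(abs(v) for v in values)
    if total == 0:
        return True  # nothing to dominate
    return max(abs(v) for v in values) / total <= max_share


def check_cell(values):
    """Apply both yes/no rules to the values behind one table cell."""
    return {
        "enough_observations": passes_threshold_rule(values),
        "no_dominant_outlier": passes_dominance_rule(values),
    }


# A small village: eleven modest earners and one very high earner.
village_earnings = [20_000] * 11 + [2_000_000]
result = check_cell(village_earnings)
```

In this made-up village the cell has enough observations but fails the dominance check, because one earner accounts for most of the total. Rules like these give a quick answer but take no account of context, which is why the approach described here asks the checker to weigh evidence rather than rely on thresholds alone.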
The operational strand reflects the fact that researchers are initially, on the whole, resistant to what they see as restrictions on output. A key part of the training will be helping output checkers build relationships with researchers; for example, emphasising that this is not about restricting output, but about keeping the researcher out of jail…
This is, we believe, the first formal course targeted specifically at (a) research outputs, and (b) those checking the outputs of the researchers, rather than the producers of statistics themselves. ONS is sponsoring the development of this training for all interested UK organisations. Many overseas organisations also run facilities that vet researcher outputs. We hope therefore that this will be of interest to a wide range of organisations, and may prompt a sea change in the adoption of more general OSDC principles.