By Professor Felix Ritchie and Anthea Springbett
UWE Bristol has recently been commissioned by the Office for National Statistics (ONS) to develop a course in ‘output checking’ for research data centres. This is where researchers working on confidential data have their statistical outputs checked before publication, to ensure that they don’t break the law by inadvertently releasing information about individuals; for example, without proper checks a table of earnings in a small village could reveal the income of the highest earner. This checking process is called ‘statistical disclosure control’, or SDC.
Output checking is a well-established field, and there are
experienced trainers and automatic tools to help those producing statistics.
Why then does ONS need a new course? The reason is that new forms of data, new
ways of working, and new types of users have all created a need for a different
kind of output checking.
SDC training is largely focused on the tables produced by
national statistical institutes (NSIs) such as ONS. NSI outputs have particular
demands: similar tables are produced year after year, multiple tables are
produced from the same data so consistency across tables is important, and NSIs
publish a lot of information about their tables, including sampling methods.
Research outputs are quite different. Researchers aim to
find new and interesting ways to extract meaning from data. Researchers choose
data based on the hypotheses they want to explore and sub-samples of the
population they are interested in, including or excluding data according to
their own criteria. Finally, and most importantly, researchers don’t tend to
produce detailed tables of the type NSIs generate; they are interested in multivariate
analysis, non-linear models, heat maps, survival functions… For researchers,
tables are often just used to describe the data before they get on to the
interesting stuff. As a result, the forty-odd years of SDC designed for NSIs is
of limited practical use in this environment.
For fifteen years, we have been developing an approach
designed specifically for the research environment; we call it ‘output SDC’
(OSDC) to emphasise that this is a general approach to outputs, not just tables
and not just for NSIs. There are two strands to this approach, one statistical
and one operational.
The statistical strand comes from the ‘evidence-based,
default-open, risk-managed, user-centred’ approach that we apply across our
work in confidential data management. The ways in which researchers use data, and the confidentiality risks those uses generate, require the output checker to be familiar with a wide range of statistics. We address this by classifying outputs into types with a higher or lower inherent risk, so that the checker can spend more time on the more ‘risky’ outputs. For these ‘risky’ outputs, context is everything. The traditional approach has been to apply simple yes/no rules (are there enough observations? are there any outliers?), but these can be a very blunt instrument. Our approach emphasises the use of evidence in decision-making, which places more of a burden on the output checker but increases the range of allowable outputs.
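To make the contrast concrete, the simple yes/no rules mentioned above might be sketched as follows. This is a minimal illustration only: the minimum cell count and dominance threshold are hypothetical assumptions, not ONS rules, and real facilities set their own parameters.

```python
# A minimal sketch of traditional rule-based output checking.
# The parameter values below are illustrative assumptions, not official policy.

MIN_OBSERVATIONS = 10   # hypothetical minimum number of observations per cell
DOMINANCE_SHARE = 0.5   # hypothetical cap on the largest contributor's share

def passes_threshold_rule(cell_counts):
    """Yes/no rule: does every table cell have enough observations?"""
    return all(n >= MIN_OBSERVATIONS for n in cell_counts)

def passes_dominance_rule(contributions):
    """Yes/no rule: could one dominant contributor be identified (e.g. the
    highest earner in a small village revealing their income)?"""
    total = sum(contributions)
    return total > 0 and max(contributions) / total < DOMINANCE_SHARE

# A cell with 12 earners, one of whom accounts for most of the total,
# passes the threshold rule yet fails the dominance rule -- showing why
# a single count-based rule is a blunt instrument.
earnings = [20_000] * 11 + [900_000]
print(passes_threshold_rule([len(earnings)]))  # True
print(passes_dominance_rule(earnings))         # False
```

The point of the evidence-based approach is precisely that such rules are a starting point for judgement, not a substitute for it.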
The operational strand reflects the fact that researchers are initially, on the whole, resistant to what they see as restrictions on output. A key part of the training will be helping output checkers build relationships with researchers; for example, emphasising that this is not about restricting output, but about keeping the researcher out of jail…
This is, we believe, the first formal course targeted specifically at (a) research outputs, and (b) those checking the outputs of researchers, rather than the producers of statistics themselves. ONS is sponsoring the development of this training for all interested UK organisations. Many overseas organisations also run facilities that vet researcher outputs. We hope therefore that this will be of interest to a wide range of organisations, and may prompt a sea change in the adoption of more general OSDC principles.