Checking research outputs for confidentiality risks

By Professor Felix Ritchie and Anthea Springbett

UWE Bristol has recently been commissioned by the Office for National Statistics (ONS) to develop a course in ‘output checking’ for research data centres. This is where researchers working on confidential data have their statistical outputs checked before publication, to ensure that they don’t break the law by inadvertently releasing information about individuals; for example, without proper checks a table of earnings in a small village could reveal the income of the highest earner. This checking process is called ‘statistical disclosure control’, or SDC.

Output checking is a well-established field, and there are experienced trainers and automatic tools to help those producing statistics. Why then does ONS need a new course? The reason is that new forms of data, new ways of working, and new types of users have all created a need for a different kind of output checking.

SDC training is largely focused on the tables produced by national statistical institutes (NSIs) such as ONS. NSI outputs have particular demands: similar tables are produced year after year, multiple tables are produced from the same data so consistency across tables is important, and NSIs publish a lot of information about their tables, including sampling methods.

Research outputs are quite different. Researchers aim to find new and interesting ways to extract meaning from data. Researchers choose data based on the hypotheses they want to explore and sub-samples of the population they are interested in, including or excluding data according to their own criteria. Finally, and most importantly, researchers don’t tend to produce detailed tables of the type NSIs generate; they are interested in multivariate analysis, non-linear models, heat maps, survival functions… For researchers, tables are often just used to describe the data before they get on to the interesting stuff. As a result, the forty-odd years of SDC designed for NSIs is of limited practical use in this environment.

For fifteen years, we have been developing an approach designed specifically for the research environment; we call it ‘output SDC’ (OSDC) to emphasise that this is a general approach to outputs, not just tables and not just for NSIs. There are two strands to this approach, one statistical and one operational.

The statistical strand comes from the ‘evidence-based, default-open, risk-managed, user-centred’ approach that we apply across our work in confidential data management. The way that researchers use data, and the confidentiality risks that they generate, require the output checker to be familiar with a wide range of statistics, which we address through classifying outputs into types, with a higher or lower inherent risk; this allows the checker to spend more time on the more ‘risky’ outputs. For these ‘risky’ outputs, context is everything. The traditional approach has been to apply simple yes/no rules (are there enough observations? Are there any outliers?) but this can be a very blunt instrument. Our approach emphasises the use of evidence in decision-making, which places more of a burden on the output-checker but increases the range of allowable outputs.
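To make the traditional yes/no rules concrete, here is a minimal sketch in Python of the two checks mentioned above: a minimum number of observations, and a test for a single dominant value (an outlier such as the highest earner in a small village). The function name `check_cell` and the thresholds `min_count=10` and `max_share=0.5` are illustrative assumptions, not rules used by ONS or any particular data centre.

```python
from dataclasses import dataclass


@dataclass
class CellCheck:
    """Result of applying two simple yes/no rules to one table cell."""
    enough_observations: bool
    no_dominant_unit: bool

    @property
    def passes(self) -> bool:
        return self.enough_observations and self.no_dominant_unit


def check_cell(values, min_count=10, max_share=0.5):
    """Check the values behind one cell (thresholds are hypothetical).

    values    -- the individual contributions summed or averaged in the cell
    min_count -- assumed minimum number of observations
    max_share -- assumed maximum share of the total for any single unit
    """
    total = sum(values)
    enough = len(values) >= min_count
    # Treat an empty or zero-total cell as dominated, so that it fails.
    largest_share = max(values) / total if values and total > 0 else 1.0
    return CellCheck(enough, largest_share <= max_share)


# A small village: four earners, one of whom dominates the total.
village = [18_000, 21_000, 19_500, 250_000]
result = check_cell(village)
print(result.passes)
```

The village cell fails both rules: there are too few observations, and one earner accounts for most of the total. A sketch like this also shows why such rules are blunt. They return the same answer whatever the context, whereas the evidence-based approach asks the checker to judge whether a disclosure risk actually exists.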

The operational strand reflects the fact that researchers are initially, on the whole, resistant to what they see as restrictions on output. A key part of the training will be helping output checkers build relationships with researchers; for example, emphasising that this is not about restricting output, but about keeping the researcher out of jail…

This is, we believe, the first formal course targeted specifically at (a) research outputs, and (b) those checking the outputs of the researchers, rather than the producers of statistics themselves. ONS is sponsoring the development of this training for all interested UK organisations. Many overseas organisations also run facilities that vet researcher outputs. We hope therefore that this will be of interest to a wide range of organisations, and may prompt a sea change in the adoption of more general OSDC principles.

Australia’s bold proposals for government data sharing

By Felix Ritchie.

In August I spent a week in Australia working with the new Office of the National Data Commissioner (ONDC). The ONDC, set up at the beginning of July, is barely two months old but has been charged with the objective of getting a whole-of-government approach to data sharing ready for legislation early in 2019.

This is a mammoth undertaking, not least because the approach set out in the ONDC’s Issues Paper proposes a new way of regulating data management. Rather than the traditional approach of trying to specify in legislation exactly what may or may not be allowed, the ONDC is proposing a principles-based approach: this focuses on setting out the objectives of any data sharing and the appropriate mechanisms by which access is governed and regulated.

In this model, the function of legislation is to provide the ground rules for data sharing and management within which operational decisions can be made efficiently. This places the onus on data managers and those wanting to share data to ensure that their solutions are demonstrably ethical, fair, appropriate and sensible. On the other hand, it also frees up planners to respond to changing circumstances: new technologies, new demands, shifts in attitudes, the unexpected…

The broad idea of this is not completely novel. In recent years, the principles-based approach to data management in government has increasingly come to be seen as operational best practice, allowing as it does for flexibility and efficiency in response to local conditions. It has even been brought into some legislation, including the UK’s Digital Economy Act 2017 and the European General Data Protection Regulation. Finally, the monumental Australian Productivity Commission report of 2017 laid out much of the groundwork, by providing an authoritative evidence base and a detailed analysis of core concepts and options.

In pulling these strands together, the ONDC proposals move well beyond current legislation but into territory which is well supported by evidence. Because of the unfamiliarity with some of the concepts, the ONDC has been carrying out an extensive consultation, some of which I was able to observe and participate in.

A key proposal is to develop five ‘Data Sharing Principles’, based on the Five Safes framework (why, who, how, with what detail, with what outcomes) as the overarching structure. The Five Safes is the most widely used model for government data access but has only been used twice before to frame legislation, in the South Australia Public Sector (Data Sharing) Act 2016 and the UK Digital Economy Act 2017.

The most difficult issues facing the ONDC arise from the ‘why’ domain: what is the public benefit in sharing data and the concomitant risk to an individual’s privacy? How will ‘need-to-know’ for data detail be assessed? What are the mechanisms to prevent unauthorised on-sharing of data? How will shared data be managed over its lifecycle, including disposal? To what uses can shared data be put? Can data be shared for compliance purposes? How can proposals be challenged?

These are all good questions, but they are not new: any ethics or approvals board worth its salt asks similar questions, and would expect good answers before it allows data collection, sharing or analysis to proceed. A good ethics board also knows that this is not a checklist: ethical approval should be a constructive conversation to ensure a rock-solid understanding of what you’re trying to achieve and the risks you’re accepting to do so.

This is also the crux of the principles-based approach being taken by the ONDC: it is not for the law to specify how things should be done, nor to specify what data sources can be shared. But the law does provide the mechanisms to ensure that any proposals put forward can be assessed against a clear purpose test around when data may and may not be shared, and that appropriate safeguards are in place…

Finally, the law will require transparency; this has to be done in sunlight. A public body, using public money and resources for the public benefit, should be able to answer the hard questions in the public arena; otherwise, where is the accountability? The ONDC will require data sharing agreements to be publicly available, so people can see for what purpose (and with what associated protections) their data are being used.

To some, this need to justify activities on a case-by-case basis, rather than having a black-and-white yes/no rule, might seem like an extra burden. The aim of the consultation is to ensure that this isn’t the case. In fact, a transparent, multi-dimensional assessment is any project’s best friend: it provides critical input at the design stage and helps to spot gaps in planning or potential problems, as well as giving opponents a clear opportunity to raise objections.

Of course, even if the legislation is put in place, there is still no guarantee that it will turn out as planned. As I have written many times (for example in 2016), attitudes are what matter. The best legislation or regulation in the world can be derailed by individuals unwilling to accept the process. This is why the consultation process is so important. This is also why the ONDC has been charged with the broader role of changing the Australian public sector culture around data sharing, which tends to be risk-averse. The ONDC also has a role to build and maintain trust with the public through better engagement to hear their concerns.

From my perspective, this is a fascinating time. The ONDC’s proposals are bold but built on a solid foundation of evidence. In theory, they propose a ground-breaking way to offer a holy trinity of flexibility, accountability, and responsibility. If the legislation ultimately reflects the initial proposals, then I suspect many other governments will be beating a path to Australia’s door.

All opinions expressed are those of the author.