Training Researchers to Work with Confidential Data: A New Approach

Prof Felix Ritchie of UWE’s Business School has recently spent time with the Northern Ireland Statistics and Research Agency and makes the following analysis.

I’ve just spent two days at the Northern Ireland Statistics and Research Agency (NISRA), working with them to develop training for researchers who need access to the confidential data held by NISRA for research. This training is jointly being developed by the statistical agencies of the UK (NISRA, the General Register Office for Scotland, and the Office for National Statistics in England and Wales), as well as HMRC, the UK Data Archive and academic partners. The project is being led by ONS as part of its role to accredit researchers under the new Digital Economy Act, with UWE providing key input; other statistical agencies, such as INSEE in France and the Australian Bureau of Statistics, are being consulted and are trialling some of the material.

Training researchers in the use of confidential data is common across statistical agencies around the world, particularly when those researchers need access to the most sensitive data only available through Controlled Access Facilities (CAFs). The growth in CAFs in recent years has mostly come from virtual desktops which allow researchers to run unlimited analyses while still operating in an environment controlled by the data holder. There are now six of these in the UK, and many countries in continental Europe, North America and Oceania operate at least one. The existence of CAFs has led to an explosion in social science research as many things that were not previously allowed because it was too risky to send out data (such as use of non-public business data, or detailed personal data) have now become feasible and cost-effective.

All agencies running CAFs provide some training for researchers; around half of these use ‘passive’ training such as handouts or web pages, but the other half require face-to-face training. Much of this training has evolved from a programme developed at ONS in the UK in the 2000s and this training was recommended as an example of ‘best practice’ for face-to-face training by a Eurostat expert group.

However, this style of training is showing its age. Such training typically has two components: firstly how to behave in the CAFs and secondly how to prevent confidential data from mistakenly showing up in research outputs (‘statistical disclosure control’, or SDC). Both are typically taught mechanistically, in the form of dos and don’ts, explanations of laws and penalties and lots of SDC exercises. Overall the aim of the courses is to impart information to the researcher.

The new training is radically different from the old training. It starts from the premise that researchers are both the biggest risk and the biggest advantage to any CAF: the biggest risk because a poorly trained or malcontented researcher can negate any security mechanism put in place; the biggest advantage because highly motivated researchers mean cheaper system design, better and more robust security and the chance for the data holder to exploit the goodwill of researchers in methodological research, for example.

In this world the main aim of the training is to encourage the researcher to see himself or herself as part of the data community. If this can be established then the rest of the training follows as a consequence. For example, knowledge of the legal environment or SDC is shared not because it keeps you out of jail but because everyone needs to understand this so the community as a whole works. This gives the course quite a different feel to more traditional courses: much of the day is spent in open-ended facilitated discussions exploring concepts of data access.

The training was designed from the ground up in order to take advantage of recent developments in thinking about data access and SDC. This was also done to avoid being restricted by having to ‘fit’ preconceived ideas about what worked or not; material was included on its own merits, not because “this was what we used to do…”. For example, the previous SDC component had a large number of numerical examples, developed over many years, leading to attendees remarking on afternoons spent “doing Sudoku”. We reviewed every example to identify the minimum set of principles needing to be explored and then wrote a small number of new examples based on this minimum set. On the other hand, the previous training had relatively little to say about the context for checking outputs for confidentiality breaches; this has now been expanded as it fits with the ethos of understanding why things are done.

Of course, this was not all plain sailing. The original structure, trialled in June 2017, had just one presentation before being comprehensively abandoned. Modules have dropped in and out and been moved around. The initial test for the course has been completely rewritten (a topic for a later blog). Various sections have been inserted as ‘options’ to take account of regional variations in operating practices. Throughout this, multiple organisations have been able to feed into the process so that the final product itself has a sense of community ownership.

We are now at the stage of training the trainers to enable independent delivery around the UK. This is already generating much feedback for the future development of the course: for example, a need has arisen for ‘crib sheets’ to help in the facilitation of certain exercises. Overall, however, we are confident that we have a well-structured, informative course that meets the needs of 21st-century data training.

Further reading: for more information on the evidential and conceptual basis for the course, see Ritchie F., Green E., Newman J. and Parker T. (2017) “Lessons Learned in Training ‘Safe Users’ of Confidential Data”. UNECE work session on Statistical Data Confidentiality 2017. Eurostat.
