Elizabeth Green1, Felix Ritchie1, Libby Bishop2, Deborah Wiltshire2, Simon Parker3, Allyson Flaster4 and Maggie Levenstein4
1The University of the West of England, 2GESIS, 3DKFZ German Cancer Research Center, 4University of Michigan
When carrying out research with confidential quantitative data, there is much support for researchers. There is ample advice on best practice for collecting data (remove identifiers as soon as possible, only collect statistically useful information), a vast literature on how to reduce the risk in microdata (swapping, top coding, local suppression, rounding, perturbation, …), and a small but effective literature on how to protect statistical outputs from residual disclosure risk (eg a combination of tables showing that a one-legged miner in Bristol earns £50k a year).
For qualitative data, there is much less guidance. At the collecting/storing stage there is clear good practice (such as removing direct identifiers), although it may be hard to separate analytically vital information from contextual information. In particular, the trade-off between anonymization and fitness for use may be much sharper for qualitative than for quantitative data. Improvements in natural language processing (NLP) have enabled the development of anonymization tools in the UK and in Germany (e.g., QualiAnon) for qualitative data. However, when producing analyses there appears to be little or no general guidance on output disclosure control (ODC), and researchers have to rely on informal advice and rules of thumb for good practice. This challenge is exacerbated by the wide variety of genres of qualitative data, which makes guidance difficult to generalise.
Why the lack of practical guidance for output checking of qualitative data when there is a well-established set of guidelines for quantitative data? From one perspective, the lack of guidelines is not surprising. Guidelines for quantitative data were almost exclusively developed to meet the needs of national statistics institutes (NSIs), and thence filtered down to trusted research environments (TREs, secure research facilities usually specialising in quantitative data). Outside of NSIs and TREs, knowledge of output disclosure control is very limited, not even making it onto the syllabi of research methods courses. In this context, perhaps it is not surprising that there are no guidelines for qualitative data: guidelines for quantitative research only appeared because of economies of scale and scope, and have remained largely in the environment in which they were developed.
The need for qualitative data ODC guidelines has four drivers. First, there is a greater awareness of the need to maintain confidentiality, driven by legislation and regulation. Every journal has to trust that researchers have anonymized enough but not too much, yet no metrics exist for assessing this. Second, the lack of consistent guidelines means each generation of researchers must develop their own rules, which is inefficient and increases the likelihood of error. Third, the increased use of NLP tools has increased the number and types of researchers who are working with qualitative data. Finally, the development of TREs offers great opportunities for very detailed, unredacted qualitative data to be shared easily whilst maintaining security; but this must be supported by clear disclosure guidance for outputs. Whilst most TREs have a policy of checking outputs for residual risk, it is not clear whether the skills, resources and processes exist to do this for qualitative data.
UKDA provides some guidelines and also a tool for anonymising qualitative outputs; however, the approach focuses on removal of direct identifiers and does not address nuance or contextual identifiers. The CESSDA Data Management Expert Guide provides a worked example of transcript anonymisation.
Kaiser (2009) outlines deductive disclosure in which the contextual nuance of a situation allows for an individual to be identified. Kaiser suggests that in order to address this issue, researchers should discuss the use of the research with participants, including describing in the consent to participate how the data will be made available to researchers, the types of research that are permissible, and the protections that will be in place for both the original research team and secondary analysts. However, this may not be possible with some types of data, and this also assumes that the discussion leads to genuinely informed consent rather than meeting a procedural tickbox.
ODC of quantitative data is conceptually straightforward. While quantitative data may be very highly structured (eg multilevel multiperiod data on GPs, patients and hospitals) or highly unstructured (eg quantitative textual analysis), all quantitative data can be seen, ultimately, as tables of numbers used to produce summary data. The same ODC rules can be applied in all cases.
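As a rough illustration of why quantitative ODC rules transfer across outputs, a minimum cell count (threshold) rule can be written once and applied to any frequency table. The sketch below is only that: the function name `suppress_small_cells`, the threshold of 10 and the sample records are assumptions made for illustration, not rules taken from any particular NSI or TRE, and the sketch does not deal with disclosure by differencing across several tables.

```python
from collections import Counter


def suppress_small_cells(records, keys, threshold=10):
    """Tabulate records by `keys` and suppress cells below `threshold`.

    Hypothetical helper: the threshold value is an illustrative assumption.
    Suppressed cells are returned as None rather than as a count.
    """
    # Count how many records fall into each combination of key values.
    counts = Counter(tuple(record[k] for k in keys) for record in records)
    # Keep counts at or above the threshold; hide the rest.
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


# Made-up sample data: one rare combination and one common combination.
records = (
    [{"occupation": "miner", "city": "Bristol"}] * 1
    + [{"occupation": "teacher", "city": "Bristol"}] * 12
)

table = suppress_small_cells(records, keys=("occupation", "city"), threshold=10)
for cell, count in table.items():
    print(cell, "suppressed" if count is None else count)
```

No comparably mechanical check is obvious for a quotation from an interview transcript, where the identifying detail may be contextual rather than countable.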
In contrast, qualitative data are varied in both content and structure; examples could be:
- Interview recordings/transcripts
- Written responses in surveys
- Psychiatric case studies
- Videos and images
- Ethnographic studies
- Court records
- Social media text
In each of these cases, protecting confidentiality may require different solutions. In psychiatric case studies information may remain identifiable when published, but informed consent is used to agree to the higher level of re-identification risk. In interview responses, redaction may be a very effective response; in videos, pixelation. In court records and social media, the semi-public nature of the source data may cause difficulties, particularly around de-identification. Future technology is likely to throw up more options, such as digital behaviour data, or currently unimaginable data types.
Approaches to solutions
Given the range of qualitative data types, it seems unlikely that universal rules could be developed. However, there may be ways to develop general solutions (frameworks?), such as:
- Method-specific solutions
- Data type-specific solutions
There may also be solutions which involve both input controls (eg consent) and output methods (eg redaction). This may allow us to sidestep the question of what is permissible, but it does not address what is ethical to disclose.
If considering types of qualitative data output, it may be useful to consider where the value is generated. For example, in a video recording, nuances of the subject’s body language may be more important than the words; if so, redaction of text is possible, but pixelation isn’t. Understanding the research value may direct outputs of the same type towards different solutions. Thus it is important to distinguish between disclosure of “raw” data that researchers interact with and disclosure of outputs that are available to the general public in an unregulated environment. It is also important to focus on fitness for use, as certain kinds of disclosure control may degrade data quality in ways that do not affect its use for some types of analysis, while making the data essentially worthless for other types of analysis.
Guidelines should address the different types of data, the accessibility of the data, and the intended use of the data in order to develop broader organising principles.
There may also be a need to create some definitions to provide a common language for discussing risks and solutions.
Given the lack of consensus as to how to ensure safe outputs from qualitative analysis, the most productive first step may be a webinar with a credible global audience of interested parties to explore some of the issues raised here. Ideally, the audience would include both researchers working with a variety of qualitative data types and some data protection and confidentiality specialists.
The aim of the webinar would be to develop a programme of work. The initial step in this programme could be the formation of study groups, each focused on a particular type of data. These groups could then report back during a second workshop. In addition, the workshop could consider how to share guidelines amongst the research community, for example by embedding good practice into Research Methods courses or Data Management Plans, with the aim of avoiding the quantitative route of concentrating disclosure control training in limited environments. It may also be helpful to explore funding opportunities if this looks to be a significant programme of work.
The workshop will take place on Friday 10th December, 15:00-17:30, via Microsoft Teams.
To register for the event, please click here
For any questions or queries, please contact Lizzie Green (email@example.com).