flowchart LR
  A(A \n Detect PII) --> B( B \n Mask PII)
  B --> C( C \n Assess disclosure\n risk and utility)
  C --> D( D \n Release sanitized text)
1 Why is anonymising text so difficult?
The goal of text anonymisation, also called text de-identification or text sanitisation, is to edit a text document so as to conceal the identity of the person(s) who may be mentioned or referred to in it. This is typically done in a series of steps:
Detect PII: Search the text for occurrences of Personally Identifiable Information (PII) such as person names, addresses or other information that may result in a confidentiality disclosure. Occurrences may be single words, phrases or even passages.
Mask PII: Problematic PII must be either masked (removed) or replaced by alternatives in the text.
Assess disclosure risk and utility: Check whether the edits you have made really reduce the disclosure risk to a satisfactory level, without distorting the semantic content of the original document. If not, rinse and repeat.
Release sanitized text: If disclosure risk and utility are deemed satisfactory, release the de-identified text.
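The four steps above can be sketched in a few lines of Python. This is only an illustrative toy, not the method developed in later chapters: the two regular expressions, the function names (`detect_pii`, `mask_pii`, `assess`, `sanitise`) and the crude risk and utility proxies are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny pattern set; real detectors are far broader.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def detect_pii(text):
    """Step A: return sorted (start, end, label) spans for every pattern match."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def mask_pii(text, spans):
    """Step B: replace each detected span with a placeholder such as [EMAIL]."""
    pieces = []
    cursor = 0
    for start, end, label in spans:
        if start < cursor:  # skip spans that overlap one already masked
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def assess(original, masked):
    """Step C: crude proxies (assumptions, not established metrics).

    Risk    = number of spans the detector still finds in the masked text.
    Utility = share of original whitespace-separated tokens that survive.
    """
    residual = len(detect_pii(masked))
    masked_tokens = set(masked.split())
    original_tokens = original.split()
    kept = sum(token in masked_tokens for token in original_tokens)
    utility = kept / len(original_tokens) if original_tokens else 1.0
    return residual, utility


def sanitise(text, max_rounds=3):
    """Steps A to D: detect, mask, assess, and repeat until nothing is detected."""
    masked = text
    for _ in range(max_rounds):
        masked = mask_pii(masked, detect_pii(masked))
        residual, utility = assess(text, masked)
        if residual == 0:
            break
    return masked, residual, utility


masked, residual, utility = sanitise(
    "Contact Anna at anna.berg@example.org or +47 22 33 44 55."
)
print(masked)  # Contact Anna at [EMAIL] or [PHONE].
```

Note that the name "Anna" survives even though the residual count is zero: a detector can only assess the risk it is able to see, which is one reason why the assessment step deserves separate attention.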
From a bird's-eye perspective, the steps in this pipeline are common to most anonymisation tasks, and many well-established concepts and features, such as direct and indirect identifiers or the privacy-utility trade-off, are equally at play in a text de-identification pipeline. That being said, text comes with its own set of unique challenges, some of which are described below:
Text is a type of unstructured data, and may convey personal information in a myriad of ways, without following a predefined format.
Many words are ambiguous - that is, their meaning may vary depending on the context of their occurrence. This is why we cannot simply de-identify a text by creating a large vocabulary of words/phrases to remove.
Text documents often include various “cues” that provide indirect information about one or more individuals mentioned in the document.
Standard anonymisation techniques for tabular data, such as aggregation, are difficult to apply to text data. The standard way to de-identify texts is through masking.
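To make the ambiguity and indirect-cue challenges concrete, the following sketch applies a naive vocabulary-based masker to three sentences. The blocklist of first names, the function name `blocklist_mask` and the example sentences are all invented for illustration.

```python
import re

# Hypothetical vocabulary of first names to remove.
BLOCKLIST = {"jordan", "may", "rose"}


def blocklist_mask(text, blocklist=BLOCKLIST):
    """Mask every word whose lower-cased form appears in the blocklist."""
    def replace(match):
        word = match.group(0)
        return "***" if word.lower() in blocklist else word
    return re.sub(r"\w+", replace, text)


# Works as intended: all three masked words are person names.
print(blocklist_mask("Dr. Rose examined May Jordan."))
# Dr. *** examined *** ***.

# Over-masks: a verb, a month, a modal verb and a country are removed.
print(blocklist_mask("Prices rose in May, and you may visit Jordan."))
# Prices *** in ***, and you *** visit ***.

# Under-masks: no listed word occurs, yet the description may single out one person.
print(blocklist_mask("Her neighbour is the only cardiologist in the village."))
# Her neighbour is the only cardiologist in the village.
```

The second sentence loses utility without any privacy gain, while the third is left untouched despite its indirect cue. Neither failure can be repaired by extending the word list, because the decision depends on context rather than on the word itself.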
This guide aims to walk you through the steps of this anonymisation pipeline as it applies to text. In particular, we will provide practical guidance on how you can detect PII (Chapter 2), mask PII (Chapter 4) and assess disclosure risk and utility (Chapter 5). We will conclude with a discussion of how this guidance relates to the legal concept of anonymity (Chapter 6) in the European General Data Protection Regulation (GDPR).