1  Why is anonymising text so difficult?

The goal of text anonymisation, also called text de-identification or text sanitisation, is to edit a text document so as to conceal the identity of the person(s) who may be mentioned or referred to in it. This is typically done in a series of steps:

    flowchart LR
        A("A\nDetect PII") --> B("B\nMask PII")
        B --> C("C\nAssess disclosure\nrisk and utility")
        C --> D("D\nRelease sanitized text")

Figure 1.1: Common steps in the de-identification pipeline.
  1. Detect PII: Search the text for occurrences of Personally Identifiable Information (PII) such as person names, addresses or other information that may result in confidentiality disclosure. Occurrences may be single words, phrases or even entire passages.

  2. Mask PII: Problematic PII must be either masked (removed) or replaced by alternatives in the text.

  3. Assess disclosure risk and utility: Check whether the edits you have made really reduce the disclosure risk to a satisfactory level, without distorting the semantic content of the original document. If not, rinse and repeat.

  4. Release sanitized text: If disclosure risk and utility are deemed satisfactory, release the de-identified text.
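To make the pipeline concrete, here is a minimal sketch in Python. It is not a real anonymisation system: the detectors are deliberately naive regular expressions (Chapter 2 discusses proper PII detection), the risk assessment is a mere placeholder, and all function names are illustrative choices of ours.

    import re

    # Deliberately naive detectors, for illustration only: a real system
    # would rely on trained models and domain knowledge (see Chapter 2).
    DATE_PATTERN = re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b")      # e.g. 30.07.1975
    NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")  # e.g. Peter Higgs

    def detect_pii(text):
        """Step 1: return the (start, end) spans of suspected PII."""
        return sorted(m.span() for pattern in (DATE_PATTERN, NAME_PATTERN)
                      for m in pattern.finditer(text))

    def mask_pii(text, spans):
        """Step 2: replace each detected span by a neutral placeholder."""
        for start, end in reversed(spans):  # right to left keeps offsets valid
            text = text[:start] + "***" + text[end:]
        return text

    def risk_acceptable(text):
        """Step 3 (placeholder): real risk/utility assessment is much harder
        (see Chapter 5); here we only check that no detector still fires."""
        return not detect_pii(text)

    def sanitise(text):
        """Step 4: release the masked text only if the assessment passes."""
        masked = mask_pii(text, detect_pii(text))
        return masked if risk_acceptable(masked) else None

    print(sanitise("Peter Higgs was born on 30.07.1975."))
    # -> *** was born on ***.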

From a bird's-eye perspective, the steps in this pipeline are common to most anonymisation tasks, and many well-established concepts, such as direct and indirect identifiers or the privacy-utility trade-off, are equally at play in a text de-identification pipeline. That being said, text comes with its own set of unique challenges, some of which are described below:

Text appears in many forms, shapes and sizes. Text documents may sometimes follow formatting guidelines, but, in general, text data is anything that can be encoded as a sequence of characters. Text is thus a type of unstructured data, along with e.g. images, audio or video recordings.

Unstructured data can be contrasted with structured data formats such as tables, where each row is defined by a fixed, predefined set of columns, each column corresponding to a specific piece of information and allowing a specific range of possible values. Let us take a particular example:

Table 1.1: Table of example data

| Name              | Date of birth | Gender | Nationality | Vaccination Status |
|-------------------|---------------|--------|-------------|--------------------|
| Peter Higgs       | 30.07.1975    | Male   | British     | 2 shots            |
| Andreas Sauner    | 02.10.1981    | Male   | German      | No shot            |
| Laurence Barrière | 03.10.1957    | Female | French      | 1st shot           |

Table 1.1 has five columns, each associated with a specific set of possible values. For instance, Date of birth must be a valid date in a particular range, and Nationality must be drawn from a predefined list of possible nationalities.

Tabular data is comparatively “easy” from a privacy point of view, as one can inspect each column and determine the extent to which they relate to (direct or indirect) personal identifiers and/or include confidential information.
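To illustrate, the sketch below encodes such a schema in code: each column carries a set of legal values and a privacy classification assigned up front. The schema, the value sets and the classification labels are illustrative assumptions of ours, not a standard.

    from datetime import date

    # Illustrative schema for Table 1.1: every column has a known range of
    # legal values and a privacy classification we can assign in advance.
    SCHEMA = {
        "Name":               {"kind": "direct identifier",
                               "valid": lambda v: isinstance(v, str)},
        "Date of birth":      {"kind": "indirect identifier",
                               "valid": lambda v: date(1900, 1, 1) <= v <= date.today()},
        "Gender":             {"kind": "indirect identifier",
                               "valid": lambda v: v in {"Male", "Female", "Other"}},
        "Nationality":        {"kind": "indirect identifier",
                               "valid": lambda v: v in {"British", "German", "French"}},  # abbreviated list
        "Vaccination Status": {"kind": "confidential attribute",
                               "valid": lambda v: v in {"No shot", "1st shot", "2 shots"}},
    }

    record = {"Name": "Peter Higgs", "Date of birth": date(1975, 7, 30),
              "Gender": "Male", "Nationality": "British",
              "Vaccination Status": "2 shots"}

    # With a fixed schema, deciding what to protect is a simple per-column
    # lookup -- something free-form text never gives us.
    for column, value in record.items():
        spec = SCHEMA[column]
        print(f"{column}: {spec['kind']} (valid value: {spec['valid'](value)})")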

Let us now compare the above table to a free-form text representation of the same information:

Peter Higgs, born on July 30, 1975, is a UK national and has already received 2 shots of the vaccine, while his German colleague Andreas Sauner, who will celebrate his 40th birthday on October 2, did not yet receive any shot. Meanwhile, their common acquaintance Laurence Barrière recently got her first vaccine shot. Mrs. Barrière is French and will turn 64 years old on October 3.

Although the above text expresses the same information as the table, much of the data's internal structure (such as the name and values of each attribute) is now implicit. The text also illustrates the multiple ways in which attributes such as gender, age or nationality can be expressed.

While a structured database typically contains one record per individual, a text document may simultaneously express personal information about multiple individuals and their relations to one another. (Indeed, the text indicates something that the table does not, namely that the three individuals know each other: two are colleagues, and both are acquainted with the third.)
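One way to recover some of this implicit structure from free text is with an off-the-shelf named entity recogniser. The sketch below uses spaCy, assuming its en_core_web_sm model has been downloaded; the exact entities found will vary with the model version, so the commented output is indicative only.

    # Assumes: pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Peter Higgs, born on July 30, 1975, is a UK national and has "
              "already received 2 shots of the vaccine.")

    for ent in doc.ents:
        print(ent.text, "->", ent.label_)
    # Typically finds entities such as:
    #   Peter Higgs   -> PERSON
    #   July 30, 1975 -> DATE
    #   UK            -> GPE
    # Note what such labels do not give us: nothing marks "2 shots of the
    # vaccine" as confidential medical information about this person.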

Text is a type of unstructured data, and may convey personal information in a myriad of ways, without following a predefined format.

A given word or phrase may have multiple meanings, which can often only be inferred from the context in which it appears. The word “Apple” may refer to a fruit or to an IT company, and in the phrase “the apple of my eye”, it refers to a cherished person: in each case, the meaning the word carries must be inferred from context.

Robust anonymisation of text therefore needs to take contextual factors into consideration when deciding which parts of the text may contribute to the risk of disclosing personal information.

For instance, many Norwegian first names and surnames also happen to be common Norwegian nouns, like ‘Dag’, ‘Bjørn’, ‘Liv’, ‘Stein’, ‘Strand’ or ‘Ulv’. One therefore needs to look at the context to decide whether ‘Dag’ should be interpreted as the first name of a person, or simply as the Norwegian word for ‘day’.
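This context-dependence can be observed directly with an NER model. The sketch below (again assuming spaCy's en_core_web_sm model; outputs vary across versions) tags the same string differently in two contexts:

    import spacy

    nlp = spacy.load("en_core_web_sm")

    for sentence in ("Apple announced a new phone yesterday.",
                     "She ate an apple with her lunch."):
        entities = [(ent.text, ent.label_) for ent in nlp(sentence).ents]
        print(sentence, "->", entities)
    # In the first sentence, "Apple" is typically tagged as an organisation
    # (ORG); in the second, "apple" is not an entity at all. A fixed
    # blocklist of words to remove could never make this distinction.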

Many words are ambiguous - that is, their meaning may vary depending on the context of their occurrence. This is why we cannot simply de-identify a text by creating a large vocabulary of words/phrases to remove.

Text documents are often full of indirect/implicit “cues” that can disclose information about the identity of various individuals. Those cues may take various forms. For instance, in the text above, the gender of the last two individuals is not mentioned explicitly but is conveyed through the possessive adjectives ‘his’ and ‘her’. Personal information may also be acquired through logical inference: for instance, if we were to write that Peter Higgs was from England, we could derive that his nationality is British.
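Such inferences can even be mechanised. As a toy example, the date of birth of Laurence Barrière follows from two indirect cues in the example text, assuming we know the text was written in 2021 (the writing year is our assumption, not stated in the text):

    from datetime import date

    # The text never states a date of birth, but "will turn 64 years old on
    # October 3" pins it down once we assume the text was written in 2021.
    year_written = 2021
    age_to_turn, month, day = 64, 10, 3

    inferred_dob = date(year_written - age_to_turn, month, day)
    print(inferred_dob)  # 1957-10-03, i.e. the 03.10.1957 listed in Table 1.1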

We sometimes also wish to conceal the identity of the author of a given text. This is an important issue when working with user-generated content (such as texts crawled from social media and other online sources). In this case, simply masking text spans that express PII is not sufficient: the writing style of the author may reveal their identity even in the absence of any explicitly mentioned personal identifier. As the main focus of this guide is on the detection of PII in text, we will mostly ignore the problem of author obfuscation, but provide some references in the final section.

Text documents often include various “cues” that provide indirect information about one or more individuals mentioned in the document.

Many techniques have been developed to ensure the confidentiality of tabular data, such as generalisation, suppression, perturbation or aggregation. Aggregation, in particular, is commonly employed to bundle individuals into larger groups, thus limiting the disclosure of personal information about specific individuals. None of these techniques can be applied to text in a straightforward fashion. As text is unstructured, two documents will rarely follow the exact same structure, and aggregation is therefore seldom a feasible option.

When dealing with text data, we therefore tend to rely on various masking techniques, where individual words or phrases are redacted or replaced by alternatives. Such operations, which we cover in section 6, may sometimes alter the semantic content of the sentence or phrase.
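The sketch below illustrates three common masking operations on a single sentence: suppression, pseudonymisation and generalisation. The implementations are toy versions of our own, and the detected spans are assumed to be given.

    import re

    def replace_span(text, span, replacement):
        """Replace the character range `span` with `replacement`."""
        start, end = span
        return text[:start] + replacement + text[end:]

    sentence = "Peter Higgs was born on 30.07.1975."
    name_span = re.search(r"Peter Higgs", sentence).span()
    date_span = re.search(r"on \d{2}\.\d{2}\.\d{4}", sentence).span()

    # Suppression: remove the span outright.
    print(replace_span(sentence, date_span, "[REDACTED]"))
    # -> Peter Higgs was born [REDACTED].

    # Pseudonymisation: swap the identifier for a surrogate.
    print(replace_span(sentence, name_span, "Person A"))
    # -> Person A was born on 30.07.1975.

    # Generalisation: keep coarser, less identifying information.
    year = int(re.search(r"\d{2}\.\d{2}\.(\d{4})", sentence).group(1))
    print(replace_span(sentence, date_span, f"in the {year // 10 * 10}s"))
    # -> Peter Higgs was born in the 1970s.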

Standard anonymisation techniques for tabular data, such as aggregation, are difficult to apply to text data. The standard way to de-identify texts is through masking.

Anonymisation refers to the process of removing all pieces of information that may, directly or indirectly, identify an individual. To meet legal requirements around anonymisation, specifically those pertaining to the GDPR, this process must be both complete and irreversible, such that it is, for all practical purposes, impossible to revert the process and recover the information that was concealed. This is a rather stringent requirement. Is it possible to achieve for text documents?

As discussed in the chapter on legal aspects (Chapter 6), whether “GDPR-compliant” text anonymisation is achievable in practice is currently an unresolved matter of scholarly debate.

This guide aims to walk you through the steps of this anonymisation pipeline as it applies to text. In particular, we will provide practical guidance on how you can detect PII (Chapter 2), mask PII (Chapter 4) and assess disclosure risk and utility (Chapter 5). We will conclude with a discussion of how this guidance relates to the legal concept of anonymity (Chapter 6) in the European General Data Protection Regulation (GDPR).