6 Can text ever be truly anonymized legally?
Anonymisation refers to the process of removing from a dataset all pieces of information that may, directly or indirectly, identify a human individual. For the data to be considered anonymous, this process must be both complete and irreversible (that is, it must be impossible to revert the process and recover the information that was concealed). This is a rather stringent requirement. Is it possible to achieve for text documents?
This section will first review the legal requirements around anonymisation, looking more specifically at the General Data Protection Regulation (GDPR) introduced in Europe in 2016. We will then see how those can be applied to text data, and illustrate it with a concrete use case.
6.1 GDPR and the concept of “anonymous data”
The GDPR applies to the processing of personal data. The notion of personal data has wide scope: any information that relates to an identifiable person is considered ‘personal’, regardless of its perceived or actual “sensitivity”. If the information can be linked to an individual, it is deemed personal and its processing falls within the scope of the regulation. Conversely, if it is not deemed as such, the GDPR does not apply.
Any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person. (GDPR, Article 4(1))
This distinction between personal and non-personal data is therefore of critical importance to understand: the data may be altered, stripped and reduced, but as long as it is deemed personal, the GDPR applies. Importantly, pseudonymous data, altered in such a way that personal data can no longer be attributed to a specific data subject without the use of external, additional information, is still considered personal, precisely because such ‘additional information’ can lead to re-identification and information leakage.
Anonymisation, therefore, is the process by which personal data is rendered ‘non-personal’ such that it falls outside the scope of the GDPR. That anonymous data has no explicit definition in the GDPR may seem surprising, but is likely a feature rather than a bug, one that ensures legal longevity in the face of rapid technological developments. The GDPR concerns itself with personal data, widely defined. Anonymous data can therefore only be understood as its antithesis, and the processor (namely you!) must find ways to ensure your data is out of scope.
Note that anonymisation inevitably implies that ‘personal data’ is being processed. As such, the anonymisation process itself falls under the scope of the GDPR, and all steps in Fig. 1.1 must adhere to the requirements therein. This includes (but is not limited to) the requirement that a legal basis for the anonymisation exists.
6.2 Anonymisation requirements
So, how can we ensure that data is no longer personal? We have seen that personal data is a core notion of the GDPR, and that its defining feature is that the data subject is identified or in principle identifiable. Determining identifiability is therefore a key test of anonymity.
To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
Recital 26 stipulates a test of reasonable likelihood of identification, one that duly considers technological developments and the relative effort that an attacker must invest to achieve re-identification.
At the time of writing, explicit guidance on how this likelihood should be assessed is pending. Our clues must therefore be drawn from guidance that predates the GDPR, where the Article 29 Working Party (WP) has issued conflicting positions which continue to be a source of scholarly debate.
The Article 29 Working Party may be viewed as the EDPB’s predecessor.
To date, WP 216 provides the most explicit criteria against which data anonymity, including text anonymity, can be assessed. We will now see how these can be applied to text data.
6.3 Linkage attacks
A text consists of a number of unique words, in unique sequences. Even if revealing items in the text are redacted or replaced with alternatives, the resulting text will likely bear a strong resemblance to the original document. This increases the risk of so-called ‘linkage attacks’, whereby the attacker is able to link the sanitised document back to its source and hence extract personal information.
Linking a sanitised document back to its source text is rather straightforward. The easiest way, demonstrated in (Weitzenboeck et al. 2022), is to automatically search for words or phrases that occur both in the sanitised document and in the set of possible target documents. One can then link the sanitised text to the document that has the largest proportion of common words or phrases. We show how to achieve this in a concrete use case below.
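The overlap-based linkage just described can be sketched in a few lines. The toy documents and function name below are ours, for illustration only; they are not drawn from the actual corpus or the original study.

```python
def link_sanitised_document(sanitised: str, candidates: list) -> int:
    """Return the index of the candidate document sharing the largest
    proportion of the sanitised document's words (a crude proxy for
    phrase overlap)."""
    sanitised_words = set(sanitised.lower().split())
    best_index, best_score = -1, -1.0
    for i, candidate in enumerate(candidates):
        candidate_words = set(candidate.lower().split())
        # Proportion of the sanitised document's words found in the candidate
        score = len(sanitised_words & candidate_words) / max(len(sanitised_words), 1)
        if score > best_score:
            best_index, best_score = i, score
    return best_index

corpus = [
    "The applicant was born in 1955 and lives in Bridgend.",
    "The defendant appealed the ruling on procedural grounds.",
]
sanitised = "The applicant was born in -xxx- and lives in -xxx- ."
print(link_sanitised_document(sanitised, corpus))  # links back to document 0
```

Even with the identifiers masked, the residual wording suffices to pick out the source document; phrase-level (rather than word-level) matching only makes the linkage more reliable.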
Of course, those attacks are only possible if the (real or imagined) attacker can be expected to have access to the original set of documents, that is, the raw documents before the sanitisation process took place. In the vast majority of cases, this is a very unlikely scenario, as those original documents (for instance, patient records stored in the IT database of a local hospital) are only available to the data owner.
Nevertheless, the mere possibility of such linkage attacks means the non-linkability requirement stipulated by WP 216 is not satisfied. Unless, of course, one is willing to permanently delete the original data, such that the linkage is no longer feasible. This is, however, a rather radical approach, which is rarely possible in practice and often prohibited by other legal provisions[^ For instance, this would mean that, in order to anonymise a patient record stored at a hospital to provide it to third parties (like medical researchers), one would need to erase the original patient record from the IT database of the hospital.].
6.4 What should you do?
Given the current state of affairs, the safest approach is to consider that text data cannot be anonymised (unless the original data is deleted) and will always remain personal data, even after the use of de-identification measures. This means that even after following the de-identification measures described herein, the continued processing of de-identified text data still requires a properly defined legal basis, such as consent, contract, legal obligation, vital interests, public task or legitimate interests.
It should, however, be noted that the “absolutist” interpretation of the anonymisation criteria that follows from WP 216 is not the only possible one. (Weitzenboeck et al. 2022) argues that a risk-based approach offers a more nuanced test consistent with GDPR’s purposes. However, in the absence of clear guidelines on this issue by the European Data Protection Board (EDPB), it remains unclear which interpretation to adopt.
Even after removing personal identifiers from a text, it often remains possible to link the sanitised text back to its original document, for instance by searching for common words or phrases. Given the current interpretation of the GDPR’s requirements, this means that de-identified texts can never be considered fully anonymous. However, de-identification does substantially reduce the disclosure risk, and helps us adhere to the data minimisation principle.
This does not mean, however, that text de-identification has no benefit. Indeed, data minimisation is one of the core principles of the GDPR and of most privacy regulations (Goldsteen et al. 2022). The principle of data minimisation states that one should only collect and retain the personal data that is strictly necessary to fulfil a given purpose. By masking the personal identifiers occurring in the text, text de-identification can drastically reduce the risk of disclosing personal information, and therefore help the data owner adhere to this principle.
6.5 Case study: Anonymisation of court cases
Let us illustrate the problem with a case study (initially published by Weitzenboeck et al. (2022)) related to the anonymisation of 13 759 court cases from the European Court of Human Rights (ECHR).
The cases include detailed, plain text information about various individuals, such as:
- name
- date of birth
- criminal record
- family status
- etc.
The information pertains both to the applicants and to other parties involved, such as witnesses, lawyers, judges and government agents. An example is provided in Tab. 6.1 below:
Tab. 6.1: Extract from the original text of an ECHR court case.

Line | Text |
---|---|
1 | The applicant [Mr Colin Joseph O’Brien] was born in 1955 and lives in Bridgend. |
2 | His wife died on 29 April 1999 leaving two children, born in 1989 and 1991. |
3 | In 1999 the applicant enquired about widows’ benefits and he was informed that he was not entitled to such benefits. |
4 | In early 2000 the applicant applied for widows’ benefits again and on 13 March 2000 the Benefits Agency rejected his claim. |
5 | He lodged an appeal against this decision on 16 March 2000 and this appeal was struck out on 23 May 2000 on the basis that it was misconceived. |
6 | On 16 May 2000 the applicant made an oral claim for Widow’s Bereavement Allowance to the Inland Revenue. On 23 May 2000 he was informed that his claim could not be accepted because there was no basis in domestic law allowing widowers to claim this benefit. The applicant was advised that an appeal against this decision would be bound to fail. |
7 | The applicant received child benefit in the sum of GBP 100 per month. |
Tab. 6.2: The same extract after masking direct and indirect identifiers.

Line | Text |
---|---|
1 | The applicant [-xxx- ] was born in -xxx- and lives in -xxx- . |
2 | His wife died on -xxx- leaving -xxx- children, born in -xxx- . |
3 | In -xxx- the applicant enquired about widows’ benefits and he was informed that he was not entitled to such benefits. |
4 | In -xxx- the applicant applied for widows’ benefits again and on -xxx- the -xxx- rejected his claim. |
5 | He lodged an appeal against this decision on -xxx- and this appeal was struck out on -xxx- on the basis that it was misconceived. |
6 | On -xxx- the applicant made an oral claim for Widow’s Bereavement Allowance to the Inland Revenue. On -xxx- he was informed that his claim could not be accepted because there was no basis in domestic law allowing widowers to claim this benefit. The applicant was advised that an appeal against this decision would be bound to fail. |
7 | The applicant received child benefit in the sum of -xxx- per month. |
Tab. 6.3: The same extract after additionally removing all words that could link the text back to its source.

Line | Text |
---|---|
1 | The applicant [-xxx- ] was born in -xxx- and lives -xxx- -xxx- . |
2 | -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- two -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- . |
3 | In -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- was -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- . |
4 | In -xxx- the applicant -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- the -xxx- -xxx- his -xxx- -xxx- . |
5 | -xxx- -xxx- an -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- the -xxx- that it was -xxx- -xxx- . |
6 | -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- for -xxx- -xxx- -xxx- -xxx- the -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- could -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- in -xxx- law -xxx- -xxx- to -xxx- this -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- this -xxx- -xxx- -xxx- -xxx- to -xxx- -xxx- . |
7 | The -xxx- -xxx- -xxx- -xxx- in the -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- . |
6.5.1 De-identification to reduce risk of singling-out and inference
In order to reduce the risk of “singling-out” and identifying individuals in this record, we first attempt to remove all direct and potentially indirect identifiers, such as:
- names of individuals
- names of organisations
- places and geographical locations
- date and time indicators
- demographic attributes (age, gender, ethnicity, etc.)
- quantities (monetary values, number of convictions, etc.)
- codes (application number, phone number, etc.)
The result is seen in Tab. 6.2. The only residual information we may infer from the text is that it relates to:
- a male person
- a widower
- a father of more than one child
- likely a British national (given the references to British public agencies)
- who has been denied widows’ benefits at an undisclosed point in time
In the absence of additional information, we may reasonably assume that this information alone is insufficient to “single out” the identity of this person.
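The masking illustrated in Tab. 6.2 can be sketched with simple pattern matching. This is a toy illustration only: in practice, de-identification relies on trained named-entity recognisers, and the hand-written patterns below cover just a few easy cases (full dates, monetary amounts, years) to show the mechanics.

```python
import re

# Hypothetical identifier patterns, for illustration only. Order matters:
# full dates must be masked before bare years, or "29 April 1999" would
# be only partially masked.
PATTERNS = [
    r"\b\d{1,2} (January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b",  # full dates
    r"\bGBP \d+\b",                                   # monetary amounts
    r"\b(19|20)\d{2}\b",                              # years
]

def mask_identifiers(text: str, token: str = "-xxx-") -> str:
    """Replace every match of the identifier patterns with the masking token."""
    for pattern in PATTERNS:
        text = re.sub(pattern, token, text)
    return text

line = "His wife died on 29 April 1999 leaving two children, born in 1989 and 1991."
print(mask_identifiers(line))
# His wife died on -xxx- leaving two children, born in -xxx- and -xxx-.
```

Names, locations and organisations have no such regular surface form, which is precisely why the general problem requires the NLP techniques discussed in this book rather than pattern lists.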
6.5.2 Masking to reduce risk of linkability
Finally, WP 216 requires an assessment of linkability. Can the de-identified document be linked back to the original (provided that the data controller retains the original)?
Linkability is easily achieved by searching for overlapping text phrases found both in the de-identified text and in the original source. Most phrases of more than 5-6 words will rarely occur more than once in a collection of text documents, and can therefore easily be exploited to link de-identified documents with their original source. In the 13 759 court cases employed in this case study, for instance:
- The phrase “was advised that an appeal against”, found in line 6 of the above example, occurs only once in the corpus.
- Phrases like “rejected his claim” and “could not be accepted”, found in lines 4 and 6 respectively, are individually found in multiple documents in this corpus, but they co-occur only once.
To robustly protect against linkability, one would therefore need to remove all phrases that can, either in isolation or in combination, be employed to trace back to the original version. This can be done by way of an inverted index, mapping each word/phrase to the documents in which it appears:
```python
from typing import List, Dict, Set

def create_inverted_index(documents: List[str]) -> Dict[str, Set[int]]:
    """
    Create an inverted index for the given list of documents.

    Args:
        documents: A list of strings representing the documents.

    Returns:
        An inverted index dictionary where the keys are words and the
        values are sets of document IDs.
    """
    inverted_index = {}
    for doc_id, document in enumerate(documents):
        words = document.split()
        for word in words:
            if word not in inverted_index:
                inverted_index[word] = set()
            inverted_index[word].add(doc_id)
    return inverted_index

def find_unique_phrases(inverted_index: Dict[str, Set[int]]) -> List[str]:
    """Return the words/phrases that occur in exactly one document."""
    unique_phrases = []
    for phrase, doc_ids in inverted_index.items():
        if len(doc_ids) == 1:
            unique_phrases.append(phrase)
    return unique_phrases

def remove_unique_phrases(documents: List[str], unique_phrases: List[str]) -> List[str]:
    """
    Remove unique phrases from a collection of documents.

    Args:
        documents: A list of strings representing the documents.
        unique_phrases: A list of unique phrases to be removed.

    Returns:
        A list of strings representing the documents with unique
        phrases removed.
    """
    unique = set(unique_phrases)  # set membership avoids quadratic scans
    cleaned_documents = []
    for document in documents:
        words = document.split()
        cleaned_words = [word for word in words if word not in unique]
        cleaned_documents.append(" ".join(cleaned_words))
    return cleaned_documents
```
As the example in Tab. 6.3 shows, this will often require the removal of most of the document’s content, severely reducing the information utility of the remaining document. Setting aside prepositions and articles, the only remaining content words are ‘applicant’, ‘born’, ‘lives’ and ‘law’.
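The inverted index above operates on single words; the observation that phrases of 5-6 words are near-unique suggests indexing word n-grams instead. A minimal sketch of this extension (the function names and toy documents are ours, not from the original study):

```python
def ngrams(text: str, n: int) -> list:
    """Return the list of word n-grams (as tuples) in the text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def unique_ngrams(documents: list, n: int = 6) -> dict:
    """Map each document ID to the n-grams occurring in no other document."""
    index = {}
    for doc_id, document in enumerate(documents):
        for gram in ngrams(document, n):
            index.setdefault(gram, set()).add(doc_id)
    unique = {doc_id: set() for doc_id in range(len(documents))}
    for gram, doc_ids in index.items():
        if len(doc_ids) == 1:
            unique[doc_ids.pop()].add(gram)
    return unique

docs = ["a b c d e f g", "a b c d e f h"]
# The shared 6-gram (a, b, c, d, e, f) is excluded; only the n-grams
# that fingerprint a single document remain.
print(unique_ngrams(docs, n=6))
```

Because nearly every long n-gram in a real document is unique, removing them all strips away most of the text, which is exactly the utility loss visible in Tab. 6.3.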