6 Can text ever be truly anonymized legally?
Anonymisation refers to the process of removing from a dataset all pieces of information that may, directly or indirectly, identify a human individual. For the data to be considered anonymous, this process must be both complete and irreversible (that is, it must be impossible to revert the process and recover the information that was concealed). This is a rather stringent requirement. Is it possible to achieve for text documents?
This section will first review the legal requirements around anonymisation, looking more specifically at the General Data Protection Regulation (GDPR) introduced in Europe in 2016. We will then see how those can be applied to text data, and illustrate it with a concrete use case.
6.1 GDPR and the concept of “anonymous data”
The GDPR applies to the processing of personal data. The notion of personal data has wide scope: any information that relates to an identifiable person is considered ‘personal’, regardless of its perceived or actual “sensitivity”. If the information can be linked to an individual, it is deemed personal and its processing falls within the scope of the regulation. Conversely, if it is not deemed as such, the GDPR does not apply.
Any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person. (GDPR, Article 4(1))
This distinction between personal and non-personal data is therefore of critical importance to understand: the data may be altered, stripped and reduced, but as long as it is deemed personal, the GDPR applies. Importantly, pseudonymous data, altered in such a way that personal data can no longer be attributed to a specific data subject without the use of external, additional information, is still considered personal, precisely because such ‘additional information’ can lead to re-identification and information leakage.
Anonymisation, therefore, is the process by which personal data is rendered ‘non-personal’ such that it falls outside the scope of the GDPR. That anonymous data has no explicit definition in the GDPR may seem surprising, but is likely a feature rather than a bug, one that ensures legal longevity in the face of rapid technological developments. The GDPR concerns itself with personal data, widely defined. Anonymous data can therefore only be understood as its antithesis, and the processor (namely you!) must find ways to ensure your data is out of scope.
Note that anonymisation inevitably implies that ‘personal data’ is being processed. As such, the anonymisation process itself falls under the scope of the GDPR, and all steps in Fig. 1.1 must adhere to the requirements therein. This includes (but is not limited to) the requirement that a legal basis for the anonymisation exists.
6.2 Anonymisation requirements
So, how can we ensure that data is no longer personal? We have seen that personal data is a core notion of the GDPR, and that its defining feature is that the data subject is identified or in principle identifiable. Determining identifiability is therefore a key test of anonymity.
To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
Recital 26 stipulates a test of reasonable likelihood of identification, one that duly considers technological developments and the relative effort that an attacker must invest to achieve re-identification.
At the time of writing, explicit guidance on how this likelihood should be assessed is pending. Our clues must therefore be drawn from guidance that predates the GDPR, where the Article 29 Working Party (WP) has issued conflicting positions which continue to be a source of scholarly debate.
The Article 29 Working Party may be viewed as the EDPB’s predecessor.
To date, WP 216 provides the most explicit criteria against which data anonymity, including text anonymity, can be assessed. We will now see how these can be applied to text data.
6.3 Linkage attacks
A text consists of a number of unique words, in unique sequences. Even if revealing items in the text are redacted or replaced with alternatives, the resulting text will likely bear a strong resemblance to the original document. This increases the risk of so-called ‘linkage attacks’, whereby the attacker is able to link the sanitised document back to its source and hence extract personal information.
Linking a sanitised document back to its source text is rather straightforward. The easiest way, demonstrated in (Weitzenboeck et al. 2022), is to automatically search for words or phrases that occur both in the sanitised document and in the set of possible target documents. One can then link the sanitised text to the document that has the largest proportion of common words or phrases. We show how to achieve this in a concrete use case below.
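The overlap-based linkage just described can be sketched in a few lines. The toy documents and function name below are ours, for illustration only; they are not drawn from the actual corpus or the original study.

```python
def link_sanitised_document(sanitised: str, candidates: list) -> int:
    """Return the index of the candidate document sharing the largest
    proportion of the sanitised document's words (a crude proxy for
    phrase overlap)."""
    sanitised_words = set(sanitised.lower().split())
    best_index, best_score = -1, -1.0
    for i, candidate in enumerate(candidates):
        candidate_words = set(candidate.lower().split())
        # Proportion of the sanitised document's words found in the candidate
        score = len(sanitised_words & candidate_words) / max(len(sanitised_words), 1)
        if score > best_score:
            best_index, best_score = i, score
    return best_index

corpus = [
    "The applicant was born in 1955 and lives in Bridgend.",
    "The defendant appealed the ruling on procedural grounds.",
]
sanitised = "The applicant was born in -xxx- and lives in -xxx- ."
print(link_sanitised_document(sanitised, corpus))  # links back to document 0
```

Even with the identifiers masked, the residual wording suffices to pick out the source document; phrase-level (rather than word-level) matching only makes the linkage more reliable.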
Of course, those attacks are only possible if the (real or imagined) attacker can be expected to have access to the original set of documents, that is, the raw documents before the sanitisation process took place. In the vast majority of cases, this is a very unlikely scenario, as those original documents (for instance, patient records stored in the IT database of a local hospital) are only available to the data owner.
Nevertheless, the mere possibility of such linkage attacks means the non-linkability requirement stipulated by WP 216 is not satisfied. Unless, of course, one is willing to permanently delete the original data, such that the linkage is no longer feasible. This is, however, a rather radical approach, which is rarely possible in practice and often prohibited by other legal provisions[^ For instance, this would mean that, in order to anonymise a patient record stored at a hospital to provide it to third parties (like medical researchers), one would need to erase the original patient record from the IT database of the hospital.].
6.4 What should you do?
Given the current state of affairs, the safest approach is to consider that text data cannot be anonymised (unless the original data is deleted) and will always remain personal data, even after the use of de-identification measures. This means that even after following the de-identification measures described herein, the continued processing of de-identified text data still requires a properly defined legal basis, such as consent, contract, legal obligation, vital interests, public task or legitimate interests.
It should, however, be noted that the “absolutist” interpretation of the anonymisation criteria that follows from WP 216 is not the only possible one. (Weitzenboeck et al. 2022) argues that a risk-based approach offers a more nuanced test consistent with GDPR’s purposes. However, in the absence of clear guidelines on this issue by the European Data Protection Board (EDPB), it remains unclear which interpretation to adopt.
Even after removing personal identifiers from a text, it often remains possible to link the sanitised text back to its original document, for instance by searching for common words or phrases. Given the current interpretation of the GDPR’s requirements, this means that de-identified texts can never be considered fully anonymous. However, de-identification does substantially reduce the disclosure risk, and helps us adhere to the data minimisation principle.
This does not mean, however, that text de-identification has no benefit. Indeed, data minimisation is one of the core principles of the GDPR and of most privacy regulations (Goldsteen et al. 2022). The principle of data minimisation states that one should only collect and retain the personal data that is strictly necessary to fulfil a given purpose. By masking the personal identifiers occurring in the text, text de-identification can drastically reduce the risk of disclosing personal information, and therefore help the data owner adhere to this principle.
6.5 Case study: Anonymisation of court cases
Let us illustrate the problem with a case study (initially published by Weitzenboeck et al. (2022)) related to the anonymisation of 13 759 court cases from the European Court of Human Rights (ECHR).
The cases include detailed, plain text information about various individuals, such as:
- name
- date of birth
- criminal record
- family status
- etc.
The information pertains both to the applicants and to other parties involved, such as witnesses, lawyers, judges and government agents. An example is provided in Tab. 6.1 below:
Tab. 6.1: Extract from the original text of an ECHR court case.

Line | Text |
---|---|
1 | The applicant [Mr Colin Joseph O’Brien] was born in 1955 and lives in Bridgend. |
2 | His wife died on 29 April 1999 leaving two children, born in 1989 and 1991. |
3 | In 1999 the applicant enquired about widows’ benefits and he was informed that he was not entitled to such benefits. |
4 | In early 2000 the applicant applied for widows’ benefits again and on 13 March 2000 the Benefits Agency rejected his claim. |
5 | He lodged an appeal against this decision on 16 March 2000 and this appeal was struck out on 23 May 2000 on the basis that it was misconceived. |
6 | On 16 May 2000 the applicant made an oral claim for Widow’s Bereavement Allowance to the Inland Revenue. On 23 May 2000 he was informed that his claim could not be accepted because there was no basis in domestic law allowing widowers to claim this benefit. The applicant was advised that an appeal against this decision would be bound to fail. |
7 | The applicant received child benefit in the sum of GBP 100 per month. |
Tab. 6.2: The same extract after masking direct and indirect identifiers.

Line | Text |
---|---|
1 | The applicant [-xxx- ] was born in -xxx- and lives in -xxx- . |
2 | His wife died on -xxx- leaving -xxx- children, born in -xxx- . |
3 | In -xxx- the applicant enquired about widows’ benefits and he was informed that he was not entitled to such benefits. |
4 | In -xxx- the applicant applied for widows’ benefits again and on -xxx- the -xxx- rejected his claim. |
5 | He lodged an appeal against this decision on -xxx- and this appeal was struck out on -xxx- on the basis that it was misconceived. |
6 | On -xxx- the applicant made an oral claim for Widow’s Bereavement Allowance to the Inland Revenue. On -xxx- he was informed that his claim could not be accepted because there was no basis in domestic law allowing widowers to claim this benefit. The applicant was advised that an appeal against this decision would be bound to fail. |
7 | The applicant received child benefit in the sum of -xxx- per month. |
Tab. 6.3: The same extract after additionally removing all words that could link the text back to its source.

Line | Text |
---|---|
1 | The applicant [-xxx- ] was born in -xxx- and lives -xxx- -xxx- . |
2 | -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- two -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- . |
3 | In -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- was -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- . |
4 | In -xxx- the applicant -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- the -xxx- -xxx- his -xxx- -xxx- . |
5 | -xxx- -xxx- an -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- the -xxx- that it was -xxx- -xxx- . |
6 | -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- for -xxx- -xxx- -xxx- -xxx- the -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- could -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- in -xxx- law -xxx- -xxx- to -xxx- this -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- this -xxx- -xxx- -xxx- -xxx- to -xxx- -xxx- . |
7 | The -xxx- -xxx- -xxx- -xxx- in the -xxx- -xxx- -xxx- -xxx- -xxx- -xxx- . |
6.5.1 De-identification to reduce risk of singling-out and inference
In order to reduce the risk of “singling-out” and identifying individuals in this record, we first attempt to remove all direct and potentially indirect identifiers, such as:
- names of individuals
- names of organisations
- places and geographical locations
- date and time indicators
- demographic attributes (age, gender, ethnicity, etc.)
- quantities (monetary values, number of convictions, etc.)
- codes (application number, phone number, etc.)
The result is seen in Tab. 6.2. The only residual information we may infer from the text is that it relates to:
- a male person
- a widower
- a father of more than one child
- likely a British national (given the references to British public agencies)
- who has been denied widows’ benefits at an undisclosed point in time
In the absence of additional information, we may reasonably assume that this information alone is insufficient to “single out” the identity of this person.
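The masking illustrated in Tab. 6.2 can be sketched with simple pattern matching. This is a toy illustration only: in practice, de-identification relies on trained named-entity recognisers, and the hand-written patterns below cover just a few easy cases (full dates, monetary amounts, years) to show the mechanics.

```python
import re

# Hypothetical identifier patterns, for illustration only. Order matters:
# full dates must be masked before bare years, or "29 April 1999" would
# be only partially masked.
PATTERNS = [
    r"\b\d{1,2} (January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b",  # full dates
    r"\bGBP \d+\b",                                   # monetary amounts
    r"\b(19|20)\d{2}\b",                              # years
]

def mask_identifiers(text: str, token: str = "-xxx-") -> str:
    """Replace every match of the identifier patterns with the masking token."""
    for pattern in PATTERNS:
        text = re.sub(pattern, token, text)
    return text

line = "His wife died on 29 April 1999 leaving two children, born in 1989 and 1991."
print(mask_identifiers(line))
# His wife died on -xxx- leaving two children, born in -xxx- and -xxx-.
```

Names, locations and organisations have no such regular surface form, which is precisely why the general problem requires the NLP techniques discussed in this book rather than pattern lists.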
6.5.2 Masking to reduce risk of linkability
Finally, WP 216 requires an assessment of linkability. Can the de-identified document be linked back to the original (provided that the data controller retains the original)?
Linkability is easily achieved by searching for overlapping text phrases found both in the de-identified text and in the original source. Most phrases of more than 5-6 words will rarely occur more than once in a collection of text documents, and can therefore easily be exploited to link de-identified documents with their original source. In the 13 759 court cases employed in this case study, for instance:
- The phrase “was advised that an appeal against”, found in line 6 of the above example, occurs only once in the corpus.
- Phrases like “rejected his claim” and “could not be accepted”, found in lines 4 and 6 respectively, are individually found in multiple documents in this corpus, but they co-occur only once.
To robustly protect against linkability, one would therefore need to remove all phrases that can, either in isolation or in combination, be employed to trace back to the original version. This can be done by way of an inverted index, mapping each word/phrase to the documents in which it appears:
```python
from typing import List, Dict, Set

def create_inverted_index(documents: List[str]) -> Dict[str, Set[int]]:
    """
    Create an inverted index for the given list of documents.

    Args:
        documents: A list of strings representing the documents.

    Returns:
        An inverted index dictionary where the keys are words and the
        values are sets of document IDs.
    """
    inverted_index = {}
    for doc_id, document in enumerate(documents):
        words = document.split()
        for word in words:
            if word not in inverted_index:
                inverted_index[word] = set()
            inverted_index[word].add(doc_id)
    return inverted_index

def find_unique_phrases(inverted_index: Dict[str, Set[int]]) -> List[str]:
    """Return the words/phrases that occur in exactly one document."""
    unique_phrases = []
    for phrase, doc_ids in inverted_index.items():
        if len(doc_ids) == 1:
            unique_phrases.append(phrase)
    return unique_phrases

def remove_unique_phrases(documents: List[str], unique_phrases: List[str]) -> List[str]:
    """
    Remove unique phrases from a collection of documents.

    Args:
        documents: A list of strings representing the documents.
        unique_phrases: A list of unique phrases to be removed.

    Returns:
        A list of strings representing the documents with unique
        phrases removed.
    """
    unique = set(unique_phrases)  # set membership avoids quadratic scans
    cleaned_documents = []
    for document in documents:
        words = document.split()
        cleaned_words = [word for word in words if word not in unique]
        cleaned_documents.append(" ".join(cleaned_words))
    return cleaned_documents
```
As the example in Tab. 6.3 shows, this will often require the removal of most of the document’s content, severely reducing the information utility of the remaining document. Setting aside prepositions and articles, the only remaining content words are ‘applicant’, ‘born’, ‘lives’ and ‘law’.
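The inverted index above operates on single words; the observation that phrases of 5-6 words are near-unique suggests indexing word n-grams instead. A minimal sketch of this extension (the function names and toy documents are ours, not from the original study):

```python
def ngrams(text: str, n: int) -> list:
    """Return the list of word n-grams (as tuples) in the text."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def unique_ngrams(documents: list, n: int = 6) -> dict:
    """Map each document ID to the n-grams occurring in no other document."""
    index = {}
    for doc_id, document in enumerate(documents):
        for gram in ngrams(document, n):
            index.setdefault(gram, set()).add(doc_id)
    unique = {doc_id: set() for doc_id in range(len(documents))}
    for gram, doc_ids in index.items():
        if len(doc_ids) == 1:
            unique[doc_ids.pop()].add(gram)
    return unique

docs = ["a b c d e f g", "a b c d e f h"]
# The shared 6-gram (a, b, c, d, e, f) is excluded; only the n-grams
# that fingerprint a single document remain.
print(unique_ngrams(docs, n=6))
```

Because nearly every long n-gram in a real document is unique, removing them all strips away most of the text, which is exactly the utility loss visible in Tab. 6.3.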