5  Evaluating the risk of sharing a de-identified text

Once you have found appropriate replacements for the PIIs you have identified in your text, you are ready to assess how well your chosen replacements preserve privacy in your documents. This section details how to assess the residual disclosure risk of your de-identified documents.

5.1 Generalities

There is no ‘one size fits all’ when it comes to text de-identification, as the quality of the desired output depends on several factors:

As we have seen in Chapter 2, PII can come in a wide variety of forms, corresponding to either direct identifiers or quasi-identifiers. The types of PII that ought to be detected and masked will vary from domain to domain. While some PII can be expected in many types of texts (such as names of persons and locations), others will be highly domain-specific. For instance, previous convictions may constitute important PII in court rulings, but will rarely be found in other types of documents. A crucial step in the design of any de-identification system is therefore to define the PII categories one wishes to take into account.

What constitutes the utility of the text strongly depends on the envisioned use case.

In many settings, the text documents are de-identified in order to be published or shared with third parties. In those cases, the de-identified text should remain human-readable and retain as much as possible of the content that the intended audience is looking for in those documents:

  • De-identified court rulings are an important information source for legal professionals (judges, lawyers, etc.), who read those rulings to understand the relation between the factual elements that constitute the case and the judgment that was pronounced.

  • Similarly, patient records are often de-identified to make them easier to share among medical professionals and researchers. The utility of the de-identified records lies therefore in the descriptions of medical conditions, symptoms, diagnoses and treatments that are part of the record.

Defining what to mask should therefore consider both which information might potentially lead to disclosure risks and what is important to retain in the edited text.

Text documents are not always de-identified for the purpose of sharing them with the public or with third parties. Indeed, de-identification can also be applied as a preprocessing step before sending documents to an external system employed for automated classification or filtering. In this use case, the de-identified texts do not need to be human-readable, and may be instead converted to a vector representation that is both privacy-preserving and optimised for the downstream task (Sousa and Kern (2023)).

Yet another end use is to de-identify text documents for the purpose of building a training set, for instance to train or fine-tune a language model. The main privacy risk here is not so much the data collection itself (at least if the data is stored securely), but rather the privacy leakages that can arise from the trained model (Kim et al. (2024)).

In tabular databases, a row typically expresses information about a single individual. This is not the case for text documents, which may mention multiple individuals and their relations to one another. It is not always necessary to protect the identity of all individuals. For instance, a court case may mention several defendants, victims, witnesses, judges and lawyers. While the defendants, victims, and witnesses should typically be protected, the individuals who are involved in the court ruling in their professional capacity (judges, lawyers, and witnesses who participate as experts in the case) can often be kept in clear text.

It is sometimes also desirable to protect the identity of the author of the text. This is particularly important for user-generated content on social media, but may also happen in other domains (for instance, one may wish to conceal the identity of the doctor who wrote a particular note in a patient record). Approaches to author obfuscation can be used to this end (see Section 8.1).

The typical purpose of de-identification is to prevent identity disclosure, that is, to ensure that the identity of the person(s) mentioned in the document can no longer be inferred, either directly or indirectly. There are, however, other types of disclosure that may harm privacy:

  • Attribute disclosure happens when a particular attribute of the person is revealed, for instance the person’s ethnicity, sexual orientation, or health condition. Crucially, attribute disclosure may occur even when the identity of the person remains unknown1. Several approaches have been designed to prevent specific types of attribute disclosure, such as gender or ethnicity (see e.g. Elazar and Goldberg (2018), Xu et al. (2019)). Preventing attribute disclosure may be beneficial in some use cases, but often requires rewriting large swaths of the input text.

  • Membership disclosure happens when an adversary is able to determine whether a particular individual is included in a given dataset (in our case, a collection of text documents). To our knowledge, the only approaches developed to hinder membership disclosure (along with other types of disclosures) are based on Differential Privacy (DP)2. A notable example of recent work on DP applied to text is the DP-BART model of Igamberdiev and Habernal (2023). However, it should be noted that the output of those models is a text that is markedly different in both form and content from the original (although Meisenbacher and Matthes (2024) showed how measures of text similarity can be used to ensure the new text does not depart too much from the original one). In other words, although those methods are well-suited to generating synthetic texts, for instance for the purpose of training/fine-tuning LLMs, they address a slightly different task than standard de-identification.

De-identification is at its core an adversarial problem: we seek to transform the text so as to prevent a (real or imagined) adversary from singling out the individual referred to in a given document. Such adversaries will seek to perform this re-identification by exploiting the background knowledge available to them. A common choice (followed by e.g. Sánchez and Batet (2016)) is to simply assume this background knowledge consists of all texts published on the Web.

In some circumstances, there might be reasons to believe the adversary has access to background knowledge beyond what can be found through web search. For instance, adversaries might already have a list of individuals that they suspect may be referred to in the dataset.

For testing purposes, it is also useful to consider the case where the original documents (prior to de-identification) are themselves part of this background knowledge. This assumption is typically not very realistic (if the adversary does already have access to the original documents, they have no need to re-identify the de-identified version of those documents), but it does constitute a useful “worst-case scenario”. In particular, it makes it possible to assess whether the original and de-identified versions of the same document can be linked together.

5.2 Risk assessments based on human annotations

Measuring the residual privacy risk associated with a de-identified document is a difficult problem (see Papadopoulou et al. (2023) for a review of possible options). Two evaluation strategies are possible:

  • We can measure the risk by asking human experts to de-identify the same document, and then measure the difference between the expert de-identification and the de-identification produced by the system.

  • Alternatively, we can measure the risk by attempting to re-identify the documents, using either human or artificial “adversaries”.

The first evaluation strategy is to ask a number of human experts to manually de-identify a small set of documents and consider those expert de-identifications as a “gold standard” that the de-identification tool should seek to emulate.

5.2.1 Annotation process

One of the first and most important steps in annotating documents for PII is to write good annotation guidelines that the human annotators will rely on in their work. The guidelines should serve as a comprehensive manual that clearly defines each PII category and provides concrete examples. Importantly, they should also describe what annotators should do in “corner cases” where annotators may differ in how they interpret the instructions. The guidelines should ideally be written in an iterative fashion, as many of the corner cases will only be identified after an initial round of annotations.

Example guidelines

An example of guidelines for text de-identification can be found here.

There are several tools that can be used to collect human annotations of text data. Popular open-source tools are Doccano and BRAT, while Prodigy is a good commercial solution. It is also possible, although perhaps less efficient, to annotate text directly in a text editor, using XML tags or similar formats.

You should ideally have several annotators, as having several participants makes it easier to identify potential problems or discrepancies. For high-quality annotations, it is preferable to select annotators you know rather than hiring crowdworkers, who are cheap but generally provide low-quality annotations. Annotators should receive a short training on the guidelines and the annotation tool. They should also communicate with one another to discuss cases they are unsure about and, if necessary, update the guidelines.

There is always a balance to strike between the size and quality of the dataset. For best quality, each document should be annotated by several annotators to correct each other’s mistakes and omissions. However, under constant annotation resources (number of work hours by the annotators), this also means a lower total number of annotated documents. You should also compute the inter-annotator agreement to assess whether your annotators end up with consistent annotations or not (Artstein and Poesio (2008)). Two appropriate inter-annotator agreement (IAA) measures for text de-identification are Cohen’s \(\kappa\) and Krippendorff’s \(\alpha\).
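As a minimal illustration, Cohen’s \(\kappa\) can be computed at the token level by converting each annotator’s span annotations into per-token labels (here simply “MASK” vs. “O”); the token-level framing and the label names are assumptions made for this sketch, not a prescribed procedure:

```python
from sklearn.metrics import cohen_kappa_score

# Per-token labels derived from two annotators' span annotations on the same
# document: "MASK" if the token falls inside a span marked as PII, "O" otherwise.
annotator_1 = ["O", "MASK", "MASK", "O", "O", "MASK", "O", "O"]
annotator_2 = ["O", "MASK", "O",    "O", "O", "MASK", "O", "MASK"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```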

As always in machine learning, the collection of annotated documents from which the \(TP\), \(FN\) and \(FP\) counts described below are extracted needs to be separate from the documents employed to develop the de-identification tool (doing otherwise amounts to cheating).

5.2.2 Precision, recall and \(F_1\)

Based on the expert annotations (considered as a “gold standard”), one can run the de-identification tool on the same documents, and count the number of text spans that are masked by the expert(s) and the de-identification tool:

Table 5.1: Count of text spans masked by human expert(s) and de-identification tool
                        masked by human annotators   masked by de-identification tool
True positives (TP)     Yes                          Yes
False positives (FP)    No                           Yes
False negatives (FN)    Yes                          No

With the numbers in Tab. 5.1, one can then compute the precision \(P\) and recall \(R\) of our tool:

\[P = \frac{TP}{TP + FP}\]

Precision (\(P\)) measures the proportion of text spans masked by the de-identification tool that are indeed PII (according to the experts).

\[R = \frac{TP}{TP + FN}\]

Recall (\(R\)) measures the proportion of PII text spans (again, according to the experts) that the de-identification tool managed to detect.

The two measures can be combined into a single metric called \(F_1\), which is defined as the harmonic mean of the two:

\[F_1 = 2 \, \frac{P \cdot R}{P + R}\]

From a privacy perspective, the most important measure is the recall, as it captures the extent to which the de-identifier is able to recognize what needs to be masked.

The precision, on the other hand, is an indicator of the system’s ability to preserve the utility of the text. Indeed, a high-precision de-identifier will only mask the text spans that must be masked, but not more.
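As an illustration, here is a minimal sketch of how precision, recall and \(F_1\) can be computed when both the expert annotations and the system output are represented as sets of (start, end) character offsets; the exact-match criterion is an assumption, and partial-match variants are also possible:

```python
def precision_recall_f1(gold_spans, system_spans):
    """Compute precision, recall and F1 over masked text spans, counting a
    system span as correct only if it exactly matches a gold-standard span."""
    gold = set(gold_spans)
    system = set(system_spans)
    tp = len(gold & system)      # masked by both the experts and the tool
    fp = len(system - gold)      # masked by the tool but not by the experts
    fn = len(gold - system)      # masked by the experts but missed by the tool
    precision = tp / (tp + fp) if system else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: spans represented as (start_offset, end_offset) pairs
gold_spans = [(0, 11), (25, 34), (58, 70)]
system_spans = [(0, 11), (58, 70), (80, 85)]
print(precision_recall_f1(gold_spans, system_spans))  # ≈ (0.67, 0.67, 0.67)
```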

Warning

Evaluating the de-identification performance using the standard precision, recall, and \(F_1\) metrics is a good start, but also has some important limitations:

  • Traditional recall measures consider all types of PII in the same manner, and thus fail to account for the fact that some PII have a much larger influence on the disclosure risk than others. In particular, failing to detect a direct identifier such as a full person name is much more harmful from a privacy perspective than failing to detect a quasi-identifier.

  • Standard recall is typically applied at the level of text spans. However, for PII that are mentioned multiple times within the same document, it only makes sense to consider a PII as “concealed” if all of its occurrences are masked3.

Pilán et al. (2022) presented a novel set of three privacy-oriented evaluation metrics that seek to address those limitations:

  1. An entity-level recall on direct identifiers \(\textit{ER}_{di}\)

  2. An entity-level recall on quasi identifiers \(\textit{ER}_{qi}\)

  3. A token-level weighted precision on both direct and quasi identifiers \(\textit{WP}_{di+qi}\).

Please see the paper above for technical details about those metrics, as well as the associated Python code made available to apply those metrics to your test set.
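For intuition only, the sketch below shows what an entity-level recall might look like: an entity counts as successfully concealed only if every one of its mentions is covered by a mask. This is a simplified illustration, not the implementation released by Pilán et al. (2022), and the data structures are assumptions:

```python
def entity_level_recall(entities, system_spans):
    """Entity-level recall: an entity is counted as recalled (concealed)
    only if *all* of its mention spans are covered by the system's masks.

    entities: dict mapping an entity id to a list of (start, end) mention offsets
    system_spans: list of (start, end) offsets masked by the de-identification tool
    """
    def covered(mention):
        start, end = mention
        return any(s <= start and end <= e for (s, e) in system_spans)

    recalled = sum(1 for mentions in entities.values()
                   if all(covered(m) for m in mentions))
    return recalled / len(entities) if entities else 1.0

# Example: one entity mentioned twice, another mentioned once
entities = {"PERSON_1": [(0, 11), (140, 151)], "ORG_1": [(60, 75)]}
system_spans = [(0, 11), (60, 75)]   # the second mention of PERSON_1 is missed
print(entity_level_recall(entities, system_spans))  # 0.5
```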

It should be noted that all the metrics described above exclusively focus on the task of detecting PII, and do not assess the quality of the replacements.

5.3 Risk assessments based on re-identification attacks

One can also assess the disclosure risk by seeking to re-identify the text that is supposed to be free from PII. This re-identification can be done either manually or using “automated attackers”. The latter is, however, still an active research area, and there are no tools that can be applied to arbitrary texts.

The evaluation metrics described above are very useful to determine which PIIs are correctly masked by the de-identification tool and which are not. But they do require the work of human annotators to create the test set against which the output of the de-identifier is compared. Furthermore, human annotations can be prone to errors, omissions and inconsistencies.

As the overarching goal of de-identification is to minimize the disclosure risk, an alternative to human annotations is to directly conduct re-identification attacks and determine whether one can find out the identity of the individual we seek to protect based on the information that remains in the de-identified document.

5.3.1 Human re-identification attacks

One possible approach is to ask human experts to examine the de-identified document and try to find out (within some time limit) the identity of the person(s) based on some background knowledge. A crucial question is then which background knowledge to provide to those human attackers. A common assumption, used in e.g. Sánchez and Batet (2016) and Pilán et al. (2022), is that anything published on the web can be used as background knowledge, and thus that the human attacker is free to browse the web for various cues and pieces of information that may help them uncover the identity of the person. This background knowledge may be augmented with other information sources, such as documents related to the de-identified text or a target list of persons that the attacker suspects are mentioned in the documents that have been de-identified.

5.3.2 Automated attacks

Human-led re-identification attacks are of course quite time-consuming4. Although the development of automated attackers is very much an ongoing research question, several approaches have been proposed. In particular, Manzanares-Salor, Sánchez, and Lison (2024) propose an approach that uses a set of documents related to the de-identified texts as “background knowledge”. Those background documents are then employed to train a machine learning model that takes a de-identified document as input and seeks to predict the identity of the individual associated with it.
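As a rough illustration of this idea (and not the actual implementation of Manzanares-Salor, Sánchez, and Lison (2024)), one can train a simple text classifier on the background documents, using the identity of the individual each document refers to as the class label, and then measure how often the classifier predicts the correct identity for the de-identified texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical background documents available to the attacker, each labelled
# with the identity of the individual it refers to
background_docs = [
    "Background text known to describe the first individual ...",
    "Background text known to describe the second individual ...",
    "Background text known to describe the third individual ...",
]
identities = ["person_1", "person_2", "person_3"]

attacker = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
attacker.fit(background_docs, identities)

# Apply the trained attacker to the de-identified documents; the proportion of
# correct predictions gives an estimate of the residual re-identification risk
deidentified_docs = ["De-identified text whose individual we try to guess ..."]
predictions = attacker.predict(deidentified_docs)
print(predictions)
```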

Ongoing (currently unpublished) research on retrieval-augmented re-identification attacks also shows that it is possible to build an attacker that automatically fetches relevant texts from a database of background documents, and uses them to predict plausible values for the masked PIIs of the de-identified document.

Although automated attackers are theoretically appealing, it is worth stressing that:

  • Their performance is highly dependent on the background knowledge that is assumed to be available for the attack.

  • They might not cover the full set of possible inferences and reasoning patterns that a human attacker can draw upon.

5.4 Evaluation of the data utility

Text de-identification must address two competing objectives: minimize the risk of disclosing personal information, but also do so while retaining as much content as possible, such as to preserve the utility of the document for downstream uses. How can we estimate the loss of utility resulting from the de-identification process?

At the document level, one simple approach is to measure the data quality in terms of the similarity between the initial and the de-identified documents, based on the assumption that a high similarity indicates that the de-identification has preserved the core semantic content of the text. The most straightforward method is to derive document vectors for both the initial and the de-identified documents, for instance using sentence transformers, and then compute the cosine similarity between the two vectors. More advanced methods for computing similarity measures are of course possible, based on e.g. neural text matching (Yang et al. (2019)).
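A minimal sketch of this document-level similarity computation, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model (the model is only an example; any other document encoder could be used instead):

```python
from sentence_transformers import SentenceTransformer, util

# Example encoder; any sentence/document embedding model could be used instead
model = SentenceTransformer("all-MiniLM-L6-v2")

original_doc = "Full text of the original document ..."
deidentified_doc = "Full text of the de-identified document ..."

# Encode both documents into fixed-size vectors and compare them
embeddings = model.encode([original_doc, deidentified_doc])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity between original and de-identified text: {similarity:.3f}")
```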


  1. Imagine a set of 1000 patient records, 5 of which relate to persons that are 36 years old, and assume those 5 persons all suffer from a specific heart condition. Then, if one knows a person who is 36 years old and whose patient record is part of the data collection, one can infer that this person suffers from a heart condition, even without needing to determine which of those 5 patient records belongs to that individual.↩︎

  2. Differential privacy (Dwork et al. (2006)) is a privacy model that defines anonymisation in terms of randomised algorithms for computing statistics from the data. DP provides guarantees that the statistics cannot be used to learn anything substantial about any individual.↩︎

  3. For instance, if a person name is mentioned 4 times in a document, and the anonymization method is able to correctly mask three of those mentions, the anonymized text will still retain one mention of that person name in clear text – a piece of information that can be exploited by an adversary seeking to re-identify the individual we aim to protect.↩︎

  4. Even more so than human annotations, as re-identification attacks need to be rerun every time the de-identification tool is updated. In contrast, once annotated, a dataset can easily be applied to successive iterations of the de-identifier.↩︎