5 Evaluating the risk of sharing a de-identified text
Once you have found appropriate replacements for the PII you have identified in your text, you are ready to assess how well your chosen replacements preserve privacy in your documents. This section details how you go about assessing the residual disclosure risk of your de-identified documents.
5.1 Generalities
There is no ‘one size fits all’ when it comes to text de-identification, as the quality of the desired output depends on several factors:
5.2 Risk assessments based on human annotations
Measuring the residual privacy risk associated with a de-identified document is a difficult problem (see Papadopoulou et al. (2023) for a review of possible options). Two evaluation strategies are possible:
We can measure the risk by asking human experts to de-identify the same document, and then measure the difference between the expert de-identification and the de-identification produced by the system.
Alternatively, we can measure the risk by attempting to re-identify the documents, using either human or artificial “adversaries”.
The first evaluation strategy is to ask a number of human experts to manually de-identify a small set of documents and consider those expert de-identifications as a “gold standard” that the de-identification tool should seek to emulate.
5.2.1 Annotation process
One of the first and most important steps in annotating documents for PII is to write good annotation guidelines that the human annotators will rely on in their work. The guidelines should serve as a comprehensive manual that clearly defines what constitutes each PII category, along with concrete examples. Importantly, they should also describe what annotators should do in “corner cases” where annotators may differ in how they understand the instructions. The guidelines should ideally be written in an iterative fashion, as many of the corner cases will only be identified after doing an initial round of annotations.
An example of guidelines for text de-identification can be found here.
There are several tools that can be used to collect human annotations of text data. Popular open-source tools are Doccano and BRAT, while Prodigy is a good commercial solution. It is also possible, although perhaps less efficient, to annotate text directly in a text editor, using XML tags or similar formats.
You should ideally have several annotators, as having several participants will make it easier to identify potential problems or discrepancies. For high-quality annotations, it is preferable to select annotators you know, rather than hiring crowdworkers, which is cheaper but generally yields lower-quality annotations. Annotators should receive a short training on the guidelines and the annotation tool. They should also communicate with one another to discuss cases they are unsure about, and if necessary update the guidelines.
There is always a balance to strike between the size and quality of the dataset. For best quality, each document should be annotated by several annotators to correct each other’s mistakes and omissions. However, under constant annotation resources (number of work hours by the annotators), this also means a lower total number of annotated documents. You should also compute the inter-annotator agreement to assess whether your annotators end up with consistent annotations or not (Artstein and Poesio (2008)). Two appropriate inter-annotator agreement (IAA) measures for text de-identification are Cohen’s \(\kappa\) and Krippendorff’s \(\alpha\).
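As a minimal sketch, Cohen’s \(\kappa\) can for instance be computed over token-level masking decisions with scikit-learn; the binary labels below (1 = token masked, 0 = token kept) are made up for the example:

```python
# Minimal sketch: token-level inter-annotator agreement with Cohen's kappa.
# The binary labels (1 = token masked, 0 = token kept) are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Masking decisions of two annotators over the same ten tokens
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.3f}")
```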
As always in machine learning, the collection of annotated documents from which the \(TP\), \(FN\) and \(FP\) counts described in the next section are extracted needs to be kept separate from the documents employed for developing the de-identification tool (doing otherwise amounts to cheating).
5.2.2 Precision, recall and \(F_1\)
Based on the expert annotations (considered as a “gold standard”), one can run the de-identification tool on the same documents, and count the number of text spans that are masked by the expert(s) and the de-identification tool:
Table 5.1: Possible outcomes when comparing the text spans masked by the human annotators and by the de-identification tool.

|  | Masked by human annotators | Masked by de-identification tool |
|---|---|---|
| True positives (TP) | Yes | Yes |
| False positives (FP) | No | Yes |
| False negatives (FN) | Yes | No |
With the numbers in Tab. 5.1, one can then compute the precision \(P\) and recall \(R\) of our tool:
\[P = \frac{TP}{FP + TP}\]
Precision (\(P\)) measures the proportion of text spans masked by the de-identification tool that are indeed PII (according to the experts)
\[R = \frac{TP}{FN + TP}\]
Recall (\(R\)) measures the proportion of PII text spans (again, according to the experts) that the de-identification tool managed to detect
The two measures can be combined into a single metric called \(F_1\), which is defined as the harmonic mean of the two:
\[F_1 = \frac{2 \cdot P \cdot R}{P + R}\]
From a privacy perspective, the most important measure is the recall, as it captures the extent to which the de-identifier is able to recognize what needs to be masked.
The precision, on the other hand, is an indicator of the system’s ability to preserve the utility of the text. Indeed, a high-precision de-identifier will only mask the text spans that must be masked, but not more.
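As a minimal sketch, the example below computes these metrics from gold and system annotations represented as sets of (start, end) character offsets; the spans are hypothetical, and the exact-match comparison ignores partially overlapping spans, which a real evaluation would need to handle:

```python
# Minimal sketch: precision, recall and F1 over masked text spans,
# represented as (start, end) character offsets. All spans are hypothetical.
gold_spans = {(0, 10), (25, 32), (58, 71), (90, 99)}    # masked by the experts
system_spans = {(0, 10), (25, 32), (40, 45), (90, 99)}  # masked by the tool

tp = len(gold_spans & system_spans)  # masked by both
fp = len(system_spans - gold_spans)  # masked by the tool only
fn = len(gold_spans - system_spans)  # masked by the experts only

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"P = {precision:.2f}, R = {recall:.2f}, F1 = {f1:.2f}")
```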
Evaluating the de-identification performance using the standard precision, recall, and \(F_1\) metrics is a good start, but also has some important limitations:
Traditional recall measures consider all types of PII in the same manner, and thus fail to account for the fact that some PII have a much larger influence on the disclosure risk than others. In particular, failing to detect a direct identifier such as a full person name is much more harmful from a privacy perspective than failing to detect a quasi-identifier.
Standard recall is typically applied at the level of text spans. However, for PII that are mentioned multiple times within the same document, it only makes sense to consider a PII as “concealed” if all of its occurrences are masked³.
Pilán et al. (2022) presented a novel set of three privacy-oriented evaluation metrics that seek to address those limitations:
An entity-level recall on direct identifiers \(\textit{ER}_{di}\)
An entity-level recall on quasi-identifiers \(\textit{ER}_{qi}\)
A token-level weighted precision on both direct and quasi identifiers \(\textit{WP}_{di+qi}\).
Please see the paper for technical details about those metrics, as well as the associated Python code that can be used to apply them to your test set.
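To give an intuition of what an entity-level recall involves, here is a simplified sketch (not the implementation released with the paper) in which an entity only counts as concealed if every one of its mentions is masked; the annotations are hypothetical:

```python
# Simplified sketch of an entity-level recall: an entity is concealed
# only if ALL of its mention spans are masked by the tool.
# See Pilán et al. (2022) for the actual metric definitions.

# Hypothetical gold annotations: entity -> list of (start, end) mentions
gold_entities = {
    "person_1": [(0, 10), (120, 130), (250, 260)],
    "city_1": [(45, 51)],
}
# Hypothetical spans masked by the de-identification tool
system_spans = {(0, 10), (120, 130), (45, 51)}

concealed = [
    entity for entity, mentions in gold_entities.items()
    if all(span in system_spans for span in mentions)
]
entity_recall = len(concealed) / len(gold_entities)
print(f"Entity-level recall: {entity_recall:.2f}")  # person_1 leaks one mention -> 0.50
```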
It should be noted that all the metrics described above exclusively focus on the task of detecting PII, and do not assess the quality of the replacements.
5.3 Risk assessments based on re-identification attacks
One can also assess the disclosure risk by seeking to re-identify the text that is supposed to be free from PII. This re-identification can be done either manually or using “automated attackers”. The latter is, however, still an active research area, and there are as yet no off-the-shelf tools that can be applied to arbitrary texts.
The evaluation metrics described above are very useful to determine which PIIs are correctly masked by the de-identification tool and which are not. But they do require the work of human annotators to create the test set against which the output of the de-identifier can be compared. Furthermore, human annotations can be prone to errors, omissions and inconsistencies.
As the overarching goal of de-identification is to minimize the disclosure risk, an alternative to human annotations is to directly conduct re-identification attacks and determine whether one can find out the identity of the individual we seek to protect based on the information that remains in the de-identified document.
5.3.1 Human re-identification attacks
One possible approach is to ask human experts to simply have a look at the de-identified document and try to find out (given some time limit) the identity of the person(s) based on some background knowledge. A crucial question is then which background knowledge to provide to those human attackers. A common assumption, used in e.g. Sánchez and Batet (2016) and Pilán et al. (2022), is that anything published on the web can be used as background knowledge, and thus that the human attacker is free to browse the web for various cues and information pieces that may help them uncover the identity of the person. This background knowledge may be augmented with other information sources, such as documents related to the de-identified text or a target list of persons suspected to be mentioned in the documents that have been de-identified.
5.3.2 Automated attacks
Human-led re-identification attacks are of course quite time-consuming⁴. Although the development of automated attackers is very much an ongoing research question, several approaches have been proposed. In particular, Manzanares-Salor, Sánchez, and Lison (2024) propose an approach that uses a set of documents related to the de-identified texts as “background knowledge”. Those background documents are then employed to train a machine learning model that takes a de-identified document as input and seeks to predict the identity of the individual associated with it.
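As a rough illustration of this general idea (a drastically simplified stand-in, not the actual method of Manzanares-Salor, Sánchez, and Lison (2024)), one could train a text classifier on background documents labelled with the individual each one describes, and then ask it to guess who a de-identified document is about:

```python
# Toy sketch of an automated re-identification attack: a classifier
# trained on background documents (each labelled with the individual
# it describes) tries to predict who a de-identified text is about.
# Simplified illustration only; not the method from the cited paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical background knowledge about known individuals
background_docs = [
    "Alice Smith is a 52-year-old cardiologist living in Oslo.",
    "Bob Jones, 36, is a teacher who was treated for a heart condition.",
]
identities = ["Alice Smith", "Bob Jones"]

attacker = make_pipeline(TfidfVectorizer(), LogisticRegression())
attacker.fit(background_docs, identities)

# The attacker's best guess for a de-identified document
deidentified_doc = "The patient, a 36-year-old teacher, has a heart condition."
print(attacker.predict([deidentified_doc])[0])  # likely "Bob Jones"
```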
Ongoing (currently unpublished) research on retrieval-augmented re-identification attacks also shows that it is possible to build an attacker that automatically fetches relevant texts from a database of background documents, and uses them to predict plausible values for the masked PIIs of the de-identified document.
Although automated attackers are theoretically appealing, it is worth stressing that:
Their performance is highly dependent on the background knowledge that is assumed to be available for the attack.
They might not cover the full set of possible inferences and reasoning patterns that a human attacker can draw upon.
5.4 Evaluation of the data utility
Text de-identification must address two competing objectives: minimizing the risk of disclosing personal information, while retaining as much content as possible, so as to preserve the utility of the document for downstream uses. How can we estimate the loss of utility resulting from the de-identification process?
At the document level, one simple approach is to measure the data quality in terms of the similarity between the initial and the de-identified documents, based on the assumption that a high similarity indicates that the de-identification could preserve the core semantic content of the text. The most straightforward method is to derive document vectors for both the initial and the de-identified documents, for instance using sentence transformers, and then compute the cosine similarity between the two vectors. More advanced methods for computing similarity measures are of course possible, based on e.g. neural text matching (Yang et al. (2019)).
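As a minimal sketch of this document-level approach (the model name below is just one common choice, not a specific recommendation):

```python
# Minimal sketch: cosine similarity between the original and the
# de-identified document, using the sentence-transformers library.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

original = "John Smith, a 36-year-old teacher from Bergen, was admitted on May 3."
deidentified = "***, a *** teacher from ***, was admitted on ***."

embeddings = model.encode([original, deidentified])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Cosine similarity: {similarity:.3f}")
```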
³ For instance, if a person name is mentioned 4 times in a document, and the anonymization method is able to correctly mask three of those mentions, the anonymized text will still retain one mention of that person name in clear text – information that can be exploited by an adversary seeking to re-identify the individual we aim to protect.
⁴ Even more than human annotations, as re-identification attacks need to be rerun every time the de-identification tool is updated. In contrast, once annotated, a dataset can easily be applied to successive iterations of the de-identifier.