4 Replacing personal information

The previous sections explained how to detect personally identifiable information (PII) in text. But what do we do with the detected spans once this step is complete? We now review how to edit the text to reduce the privacy risk associated with them.

4.1 Edit operations

PII may be replaced by placeholders (***), slightly more informative masks ([PERSON]), pseudonyms or generalisations. You should select the type of replacement that offers the best balance between privacy and utility for your problem.

The text spans associated with PII may be replaced by:

1. A non-descriptive placeholder such as *** or XXX.
2. A mask that denotes the type of entity that was edited, such as [PERSON], possibly with a way to distinguish between entities, such as [PERSON 1] and [PERSON 2].
3. Randomly selected pseudonyms or fictive PII, such as replacing Lilian Trosterud with Kari Nordmann or Bjørg Eva Ødegaard.
4. Generalisations of the initial PII span, for instance by replacing Mo i Rana with [town] or [town in Norway].
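The four kinds of replacement can be compared side by side on a single sentence. The following is a minimal sketch in plain Python, assuming the PII spans have already been detected and are available as character offsets. The helper `apply_replacements` and the mask label [LOCATION 1] are illustrative names, not part of any library.

```python
def apply_replacements(text, edits):
    """Apply (start, end, replacement) edits to text.

    Edits are applied from right to left so that the character offsets
    of the spans that are still to be processed remain valid.
    """
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "Lilian Trosterud lives in Mo i Rana."

# Assumed output of a detection step: character offsets of two PII spans.
name, town = "Lilian Trosterud", "Mo i Rana"
name_span = (text.index(name), text.index(name) + len(name))
town_span = (text.index(town), text.index(town) + len(town))

results = {
    # Non-descriptive placeholder
    "placeholder": apply_replacements(
        text, [(*name_span, "***"), (*town_span, "***")]),
    # Mask indicating the entity type, with an index per entity
    "mask": apply_replacements(
        text, [(*name_span, "[PERSON 1]"), (*town_span, "[LOCATION 1]")]),
    # Pseudonym (here only the person name is replaced)
    "pseudonym": apply_replacements(
        text, [(*name_span, "Kari Nordmann")]),
    # Generalisation of each span
    "generalisation": apply_replacements(
        text, [(*name_span, "[Norwegian woman]"), (*town_span, "[town in Norway]")]),
}

for strategy, edited in results.items():
    print(f"{strategy:>15}: {edited}")
```

Comparing the four outputs makes the trade-off concrete: the placeholder version hides the most but says the least, while the generalised version stays readable at the cost of leaking some attributes of the person and place.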

As always in text de-identification, the purpose of those replacements is to balance privacy and utility: they must minimise the risk of disclosing the identity of the individual in question, but must also retain as much of the original content as possible. While generic placeholders and entity masks provide good privacy protection, they may remove a lot of useful information from the text and reduce its readability. In contrast, generalisations provide more informative replacements, but we need to make sure that they are generic enough to avoid providing cues that may lead to re-identifying the individual we seek to protect.

4.2 Finding good replacements

Candidate generalisations can be generated either from ontologies, such as the ones available in Wikipedia (see Olstad, Papadopoulou, and Lison (2023)), or by prompting an LLM (forthcoming).
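To illustrate the idea of choosing among generalisations of varying specificity, here is a minimal sketch that uses a small hand-written lookup table as a stand-in for a real ontology or LLM. The table `GENERALISATIONS` and the function `generalise` are hypothetical and only serve the example; a real system would derive the candidates automatically.

```python
# Hypothetical, hand-written table. Each entry lists candidate
# generalisations ordered from most specific to most generic.
GENERALISATIONS = {
    "Mo i Rana": ["town in Norway", "town"],
    "Lilian Trosterud": ["Norwegian woman", "woman", "person"],
}


def generalise(span, level=0, fallback="***"):
    """Return a bracketed generalisation of `span`.

    `level` selects how generic the replacement is (0 = most specific).
    Levels beyond the end of the list fall back to the most generic
    candidate, and unknown spans fall back to a plain placeholder.
    """
    candidates = GENERALISATIONS.get(span)
    if not candidates:
        return fallback
    return "[" + candidates[min(level, len(candidates) - 1)] + "]"


print(generalise("Mo i Rana"))            # most specific candidate
print(generalise("Mo i Rana", level=1))   # more generic candidate
print(generalise("Oslo"))                 # not in the table: placeholder
```

The `level` parameter makes the privacy/utility trade-off explicit: a higher level yields a vaguer replacement that is less likely to help re-identification.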

In many applications, an important criterion is that the replacements be truth-preserving: they may make the edited sentence vaguer or more abstract than the original one, but should not distort its meaning¹. This is particularly important when de-identifying documents such as court cases or electronic patient records. In those cases, random pseudonyms (the third option listed above) should not be used, as they do not preserve the semantics of the sentence.

4.3 Ensuring consistency and readability

The same person, place or organization may be mentioned multiple times in the same document. To keep the document as readable as possible, it is useful to ensure that all mentions of the same entity receive the same replacement. For instance, if Lilian Trosterud is mentioned three times in the text, all those mentions should be replaced by the same mask, such as [PERSON 2]. Satisfying this constraint can be challenging, as the same entity may be referred to in slightly different ways throughout the document. For instance, Lilian Trosterud may also be referred to as L. Trosterud or Trosterud, Lilian.

To find the set of PII text spans that refer to the same underlying ‘entity’ (like a person or a place), one can rely on simple string processing measures. One such measure is the longest common subsequence (LCS), computed between all pairs of text spans detected in the document: two mentions are assumed to refer to the same entity if the LCS length is above a given threshold. Alternatively, one can run pretrained co-reference models such as F-COREF to determine whether two PII text spans should be grouped together or not.
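The LCS-based grouping can be sketched as follows. This is a minimal illustration under several assumptions that are not prescribed above: the LCS length is normalised by the length of the shorter mention, the threshold is set to 0.8, comparison is case-insensitive, and each mention is greedily compared with the first mention of every existing group. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b
    (standard dynamic programming, keeping one row at a time)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def same_entity(a, b, threshold=0.8):
    """Assume two mentions co-refer if their LCS covers at least
    `threshold` of the shorter mention (an assumed normalisation)."""
    a, b = a.casefold(), b.casefold()
    shorter = min(len(a), len(b))
    return shorter > 0 and lcs_length(a, b) / shorter >= threshold


def group_mentions(mentions, threshold=0.8):
    """Greedily assign each mention to the first group whose first
    member it matches, or start a new group otherwise."""
    groups = []
    for mention in mentions:
        for group in groups:
            if same_entity(mention, group[0], threshold):
                group.append(mention)
                break
        else:
            groups.append([mention])
    return groups


mentions = ["Lilian Trosterud", "Kari Nordmann", "L. Trosterud", "Trosterud"]
groups = group_mentions(mentions)

# Every mention in the same group receives the same indexed mask.
masks = {m: f"[PERSON {i}]" for i, group in enumerate(groups, start=1) for m in group}
for mention, mask in masks.items():
    print(f"{mention!r} -> {mask}")
```

Such a measure remains a heuristic. A reordered mention such as Trosterud, Lilian shares a much shorter common subsequence with Lilian Trosterud and would fall below the 0.8 threshold assumed here, which is one reason why a co-reference model may be preferable.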

Caution

It is sometimes necessary to make small adjustments to the replacement string to ensure that its insertion does not break the grammatical structure of the sentence. For instance, the phrase “Lilian Trøsteruds hus” cannot simply be replaced by “*** hus”, as this would drop the genitive marker. This issue is especially important for languages with a rich morphology, where the replacement may need to be inflected.

4.4 Replacing PII with Presidio

Presidio provides functionalities for masking PII using various types of masks. Here is a simple example:

# !pip install presidio_analyzer presidio_anonymizer

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Ole Normann and I was born on November 3, 1965."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

for result in results:
    print("found match:", text[result.start:result.end], "(%s)"%result.entity_type)

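# Initialize the anonymizer engine:
engine = AnonymizerEngine()

# With no operators specified, each detected span is replaced
# by its entity type, e.g. <PERSON>
edited = engine.anonymize(text=text, analyzer_results=results)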

print("De-identified text:", edited.text)

New, custom operators may also be defined.

¹ Formally, the initial sentence S should entail the edited sentence S'. For instance, the initial sentence “Lilian Trosterud went to the supermarket yesterday” does entail “[Norwegian woman] went to the supermarket yesterday”, which shows that [Norwegian woman] is an appropriate generalisation for Lilian Trosterud. However, the pseudonym Bjørg Eva Ødegaard is not a truth-preserving replacement, as “Lilian Trosterud went to the supermarket yesterday” does not entail “Bjørg Eva Ødegaard went to the supermarket yesterday”.