2  Identifying personal information in text

The first step in text de-identification is to determine the parts of the text document that may disclose the identity of the person(s) being mentioned or referred to.

We define below what those parts correspond to and how to recognize them automatically.

2.1 Personally Identifiable Information

Personally Identifiable Information (PII) may correspond to direct identifiers, such as person names, phone numbers or private addresses. PII also includes quasi-identifiers, which do not directly reveal a person's identity, but may do so indirectly when combined with each other and with background knowledge.

Personally Identifiable Information (PII) is information that, alone or combined with other data sources, can disclose the identity of a person. It can be divided into two groups:

  • Direct identifiers, which are specific enough to point to a single individual (like a person's name, phone number or private address).
  • Quasi-identifiers, which provide some information about the individual, but not enough to single them out (like a person's age or gender, city or country of residence, name of employer or ethnicity). However, a combination of quasi-identifiers can often uncover the person's identity.
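
The effect of combining quasi-identifiers can be illustrated with a small sketch. The records below are hypothetical toy data invented for this example, and `count_unique` is an illustrative helper, not part of any library. It counts how many records are the only one with a given combination of attribute values.

```python
from collections import Counter

# Hypothetical toy records, invented for illustration only
records = [
    {"age": 34, "gender": "F", "city": "Oslo"},
    {"age": 34, "gender": "F", "city": "Bergen"},
    {"age": 34, "gender": "M", "city": "Oslo"},
    {"age": 51, "gender": "F", "city": "Bergen"},
    {"age": 51, "gender": "M", "city": "Oslo"},
    {"age": 51, "gender": "M", "city": "Bergen"},
]

def count_unique(records, attributes):
    """Counts the records whose values for the given attributes
    are shared with no other record."""
    keys = [tuple(record[a] for a in attributes) for record in records]
    frequencies = Counter(keys)
    return sum(1 for key in keys if frequencies[key] == 1)

for attributes in (["age"], ["gender"], ["city"],
                   ["age", "gender"], ["age", "gender", "city"]):
    print(attributes, "->", count_unique(records, attributes), "unique record(s)")
```

In this toy data, no single attribute isolates anyone, the pair (age, gender) isolates two records, and the three attributes together make all six records unique.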

In Tab. 1.1, the first column is a direct identifier, while the three following columns are so-called quasi-identifiers, which do not reveal the person in isolation, but may do so when combined together. Finally, the last column corresponds to a confidential attribute (in this case health information). An attribute is described as confidential if its disclosure is considered to be harmful to the individual. What to define as confidential is often context-dependent, but Art. 9 of the GDPR regulates the processing of so-called special categories of personal data (such as health data, ethnic origin or religious beliefs).

Can we automatically identify PII in text? To a large extent, yes. It turns out that many (but not all!) of those PII correspond to named entities, which can be detected with off-the-shelf named entity recognition (NER) tools. In addition to named entities, PII can also be extracted through regular expressions or gazetteers. We review those techniques below. If those are not sufficient to detect all text spans that need to be masked, one can also build custom entity recognition models, as described in Chapter 3.

2.2 Basic PII detection techniques

Many PII can be detected using relatively simple tools, such as Named Entity Recognition (NER), regular expressions and gazetteers.

2.2.1 Named Entity Recognition

Named entities are real-world entities, such as persons, organisations, places or products, that are denoted by a proper name. In many languages, such as English or Norwegian, named entities can be recognized by the fact that they take the form of proper nouns, where the first letter of each word is capitalised.
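
As a rough illustration of this capitalisation cue, the sketch below collects sequences of capitalised words with a regular expression. It is a naive heuristic written for this example (the name `capitalised_sequences` is hypothetical), not a description of how NER models work.

```python
import re

# Naive heuristic: one or more consecutive capitalised words.
# The character classes include the Norwegian letters æ, ø and å.
CAPITALISED = re.compile(r"\b[A-ZÆØÅ][a-zæøå]+(?: [A-ZÆØÅ][a-zæøå]+)*")

def capitalised_sequences(text):
    """Returns all sequences of capitalised words found in the text."""
    return [match.group() for match in CAPITALISED.finditer(text)]

text = "Ole Normann travelled from Oslo to New York. He liked it."
print(capitalised_sequences(text))
```

On this sentence the heuristic finds the three names, but it also returns the sentence-initial pronoun "He", which is not a named entity. Such false positives are one reason to rely on trained NER models rather than on capitalisation alone.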

Tip!

There are several tools that can be readily used to automatically detect named entities in text. One of the easiest is spaCy, a Python library for NLP (Natural Language Processing), which includes pretrained models for many languages.

Here is an example of code that can be used to run a pretrained spaCy NER model on a text:

Code
#!pip install spacy
#!python -m spacy download en_core_web_md
#!pip install pandas

import spacy
from spacy import displacy
import pandas as pd

class NamedEntityRecogniser:
    """Basic NER using a spaCy model"""

    def __init__(self, model_name="en_core_web_md"):
        # Loading the pre-trained NER model
        self.nlp = spacy.load(model_name)

    def __call__(self, text):
        """Runs the NER model and returns the recognition results as a pandas
        DataFrame where each row represents a detected text span."""

        # Processing the text
        doc = self.nlp(text)

        # Extracting the entities one by one
        entities = []
        for ent in doc.ents:
            entity = {"text": ent.text, "label": ent.label_,
                      "start": ent.start_char, "end": ent.end_char}
            entities.append(entity)
        
        # Creating a DataFrame with the results
        df = pd.DataFrame.from_records(entities)
        return df
    
    def display(self, text):
        """Displays the detected named entities in the Jupyter output"""
        doc = self.nlp(text)
        displacy.render(doc, style="ent", jupyter=True)

Although many PII take the form of named entities, there is no 1:1 relation between PII and named entities. Text documents may mention many named entities that are unrelated to the individuals we wish to protect – and will therefore not constitute PII. Conversely, PII may also be present in phrases that are not named entities – for instance, in the sentence:

Ole works as a carpenter and goes to church every Sunday,

the phrases “carpenter” and “goes to church every Sunday” are both quasi-identifiers that are not named entities.

2.2.2 Regular expressions

Many types of personal identifiers can also be detected by simple techniques such as regular expressions. This holds in particular for identification codes and numbers, email addresses or phone numbers.

Another benefit of regular expressions is their small footprint. Unlike other techniques such as NER, regular expressions are fast to run and consume virtually no memory. As a result, when deployed in web-based applications, they can easily run on the client side and be applied to the input before the data is sent to the web server.

Tip!

If you are not already an expert at writing regular expressions, you can easily prompt AI assistants such as ChatGPT, Gemini or Claude to help write regular expressions for a given entity type.
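
The sketch below shows what regex-based detection could look like for email addresses and phone numbers. Both patterns are illustrative assumptions written for this example: the email pattern is deliberately simple, and the phone pattern only covers Norwegian-style numbers (an optional +47 prefix followed by eight digits). The names `PATTERNS` and `find_pii` are hypothetical.

```python
import re

# Illustrative patterns only (assumptions, not validated against any standard)
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    # Optional +47 prefix, then 8 digits optionally grouped in pairs
    "PHONE": re.compile(r"(?<!\d)(?:\+47[ ]?)?\d{2}[ ]?\d{2}[ ]?\d{2}[ ]?\d{2}(?!\d)"),
}

def find_pii(text):
    """Returns (label, start, end, matched text) tuples sorted by position."""
    matches = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            matches.append((label, match.start(), match.end(), match.group()))
    return sorted(matches, key=lambda m: m[1])

text = "Contact Ole at ole.normann@example.com or +47 22 33 44 55."
for label, start, end, span in find_pii(text):
    print(label, start, end, span)
```

Real-world patterns usually need more care, for instance to handle other national phone formats or to validate check digits in identification numbers.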

2.2.3 Gazetteers

You may also have a predefined list of entities (strings) whose occurrences should be detected. If the list is relatively small, you can of course simply loop over each entity in the list and search for its occurrence in the input text. However, for lists of entities that span thousands or even millions of entries, this method is inefficient. A better approach is to build a gazetteer (that is, a system designed to search for entries in a text) based on a trie, which is a data structure that can efficiently represent lists of strings.

Code
#!pip install pyahocorasick

import ahocorasick

# We create the automaton that will store the entries
automaton = ahocorasick.Automaton(ahocorasick.STORE_LENGTH)

# In real-world cases, this list may include many thousand entries
long_list_of_substrings_to_detect = ["Ole Normann", "Kari Normann", "New York"]
for substring in long_list_of_substrings_to_detect:
    automaton.add_word(substring)

# We compile the automaton
automaton.make_automaton()

text = "This is a short text about Ole Normann."
for end_offset, length in automaton.iter_long(text):
    span = text[end_offset - length + 1:end_offset + 1]
    print("Found match:", span)
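
To make the trie idea more concrete, here is a minimal pure-Python sketch of a character-level trie with longest-match lookup. The class `TrieGazetteer` is a hypothetical illustration, not the data structure used internally by pyahocorasick: unlike the Aho-Corasick algorithm, it restarts the search from the root at each text position, so it is simpler but slower on long texts.

```python
class TrieGazetteer:
    """Minimal character-level trie with longest-match lookup."""

    _END = object()  # marker key meaning "an entry ends at this node"

    def __init__(self, entries):
        self.root = {}
        for entry in entries:
            node = self.root
            for char in entry:
                node = node.setdefault(char, {})
            node[self._END] = True

    def find_all(self, text):
        """Returns (start, end, matched text) tuples, keeping the longest
        entry found at each position and skipping past each match."""
        matches = []
        i = 0
        while i < len(text):
            node = self.root
            longest_end = None
            j = i
            # Walk down the trie as long as the characters match
            while j < len(text) and text[j] in node:
                node = node[text[j]]
                j += 1
                if self._END in node:
                    longest_end = j
            if longest_end is not None:
                matches.append((i, longest_end, text[i:longest_end]))
                i = longest_end
            else:
                i += 1
        return matches

gazetteer = TrieGazetteer(["Ole Normann", "Kari Normann", "New York", "New York City"])
for start, end, span in gazetteer.find_all("Ole Normann moved from New York City to Oslo."):
    print("Found match:", span, (start, end))
```

Because entries sharing a prefix (such as "New York" and "New York City") share a path in the trie, the lookup cost at each position depends on the length of the longest entry rather than on the number of entries.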

2.3 Putting it all together with Presidio

Presidio is a software tool that allows you to detect PII using a combination of detectors. Custom detectors can also be easily plugged in.

In practice, one will often use a combination of techniques to detect personally identifiable information. The Presidio tool, developed by Microsoft and released under an open-source MIT license, makes it relatively easy to detect PII in text. Presidio comes with a set of predefined PII detectors for a range of languages. Users of Presidio can also integrate custom detectors.
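
The general idea of combining detectors can be sketched independently of any particular tool. The code below is a simplified illustration under our own assumptions, not Presidio's actual mechanism: two hypothetical detectors (a date regex and a small name list) each return (start, end, label) spans, and `detect_all` merges them, preferring longer spans when candidates overlap.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_PATTERN = re.compile(r"\b(?:" + MONTHS + r") \d{1,2}, \d{4}\b")

def regex_detector(text):
    """Detects dates such as 'November 3, 1965' (illustrative pattern only)."""
    return [(m.start(), m.end(), "DATE") for m in DATE_PATTERN.finditer(text)]

def gazetteer_detector(text, names=("Ole Normann", "Kari Normann")):
    """Detects occurrences of names from a small, hypothetical list."""
    spans = []
    for name in names:
        start = text.find(name)
        while start != -1:
            spans.append((start, start + len(name), "PERSON"))
            start = text.find(name, start + 1)
    return spans

def detect_all(text, detectors):
    """Runs every detector and keeps non-overlapping spans,
    preferring longer spans when two candidates overlap."""
    candidates = [span for detector in detectors for span in detector(text)]
    candidates.sort(key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in candidates:
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append((start, end, label))
    return sorted(kept)

text = "Ole Normann was born on November 3, 1965."
for start, end, label in detect_all(text, [regex_detector, gazetteer_detector]):
    print("found match:", text[start:end], f"({label})")
```

Preferring the longest span is only one possible way of resolving overlaps. Other strategies, such as ranking candidates by a confidence score, are equally plausible.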

Here is a minimal example of code with Presidio:

Code
#!pip install presidio_analyzer

from presidio_analyzer import AnalyzerEngine

text = "My name is Ole Normann and I was born on November 3, 1965."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

for result in results:
    print("found match:", text[result.start:result.end], f"({result.entity_type})")

Warning

Note that Presidio works out of the box with English, but not with other languages such as Norwegian. See Section 7.2 for a guide on how to use Presidio with the Norwegian language.