7  Code examples

This notebook includes all the code snippets provided in this guide.

7.0.1 Named entity recognition (NER)

The goal of named entity recognition is to detect text spans that express real-world entities denoted by a proper name, such as persons, organisations, places or products. Modern NER tools are often built on top of a Large Language Model (LLM) backbone.

Here is a simple example of an NER tool using Spacy, which returns a pandas DataFrame listing all detected entities and their positions in the original text. Spacy requires you to select the model you want to apply (here en_core_web_md, a medium-sized English model trained on web data). The complete list of models available in Spacy can be found here.

Code
#!pip install spacy
#!python -m spacy download en_core_web_md
#!pip install pandas

import spacy
from spacy import displacy
import pandas as pd

class NamedEntityRecogniser:
    """Basic NER using a Spacy model"""

    def __init__(self, model_name="en_core_web_md"):
        # Loading the pre-trained NER model
        self.nlp = spacy.load(model_name)

    def __call__(self, text):
        """Runs the NER model and returns the recognition results as a pandas
        DataFrame where each row represents a detected text span."""

        # Processing the text
        doc = self.nlp(text)

        # Extracting the entities one by one
        entities = []
        for ent in doc.ents:
            entity = {"text":ent.text, "label":ent.label_, 
                      "start":ent.start_char, "end":ent.end_char}
            entities.append(entity)
        
        # Creating a DataFrame with the results
        df = pd.DataFrame.from_records(entities)
        return df
    
    def display(self, text):
        """Displays the detected named entities in the Jupyter output"""
        doc = self.nlp(text)
        displacy.render(doc, style="ent", jupyter=True)
Code
# loading the NER model 
detector = NamedEntityRecogniser()

# create a basic text string
text = """This is a text about Ola Nordmann, who lives in Stavanger in the southwest of Norway.
He works as a carpenter in a small company called Snekkeriet Rogaland AS. 
He struggles a bit with migraines from time to time, and has previously had longer periods of depression.
Then it's good to find peace in the mountains. He can be reached on 12 34 56 78 or at ola.nordann@etellerannet.no"""
print(text)

# Running the model on the text and displaying the results
df = detector(text)
df 

The recognition result can also be directly displayed along with the text:

Code
detector.display(text)

As we can see, the NER model does recognise many entities, in particular the person name, location and company name. However, the phone number “12 34 56 78” was mistaken for a date. In addition, one direct identifier (the email address) was missed, as email addresses are typically not regarded as named entities. Several quasi-identifiers are also left in the text, such as the fact that the person works as a carpenter, has occasional migraines and has had periods of depression (the last two being confidential attributes as well, since they provide information about the person’s health status).
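
Direct identifiers such as the email address and phone number are usually easier to capture with regular expressions than with an NER model. Here is a minimal sketch detecting them in the sample text above (the two patterns are simplified illustrations, not production-ready):

Code
import re

# Simplified, illustrative patterns for email addresses and for 8-digit
# Norwegian phone numbers written in groups of two digits
EMAIL_REGEX = r"[\w.+-]+@[\w-]+\.[\w.-]+"
PHONE_REGEX = r"\b\d{2}(?: ?\d{2}){3}\b"

# We reuse the sample text defined above
for label, regex in [("EMAIL", EMAIL_REGEX), ("PHONE", PHONE_REGEX)]:
    for match in re.finditer(regex, text):
        print("Found %s: %s (offsets %i-%i)"%(label, match.group(0), match.start(), match.end()))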

One may also run another, larger (transformer-based) NER model to see whether the results improve:

Code
# loading a transformer-based NER model (the model must first be downloaded
# with "python -m spacy download en_core_web_trf")
detector = NamedEntityRecogniser("en_core_web_trf")

# Running the model
detector.display(text)

7.1 Gazetteers

If one has a long list of words/phrases that should be masked from the text, one can construct an automaton to efficiently search for their occurrences. The code below constructs such an automaton with the pyahocorasick library, which implements the Aho-Corasick search algorithm; see its documentation (external link).

Code
#!pip install pyahocorasick

import ahocorasick

# We create the automaton that will store the entries
automaton = ahocorasick.Automaton(ahocorasick.STORE_LENGTH)

# In real-world cases, this list may include many thousands of entries
long_list_of_substrings_to_detect = ["Ole Normann", "Kari Normann", "New York"]
for substring in long_list_of_substrings_to_detect:
    automaton.add_word(substring)

# We compile the automaton
automaton.make_automaton()

text = "This is a short text about Ole Normann."
for end_offset, length in automaton.iter_long(text):
    span = text[end_offset-length+1:end_offset+1]
    print("Found match:", span)

7.2 Detecting PII entities with Presidio

The Python tool Presidio provides a wide array of functionalities to detect and mask personally identifiable information in text.

Here is a minimal example for the detection part:

Code
#!pip install presidio_analyzer
#!python -m spacy download en_core_web_lg

from presidio_analyzer import AnalyzerEngine

text = "My name is Ole Normann and I am born on November 3, 1965."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

for result in results:
    print("found match:", text[result.start:result.end], "(%s)"%result.entity_type)

For non-English texts, we need to change the NLP model (based either on Spacy or Stanza) employed for the detection of named entities:

Code
#!pip install presidio_analyzer
#!python -m spacy download nb_core_news_md

from presidio_analyzer.nlp_engine import NerModelConfiguration, SpacyNlpEngine
from presidio_analyzer import AnalyzerEngine

# We define the Spacy model
configuration = [{"lang_code": "nb", "model_name": "nb_core_news_md"}]

# And the confidence scores it should return 
ner_model_configuration = NerModelConfiguration(default_score = 0.6)

nlp_engine_with_norwegian_bokmaal = SpacyNlpEngine(models=configuration, 
                                                   ner_model_configuration=ner_model_configuration)

# We create the engine with the Spacy NLP model as backbone
analyzer = AnalyzerEngine(nlp_engine=nlp_engine_with_norwegian_bokmaal,
                          supported_languages=["nb"])

text="Jeg heter Ole Normann og ble født 3. november 1965 i Mo i Rana."

# Call analyzer to get results
results = analyzer.analyze(text=text, language='nb')
for result in results:
    print("found match:", text[result.start:result.end], 
          "(%s, score=%.2f)"%(result.entity_type, result.score))

We notice in the example above that the birth date and place are not properly recognised. We can improve the recognition of dates by creating a few regular expressions covering common formats used to express dates in Norwegian:

Code
from presidio_analyzer import PatternRecognizer, Pattern

# DD. month YYYY/YY e.g. 1. januar 2019 or 10. januar 89
regex1 = r"\b(([1-9]|0[1-9]|[1-2][0-9]|3[0-1])\.\s+((J|j)anuar|(F|f)ebruar|(M|m)ars|(A|a)pril|(M|m)ai|(J|j)uni|(J|j)uli|(A|a)ugust|(S|s)eptember|(O|o)ktober|(N|n)ovember|(D|d)esember)\s+((18|19|20)[0-9]{2}|\d{2}))\b"
date_month_year_pattern = Pattern(name="date_month_year_pattern",regex=regex1, score = 0.8)

# DD.MM.YYYY or DD.MM.YY or D.MM.YYYY or D.MM.YY or DD.M.YYYY or DD.M.YY or D.M.YY or D.M.YYYY
regex2 = r"(\b(1[0-9]|2[0-9]|3[01]|0?[1-9])\.(1[0-2]|0?[1-9])\.((18|19|20)[0-9]{2}|\d{2})\b)"
date_month_year_pattern_2 = Pattern(name="date_month_year_pattern_2",regex=regex2, score = 0.5)

norwegian_date_recognizer = PatternRecognizer(supported_entity="DATE", supported_language="nb", 
                                              patterns=[date_month_year_pattern, 
                                                        date_month_year_pattern_2])
analyzer.registry.add_recognizer(norwegian_date_recognizer)

text="Jeg heter Ole Normann og ble født 3. november 1965 i Mo i Rana."

# Call analyzer to get results
results = analyzer.analyze(text=text, language='nb')
for result in results:
    print("found match:", text[result.start:result.end], 
          "(%s, score=%.2f)"%(result.entity_type, result.score))

To recognise places such as “Mo i Rana”, several alternatives are possible. Since the Norwegian Spacy model does include locations among its NER categories, we can edit the configuration to recognise those. For this, we simply need to specify a mapping between the NER categories of the Norwegian Spacy model (which can be found here) and the PII categories from Presidio (here).

Code
# Mapping between Spacy NER labels and PII labels from Presidio
label_mapping = {"PER":"PERSON", 
                 "LOC":"LOCATION",
                 "GPE_LOC":"LOCATION",
                 "GPE_ORG":"ORGANIZATION",
                 "ORG":"ORGANIZATION"}

# We set the model configuration with the score and the label mapping
ner_model_configuration = NerModelConfiguration(default_score = 0.6, 
                                                model_to_presidio_entity_mapping=label_mapping)

nlp_engine_with_norwegian_bokmaal = SpacyNlpEngine(models=configuration, 
                                                   ner_model_configuration=ner_model_configuration)

analyzer = AnalyzerEngine(nlp_engine=nlp_engine_with_norwegian_bokmaal,
                          supported_languages=["nb"])
analyzer.registry.add_recognizer(norwegian_date_recognizer)

# Call analyzer to get results
text="Jeg heter Ole Normann og ble født 3. november 1965 i Mo i Rana."

results = analyzer.analyze(text=text, language='nb')
for result in results:
    print("found match:", text[result.start:result.end], 
          "(%s, score=%.2f)"%(result.entity_type, result.score))

However, for the purpose of demonstrating how new recognizers can be built, let us assume we wish to improve the recognition of places using a gazetteer. We can build a custom EntityRecognizer that relies on a list of locations stored in a trie:

Code
import urllib.request, json 
import ahocorasick
from presidio_analyzer import EntityRecognizer, RecognizerResult
from presidio_analyzer.nlp_engine import NlpArtifacts
from typing import List
import numpy as np 

# URL of a JSON with a long list of Norwegian places
JSON_URL = "https://home.nr.no/~plison/data/places.json"

class Gazetteer(EntityRecognizer):
    """Gazetteer using an Aho-Corasick automaton (a trie-based structure)
    to search for entity occurrences"""

    def load(self):
        self.automaton = ahocorasick.Automaton(ahocorasick.STORE_LENGTH)

        # We populate the trie
        with urllib.request.urlopen(JSON_URL) as url:
            for place in json.load(url):
                self.automaton.add_word(place)
        self.automaton.make_automaton()

    def analyze(self, text: str, entities: List[str], 
                nlp_artifacts: NlpArtifacts) -> List[RecognizerResult]: 
        
        results = []
        # We search for all occurrences of entities stored in the trie in the text
        for end_offset, length in self.automaton.iter_long(text):
            
            # The automaton gives us the offsets of the occurrence
            start = end_offset-length+1
            end = end_offset + 1

            # For the score, we can assume that longer entities should have a higher score
            # We can model this with an exponential function such as 1-exp(-kx), where
            # x is the length of the entity, and k controls the convergence rate
            score = 1 - np.exp(-0.5 * (end-start))
            
            result = RecognizerResult(self.supported_entities[0], start=start, end=end, score=score)
            results.append(result)

        return results
  
gazetteer = Gazetteer(supported_entities=["LOCATION"], supported_language="nb")
analyzer.registry.add_recognizer(gazetteer)

# Call analyzer to get results
results = analyzer.analyze(text=text, language='nb')
for result in results:
    print("found match:", text[result.start:result.end], 
          "(%s, score=%.2f)"%(result.entity_type, result.score))

As we can see, the score for “Mo i Rana” has gone up to 0.99. The span is now detected by two recognition modules, the NER model from Spacy and the gazetteer we have built, and Presidio retains the highest-scoring result for overlapping matches, here the gazetteer score of 1 - exp(-0.5 * 9) ≈ 0.99 for this 9-character span.

Presidio also makes it possible to use the local context around a given entity to increase or decrease its recognition score. For instance, we can specify that the probability of an entity being a date should be increased if it follows words like “født” or “død”. This can be done by adding a LemmaContextAwareEnhancer to the setup:

Code
from presidio_analyzer.context_aware_enhancers import LemmaContextAwareEnhancer
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry

# We indicate that detected entities that have one context word in their vicinity should
# increase their confidence score by +0.4
context_aware_enhancer = LemmaContextAwareEnhancer(context_similarity_factor=0.4)

# We define the pattern recognizer, with a list of context words
# Note: the words should be lemmatized! 
norwegian_date_recognizer = PatternRecognizer(supported_entity="DATE", supported_language="nb", 
                                              patterns=[date_month_year_pattern, date_month_year_pattern_2],
                                              context=["føde", "døde"])

analyzer = AnalyzerEngine(nlp_engine=nlp_engine_with_norwegian_bokmaal,
                          supported_languages=["nb"], 
                          context_aware_enhancer=context_aware_enhancer)
analyzer.registry.add_recognizer(gazetteer)
analyzer.registry.add_recognizer(norwegian_date_recognizer)

text="Jeg heter Ole Normann og ble født 3. november 1965 i Mo i Rana."
results = analyzer.analyze(text=text, language='nb')
for result in results:
    print("found match:", text[result.start:result.end], 
          "(%s, score=%.2f)"%(result.entity_type, result.score))

7.3 Prompting LLMs to detect PII text spans

Instruction-tuned Large Language Models (LLMs) can be employed to detect personally identifiable information (PII) in text. These LLMs can be run either through (1) cloud-based commercial services or (2) locally, using available open-source models.

If one wishes to rely on commercial solutions, and is legally allowed to send the text data to cloud services (which is far from always the case), the easiest route is to consult Presidio’s documentation, which provides technical details on how cloud-based services such as Azure AI can be employed.
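
As an illustration, here is a minimal sketch of querying the Azure AI Language PII detection service through its Python SDK (this assumes the azure-ai-textanalytics package and a valid endpoint and key; refer to the Azure and Presidio documentation for the authoritative setup):

Code
#!pip install azure-ai-textanalytics

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key: replace with the values of your own Azure resource
client = TextAnalyticsClient(endpoint="https://<your-resource>.cognitiveservices.azure.com/",
                             credential=AzureKeyCredential("<your-key>"))

# The service detects PII entities in the provided documents
response = client.recognize_pii_entities(["My name is Ole Normann and I am born on November 3, 1965."],
                                         language="en")

for doc in response:
    if not doc.is_error:
        for entity in doc.entities:
            print("found match:", entity.text,
                  "(%s, score=%.2f)"%(entity.category, entity.confidence_score))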

It is of course possible to run open-source LLMs locally. As modern LLMs are optimised to run on GPUs, a machine with one or more GPUs is strongly advised. In this case, one needs to define a system prompt describing the task to complete (along with definitions of each label, a specification of the output format, and a few examples).

One rather straightforward approach is to ask the LLM to output a list of text spans containing PII along with their corresponding labels. Although one can also instruct the LLM to provide character offsets for those spans, LLMs are notoriously prone to errors when it comes to string operations, and it is often easier to simply ask for the strings containing PII and then search for those strings in the text to determine the offsets.

Here is a full example based on the instruction-tuned Llama 3 8B model:

Code
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List
from presidio_analyzer import EntityRecognizer, RecognizerResult, AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, SpacyNlpEngine, NlpArtifacts
import torch
import re, json

# This is an example of system prompt
SYSTEM_PROMPT = """You are given a text, and need to find occurrences of the following entities in it:
PERSON: mention of person names (first name and/or family name)
LOCATION: mention of physical locations (countries, cities, addresses, etc.)
ORGANIZATION: mentions of organizations (public or private companies, associations, schools, etc.)
DATE_TIME: mentions of dates or time.

The output must be a dictionary in JSON format where each occurrence is mapped to an entity label (PERSON, LOCATION,
ORGANIZATION or DATE_TIME). DO NOT OUTPUT ANYTHING ELSE THAN THE JSON DICTIONARY.

Example:
Input text: 'I am Ole Nordmann and I am born on October 7, 1965 in Mo i Rana'
Response: {"Ole Nordmann":"PERSON", "October 7, 1965":"DATE_TIME", "Mo i Rana": "LOCATION"}"""

class LLMRecognizer(EntityRecognizer):

    def load(self, model_id="meta-llama/Meta-Llama-3-8B-Instruct", 
             max_response_length = 256, score=0.7):
        
        # Loading the tokenizer and the LLM
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
        
        # Defining the special end-of-string tokens (for Llama 3)
        self.terminators = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|eot_id|>")]
        
        self.max_response_length = max_response_length
        self.score = score

    def analyze(self, text: str, entities: List[str],
                nlp_artifacts: NlpArtifacts) -> List[RecognizerResult]:
        
        # We create a short "chat" made of a system prompt and the provided text to analyse
        messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": "The text is '%s'"%text}]

        # We then tokenize the chat content and load the result on the GPU
        input_ids = self.tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                                       return_tensors="pt").to(self.model.device)
        
        # We generate the response (with a max number of tokens of self.max_response_length)
        outputs = self.model.generate(input_ids, max_new_tokens=self.max_response_length, eos_token_id=self.terminators)
        
        # We truncate the output to just keep the response, not the prompt
        response_ids = outputs[0][input_ids.shape[-1]:]

        # We convert the response back to tokens (instead of token ids)
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)
        
        # Assuming the response is formatted in JSON, we extract its content
        json_content = json.loads("{" + response.split("{")[1].split("}")[0] + "}")

        results = []
        for mention, category in json_content.items():
            
            # We search for the occurrence of each string found by the LLM
            for match in re.finditer(re.escape(mention), text):

                # And we create a separate recognition result for it
                result = RecognizerResult(entity_type=category, start=match.start(0), end=match.end(0), score=self.score)
                results.append(result)

        return results

# We define the analyzer, and deactivate (for demonstration purposes) the Spacy NER model
ner_model_config = NerModelConfiguration(default_score=0.65, labels_to_ignore=["PERSON", "LOCATION", "DATE_TIME", "ORGANIZATION"])
analyzer = AnalyzerEngine(nlp_engine=SpacyNlpEngine(ner_model_configuration=ner_model_config))
                          
# We load the LLM-based recognizer module
recognizer = LLMRecognizer(supported_language="en", supported_entities=["PERSON", "LOCATION", "ORGANIZATION", "DATE_TIME"])
analyzer.registry.add_recognizer(recognizer)

text = "Hello there, what's up? I'm Ole Nordmann. I used to live in Oslo, but I have now moved to Mo i Rana. This message was written on September 22, and I work for Telenor."

results = analyzer.analyze(text=text, language="en")

# Call analyzer to get results
for result in results:
    print("found match:", text[result.start:result.end],
          "(%s, score=%.2f)"%(result.entity_type, result.score))

7.4 Using Presidio to mask detected PII spans

Presidio provides functionalities for editing out the PII spans that were previously detected:

Code
#!pip install presidio_analyzer presidio_anonymizer

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Ole Normann and I am born on November 3, 1965."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

for result in results:
    print("found match:", text[result.start:result.end], "(%s)"%result.entity_type)

# Initialize the engine:
engine = AnonymizerEngine()

edited = engine.anonymize(text=text, analyzer_results=results)

print("De-identified text:", edited.text)

One can also define more complex, custom operators. For instance, one may wish to mask the day and month of DATE_TIME entities while keeping the year:

Code
#!pip install presidio_analyzer presidio_anonymizer

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
import re

text = "My name is Ole Normann and I am born on November 3, 1965."

analyzer = AnalyzerEngine()
results = analyzer.analyze(text=text, language="en")

for result in results:
    print("found match:", text[result.start:result.end], "(%s)"%result.entity_type)

# Initialize the engine:
engine = AnonymizerEngine()

def de_identify_full_date(x):
    """Replaces the date with <DATE>, keeping the year when the
    detected string ends with four digits"""
    if re.match(r"\d{4}", x[-4:]):
        return "<DATE> " + x[-4:]
    else:
        return "<DATE>"

operators = {"DATE_TIME": OperatorConfig("custom", {"lambda": de_identify_full_date})}

edited = engine.anonymize(text=text, analyzer_results=results, operators=operators)

print("De-identified text:", edited.text)