3  Building custom entity recognition models

We can also take advantage of Large Language Models (LLMs) to detect more complicated types of PII. The easiest approach is to simply prompt an existing LLM. If that does not work well enough, we can also fine-tune an open-source LLM based on labelled data.

What if you need to mask personally identifiable information that cannot be easily detected with named entities, regular expressions or gazetteers? In this case, the best option is to build a custom model that automatically detects the entities you wish to mask in your documents. We explain below the techniques that can be used to achieve this.

3.1 Direct prompting

The simplest approach is to directly prompt a Large Language Model (LLM) such as Claude or ChatGPT. You will need to provide a clear description of the types of entities you wish to detect, along with a clear specification of how the output should be formatted. Furthermore, providing a few examples of entities for each category is known to substantially improve the results (enabling what is technically called “in-context learning”). This approach is preferable if you only have a small set of examples for the entities you wish to detect.

Prompting can be done using either (1) cloud-based commercial solutions or (2) open-source LLMs running locally.

3.1.1 Prompting using cloud services

Caution

Before using cloud services to detect PII in text, you need to ensure you are legally able to transfer the data to the cloud service provider. If you are unsure, ask your Data Protection Officer.

Several cloud service providers, such as Azure AI, provide ready-made tools for detecting PII. One can also call LLM services such as Claude, ChatGPT or Gemini and instruct them to detect PII in the document. A good starting point is to look at Presidio’s documentation, which provides technical details on how cloud-based services such as Azure AI can be employed.
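
As an illustration, here is a minimal sketch of prompting an LLM service through its Python SDK. It uses the OpenAI client; the model name and the prompt wording are merely illustrative choices, not recommendations:

Code
from openai import OpenAI

# The client reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

SYSTEM_PROMPT = """You are given a text, and need to find occurrences of PERSON, LOCATION,
ORGANIZATION and DATE_TIME entities in it. The output must be a dictionary in JSON format
mapping each occurrence to its entity label. Do not output anything else."""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": "The text is 'I am Ole Nordmann and I live in Mo i Rana'"}],
)

# The response should contain a JSON dictionary such as
# {"Ole Nordmann": "PERSON", "Mo i Rana": "LOCATION"}
print(response.choices[0].message.content)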

3.1.2 Prompting using open-source LLMs

It is of course also possible to run open-source LLMs locally. This strategy has a number of benefits: it does not require sending any data to third parties, and it provides better control over the processing pipeline. As modern LLMs are optimised to run on GPUs, a machine with one or more GPUs is, however, strongly advised.

In both cases, one needs to define a system prompt describing the task to complete, along with definitions of each label, a specification of the output format, and a few examples. One rather straightforward approach is to ask the LLM to output a list of text spans containing PII along with their corresponding labels. One can also instruct the LLM to provide character offsets for those text spans, but be aware that language models are not particularly good at this task and may therefore produce erroneous offsets. A common choice is to ask the LLM to encode those PII spans and their corresponding labels in JSON format.

Here is a full example based on the instruction-tuned Llama 3 8B model:

Code
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List
from presidio_analyzer import EntityRecognizer, RecognizerResult, AnalyzerEngine
from presidio_analyzer.nlp_engine import NerModelConfiguration, SpacyNlpEngine, NlpArtifacts
import torch
import re, json

# This is an example of system prompt
SYSTEM_PROMPT = """You are given a text, and need to find occurrences of the following entities in it:
PERSON: mention of person names (first name and/or family name)
LOCATION: mention of physical locations (countries, cities, addresses, etc.)
ORGANIZATION: mentions of organizations (public or private companies, associations, schools, etc.)
DATE_TIME: mentions of dates or time.

The output must be a dictionary in JSON format where each occurrence is mapped to an entity label (PERSON, LOCATION,
ORGANIZATION or DATE_TIME). DO NOT OUTPUT ANYTHING ELSE THAN THE JSON DICTIONARY.

Example:
Input text: 'I am Ole Nordmann and I am born on October 7, 1965 in Mo i Rana'
Response: {"Ole Nordmann":"PERSON", "October 7, 1965":"DATE_TIME", "Mo i Rana": "LOCATION"}"""

class LLMRecognizer(EntityRecognizer):

    def load(self, model_id="meta-llama/Meta-Llama-3-8B-Instruct", 
             max_response_length = 256, score=0.7):
        
        # Loading the tokenizer and the LLM
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
        
        # Defining the special end-of-string tokens (for Llama 3)
        self.terminators = [self.tokenizer.eos_token_id, self.tokenizer.convert_tokens_to_ids("<|eot_id|>")]
        
        self.max_response_length = max_response_length
        self.score = score

    def analyze(self, text: str, entities: List[str],
                nlp_artifacts: NlpArtifacts) -> List[RecognizerResult]:
        
        # We create a short "chat" made of a system prompt and the provided text to analyse
        messages = [{"role": "system", "content": SYSTEM_PROMPT}, {"role": "user", "content": "The text is '%s'"%text}]

        # We then tokenize the chat content and load the result on the GPU
        input_ids = self.tokenizer.apply_chat_template(messages, add_generation_prompt=True,
                                                       return_tensors="pt").to(self.model.device)
        
        # We generate the response (with a max number of tokens of self.max_response_length)
        outputs = self.model.generate(input_ids, max_new_tokens=self.max_response_length, eos_token_id=self.terminators)
        
        # We truncate the output to just keep the response, not the prompt
        response_ids = outputs[0][input_ids.shape[-1]:]

        # We convert the response back to tokens (instead of token ids)
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)
        
        # Assuming the response is formatted in JSON, we extract its content
        json_content = json.loads("{" + response.split("{")[1].split("}")[0] + "}")

        results = []
        for mention, category in json_content.items():
            
            # We search for the occurrence of each string found by the LLM
            for match in re.finditer(re.escape(mention), text):

                # And we create a separate recognition result for it
                result = RecognizerResult(entity_type=category, start=match.start(0), end=match.end(0), score=self.score)
                results.append(result)

        return results

# We define the analyzer, and ignore (for demonstration purposes) the PII labels predicted by the spaCy NER model
ner_model_config = NerModelConfiguration(default_score=0.65, labels_to_ignore=["PERSON", "LOCATION", "DATE_TIME", "ORGANIZATION"])
analyzer = AnalyzerEngine(nlp_engine=SpacyNlpEngine(ner_model_configuration=ner_model_config))
                          
# We load the LLM-based recognizer module
recognizer = LLMRecognizer(supported_language="en", supported_entities=["PERSON", "LOCATION", "ORGANIZATION", "DATE_TIME"])
analyzer.registry.add_recognizer(recognizer)

text = "Hello there, what's up? I'm Ole Nordmann. I used to live in Oslo, but I have now moved to Mo i Rana. This message was written on September 22, and I work for Telenor."

results = analyzer.analyze(text=text, language="en")

# Print out each detected PII span
for result in results:
    print("found match:", text[result.start:result.end],
          "(%s, score=%.2f)"%(result.entity_type, result.score))

3.2 Fine-tuning

If direct prompting does not provide you with results of sufficient quality, you may also consider fine-tuning an existing (open-source) language model. A prerequisite is that you have access to texts that have been annotated (typically by human annotators) with the PII categories that you wish to detect.
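
The exact annotation format may vary, but a common choice is a token-level BIO encoding. The (made-up) sentence below illustrates it as a simple Python list:

Code
# One sentence from a hypothetical annotated corpus, in token-level BIO format:
# "B-X" marks the beginning of an entity of type X, "I-X" its continuation, and "O" a non-PII token
annotated_sentence = [
    ("Ole", "B-PERSON"), ("Nordmann", "I-PERSON"), ("moved", "O"), ("to", "O"),
    ("Mo", "B-LOCATION"), ("i", "I-LOCATION"), ("Rana", "I-LOCATION"),
    ("on", "O"), ("October", "B-DATE_TIME"), ("7", "I-DATE_TIME"), (".", "O"),
]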

Several alternatives are possible:

  • If efficiency is an important concern, you can fine-tune FLAIR embeddings on your annotated corpus and use the resulting embeddings to detect those entities (if you are using Presidio, see here for an example of how to use FLAIR within this framework); a training sketch is given after this list.

  • You can also fine-tune a BERT/RoBERTa model (or a similar encoder-only language model) for a token classification task, as explained here; a minimal sketch is also given after this list. In terms of computational cost, this is slightly more demanding than fine-tuning FLAIR embeddings, but still much easier than fine-tuning a generative LLM.

  • If you have access to enough computing power, you can fine-tune a larger, decoder-only LLM for the task. A challenge is that so-called “causal” LLMs are not well suited to token classification tasks, and often yield worse performance than smaller encoder-type models such as BERT/RoBERTa. However, approaches such as the one of Li et al. (2023) have shown that those models can be adapted to such classification tasks by removing the causal attention mask. See the code here to fine-tune models such as Llama or Mistral for token classification.

  • PEFT-based approaches: Instead of seeking to directly classify tokens into PII labels, one can adopt an approach similar to the prompting strategy above, and fine-tune an LLM to generate output strings containing the list of PII spans found in the input text. However, directly fine-tuning modern LLMs is computationally prohibitive, leading to the development of various techniques for parameter-efficient fine-tuning (PEFT). Two common families of solutions are prompt-based methods and LoRA methods. Briefly summarised, prompt-based methods leave the LLM parameters untouched, but rather extend the initial input prompt with additional (learned) vectors. In contrast, LoRA methods do modify the model itself, but only through the insertion of small, low-rank (learned) matrices, keeping the rest of the LLM weights frozen. See the PEFT section of the HuggingFace documentation for details on those methods; a LoRA sketch is given after this list.

  • Finally, if you only have access to a modest number of sentences with annotated PII, the easiest option is probably to rely on in-context learning, which simply refers to the inclusion of input-output examples (so-called few-shot examples) as part of the prompt. For instance, the prompting code in the previous section included one example of input text coupled with the desired response. Depending on the maximum context length allowed by the LLM, the prompt may include dozens of examples. Note that the examples provided as part of the prompt do not need to always be the same, and may be dynamically selected given the text to analyse (see e.g. An et al. (2023)); a sketch of such dynamic example selection is given after this list.
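
To illustrate the first alternative, here is a minimal sketch of training a FLAIR sequence tagger with FLAIR embeddings. It assumes your annotated corpus is stored in the two-column token/BIO-tag format shown earlier; the directory and file names are placeholders:

Code
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# "data/" is a placeholder directory containing train/dev/test files in
# two-column (token, BIO-tag) format
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
label_dict = corpus.make_label_dictionary(label_type="ner")

# Contextual string (FLAIR) embeddings feeding the sequence tagger
embeddings = StackedEmbeddings([FlairEmbeddings("news-forward"),
                                FlairEmbeddings("news-backward")])

tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=label_dict, tag_type="ner")

trainer = ModelTrainer(tagger, corpus)
trainer.train("models/pii-flair", learning_rate=0.1, mini_batch_size=32, max_epochs=10)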
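
For the second alternative, the sketch below fine-tunes a RoBERTa model for token classification with the HuggingFace transformers library. The dataset file name is a placeholder, and the BIO label set is an assumption matching the PII categories used in this chapter:

Code
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, TrainingArguments, Trainer)

# BIO label set matching the PII categories used in this chapter
labels = ["O", "B-PERSON", "I-PERSON", "B-LOCATION", "I-LOCATION",
          "B-ORGANIZATION", "I-ORGANIZATION", "B-DATE_TIME", "I-DATE_TIME"]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained(
    "roberta-base", num_labels=len(labels), id2label=id2label, label2id=label2id)

# "pii_train.json" is a placeholder: one record per sentence, with a "tokens" list
# and a "ner_tags" list containing the label index of each token
dataset = load_dataset("json", data_files={"train": "pii_train.json"})["train"]

def tokenize_and_align(example):
    # Tokenise into sub-words and assign each sub-word the label of its original token
    encoding = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    encoding["labels"] = [-100 if word_id is None else example["ner_tags"][word_id]
                          for word_id in encoding.word_ids()]
    return encoding

tokenized = dataset.map(tokenize_and_align, remove_columns=dataset.column_names)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="pii-roberta", num_train_epochs=3),
                  train_dataset=tokenized,
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()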
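
For the PEFT-based approaches, the sketch below wraps an LLM with LoRA adapters using the HuggingFace peft library; the choices of rank, scaling factor and target modules are merely illustrative. The resulting model can then be fine-tuned (for instance with a standard Trainer loop) on prompt/response pairs such as the JSON outputs described in Section 3.1.2:

Code
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# LoRA configuration: only the small low-rank matrices inserted into the attention
# projections are trained, while the original LLM weights remain frozen
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                     # rank of the inserted matrices (illustrative value)
    lora_alpha=32,            # scaling factor (illustrative value)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well below 1% of the full model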
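
Finally, here is a sketch of how few-shot examples could be dynamically selected with sentence embeddings; the example pool, the embedding model and the helper function are all hypothetical:

Code
from sentence_transformers import SentenceTransformer, util

# Hypothetical pool of annotated examples (input text + desired JSON response)
examples = [
    {"text": "I am Ole Nordmann and I am born on October 7, 1965 in Mo i Rana",
     "response": '{"Ole Nordmann":"PERSON", "October 7, 1965":"DATE_TIME", "Mo i Rana":"LOCATION"}'},
    {"text": "Kari Hansen works for Telenor in Oslo",
     "response": '{"Kari Hansen":"PERSON", "Telenor":"ORGANIZATION", "Oslo":"LOCATION"}'},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
example_embeddings = encoder.encode([ex["text"] for ex in examples], convert_to_tensor=True)

def select_examples(text, k=2):
    """Returns the k annotated examples most similar to the text to analyse."""
    query_embedding = encoder.encode(text, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, example_embeddings)[0]
    top_indices = scores.topk(min(k, len(examples))).indices
    return [examples[int(i)] for i in top_indices]

# The selected examples can then be appended to the system prompt from Section 3.1.2
for ex in select_examples("Per Olsen moved to Bergen last year"):
    print(ex["text"], "->", ex["response"])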