8  Related topics

Privacy-enhancing NLP is an active research area, and the present guide has left out many important topics, which we briefly describe here.

8.1 Author obfuscation

The present guide has focused on the problem of reducing the risk of disclosing personal information on the person(s) referred to in a document. We may, however, sometimes need to also protect the author of a given document.

In this case, simply masking text spans that express personal identifiers is not sufficient. Indeed, the writing style of the author may reveal their identity even in the absence of any explicit personal identifier. This style may be inferred from a variety of signals, ranging from the author’s choice of words or expressions to their proficiency in the language in question (if they are non-native speakers), or even more subtle cues such as the frequencies of function words throughout the text. The identification of such cues is studied in the field of authorship attribution (see Coulthard, Johnson, and Wright 2016).
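
To give a concrete sense of the kind of stylometric signal attribution systems exploit, here is a minimal sketch that turns a text into a profile of function-word frequencies. The word list and example text are purely illustrative; real stylometric systems rely on much richer feature sets (character n-grams, part-of-speech patterns, punctuation habits, etc.).

```python
from collections import Counter
import re

# Small illustrative list of English function words (an assumption for this
# sketch); authorship attribution systems typically use hundreds of features.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "is",
                  "was", "for", "on", "with", "as", "but", "not"]

def function_word_profile(text: str) -> dict:
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return {w: counts[w] / total for w in FUNCTION_WORDS}

# Texts by the same author tend to yield more similar profiles than texts
# by different authors, even when no personal identifier is present.
profile = function_word_profile(
    "It was the best of times, it was the worst of times."
)
print(profile)
```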

To protect the identity of the author, one may employ techniques developed in the area of authorship obfuscation, which is the inverse problem of authorship attribution. The goal of those techniques is to edit the text so as to conceal the stylistic cues associated with the author, while keeping the content conveyed by the text as close as possible to the original. Some obfuscation approaches also seek to conceal demographic attributes (such as gender or ethnicity) instead of the author identity (see Elazar and Goldberg 2018).

The interested reader is invited to look at approaches such as Xing et al. (2024).

8.2 Privacy-preserving text synthesis

Another active research area is how to generate synthetic documents that bear similar properties to a set of input documents, but without revealing any personal information. A popular privacy model that can be applied to this task is differential privacy (DP), which is a framework for ensuring the privacy of individuals in datasets (Dwork et al. 2006). Differential privacy essentially operates by producing randomized responses to queries. The level of artificial noise introduced in each response is calibrated to guarantee that the amount of information that can be learned about any single individual remains below a given threshold. A recent application of DP to text generation is the DP-BART approach presented by Igamberdiev and Habernal (2023).
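
To make the core mechanism of differential privacy concrete, the sketch below answers a counting query with the classical Laplace mechanism; the dataset, query and epsilon value are illustrative assumptions, not taken from the papers cited above.

```python
import numpy as np

def dp_count(records, predicate, epsilon=1.0):
    """Answer a counting query with epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one individual
    changes the result by at most 1), so Laplace noise with scale 1/epsilon
    yields an epsilon-DP answer.
    """
    rng = np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative query: how many patients in the dataset are over 60?
patients = [{"age": 34}, {"age": 72}, {"age": 65}, {"age": 58}]
print(dp_count(patients, lambda p: p["age"] > 60, epsilon=0.5))
```

Smaller epsilon values inject more noise and thus provide stronger privacy, at the cost of less accurate answers.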

8.3 De-identification of rich text documents

Many documents do not consist solely of raw text but have a layout structure in which the position of a token on the page matters, and they may also contain images or other non-textual elements. Layout-aware models such as Donut, LiLT or LayoutLM can be fine-tuned on such visually rich documents, for instance to detect personal identifiers while taking the page layout into account.
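
As a minimal sketch (not a full pipeline), de-identification of a visually rich document can be framed as token classification with a layout-aware model. The checkpoint name below is a real LayoutLMv3 model on the Hugging Face hub, but the PII label set, the dummy page content and the bounding boxes are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

labels = ["O", "B-PERSON", "I-PERSON", "B-DATE", "I-DATE"]  # assumed PII label set
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=len(labels)
)

# One dummy page: words, their bounding boxes (normalized to the 0-1000 range)
# and a blank placeholder image standing in for the scanned page.
words = ["Patient:", "John", "Doe", "admitted", "12.03.2021"]
boxes = [[50, 50, 150, 80], [160, 50, 220, 80], [230, 50, 290, 80],
         [50, 100, 150, 130], [160, 100, 260, 130]]
image = Image.new("RGB", (1000, 1000), "white")

encoding = processor(image, words, boxes=boxes, return_tensors="pt")
with torch.no_grad():
    logits = model(**encoding).logits      # shape: (1, sequence_length, num_labels)
predictions = logits.argmax(-1).squeeze().tolist()
# After fine-tuning on annotated documents, these predictions would indicate
# which tokens to mask, using both the text and its position on the page.
```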

8.4 Speech anonymization

Dealing with speech recordings instead of text opens up another layer of complexity, as the human voice is in itself a personal identifier: speakers can often be recognized from their voice alone. Techniques used in speech anonymization include voice conversion, speech synthesis, and other audio processing methods that modify voice characteristics such as pitch, timbre, and speaking style. The challenge lies in effectively disguising the speaker’s identity without losing important linguistic and emotional cues that are essential for communication. See Champion (2023) for a survey of speech anonymization techniques.
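
As a toy illustration of one voice-modification step, the sketch below pitch-shifts a recording with librosa. The file names are placeholders, and such a simple transformation is by no means sufficient for robust anonymization; real pipelines rely on voice conversion or speech synthesis.

```python
import librosa
import soundfile as sf

# Load a recording (placeholder file name), raise the pitch by 4 semitones,
# and write the transformed audio back to disk.
y, sr = librosa.load("interview.wav", sr=None)
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)
sf.write("interview_modified.wav", y_shifted, sr)
```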

8.5 Privacy-aware model training

Can we adapt the training or fine-tuning of large language models such that the model cannot reveal personal information about any individual in the dataset? Miranda et al. (2024) discusses how LLMs can threaten privacy, especially given the large volumes of private data that can be found online, and reviews a number of technical solutions for avoiding privacy leakages. A common strategy is to add noise to the training data or the model’s parameters to obscure individual data points, with differential privacy used to determine the amount of noise to add.
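
The sketch below illustrates the core of this idea in the style of DP-SGD: each example’s gradient is clipped, Gaussian noise is added to the sum, and the noisy average is used for the parameter update. It is shown on a toy logistic-regression step for readability; it is an illustrative sketch, not the method of any specific paper cited above, and the hyperparameter values are arbitrary.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One DP-SGD-style update for logistic regression.

    Per-example gradients are clipped to `clip_norm`, summed, perturbed with
    Gaussian noise scaled by `noise_multiplier * clip_norm`, and averaged.
    """
    rng = rng or np.random.default_rng()
    grads = []
    for xi, yi in zip(X, y):
        pred = 1.0 / (1.0 + np.exp(-xi @ w))
        g = (pred - yi) * xi                                     # per-example gradient
        g *= min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))   # clip its norm
        grads.append(g)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (np.sum(grads, axis=0) + noise) / len(X)
    return w - lr * noisy_mean

# Toy data: 8 examples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(3)
for _ in range(100):
    w = dp_sgd_step(w, X, y, rng=rng)
print(w)
```

The clipping bound limits how much any single example can influence an update, and the noise masks the remaining contribution; together they are what allows a differential-privacy guarantee to be computed for the trained model.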