The CLEANUP guide

A practical guidebook for text sanitisation

Author

NR and NAV through the CLEANUP-project

Published

October 7, 2024

Preface

Text is everywhere and surrounds us every day. Sometimes this text contains personal information you want removed. This guide tells you how to proceed.

While well-established anonymisation techniques from statistical disclosure control and differential privacy largely focus on structured data, such as tabular data in databases, the anonymisation requirements for unstructured data remain largely unclear and underaddressed.

Meanwhile, rapid advances in natural language processing have paved the way for the development of new methods and tools to identify disclosive elements in text and to subsequently edit the text to reduce the risk of disclosing personal information and preserve confidentiality.

This guide aims to give the practitioner insights into some of the key issues and challenges, along with guidance on best practices given the current state of research in this quickly evolving field.

Note of caution

Although there are many methods and packages claiming to ensure anonymity in text, there is no widely recognised method or well-established practice to ensure that a text is (legally) anonymous. However, there is still much you can do to reduce the risk of a text revealing personal information. This guide provides an overview of the methods and tools available and highlights best practices from the forefront of text-sanitisation research today. It does not, however, provide any guarantees, and the user is urged to consult local legal and domain experts to ensure sufficient compliance in their particular case.

This work is a collaboration between Norsk Regnesentral and NAV through the CLEANUP project, funded by a grant from the Research Council of Norway.