PII Masking With Python Machine Learning Model

PII (Personally Identifiable Information) masking is a technique for hiding sensitive data, such as names, before a text is shared. A machine learning model can handle the detection for us. In one of my projects, I had to send text to an LLM (Large Language Model) and retrieve the sentiment of that text. Before sending the text to the LLM, I had to mask the sensitive data in it. After some research, I found a model called ‘distilbert_finetuned_ai4privacy_v2’ on Hugging Face. In this article, I won’t cover how to build PII masking from scratch. Instead, I’ll explain how I used this model for PII masking.

The distilbert_finetuned_ai4privacy_v2 model is fine-tuned from distilbert/distilbert-base-uncased. You can find the datasets used to train it, the hyperparameters, a demo, and more details on its page. Its authors have also created a Python wrapper (ai4p) for this model. First, I tried the model through the Inference API. This requires a Hugging Face access token. Log in to your Hugging Face account, go to the access tokens page, and copy the token to the clipboard.

# Using Inference API
import requests

API_URL = "https://api-inference.huggingface.co/models/Isotonic/distilbert_finetuned_ai4privacy_v2"
# Replace YOUR_ACCESS_TOKEN with the token copied from your Hugging Face account
headers = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}

payload = {
    "inputs": "My name is Sarah Jessica Parker but you can call me Jessica"
}

# Send the text to the hosted model and print one detected entity per line
response = requests.post(API_URL, headers=headers, json=payload)
masked_data = response.json()
print(*masked_data, sep="\n")


# Output
# {'entity_group': 'FIRSTNAME', 'score': 0.9936829805374146, 'word': 'sarah', 'start': 11, 'end': 16}
# {'entity_group': 'MIDDLENAME', 'score': 0.9710789322853088, 'word': 'jessica', 'start': 17, 'end': 24}
# {'entity_group': 'MIDDLENAME', 'score': 0.9435796737670898, 'word': 'parker', 'start': 25, 'end': 31}
# {'entity_group': 'FIRSTNAME', 'score': 0.7463927268981934, 'word': 'jessica', 'start': 52, 'end': 59}
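The request above assumes the call succeeds and returns a list. As a minimal sketch, the response could be checked before printing. The helper name parse_inference_response is hypothetical, and the sketch assumes that failures arrive as a JSON object with an "error" key, which you should confirm against the responses you actually receive.

```python
from typing import Any, Dict, List


def parse_inference_response(data: Any) -> List[Dict[str, Any]]:
    """Return the entity list, or raise if the API sent something else."""
    # Assumption: failures arrive as a JSON object with an "error" key.
    if isinstance(data, dict) and "error" in data:
        raise RuntimeError(f"Inference API error: {data['error']}")
    if not isinstance(data, list):
        raise ValueError(f"Unexpected response type: {type(data).__name__}")
    return data


# Example with a hand-written payload (no network call is made here).
ok = parse_inference_response(
    [{"entity_group": "FIRSTNAME", "score": 0.99, "word": "sarah", "start": 11, "end": 16}]
)
print(ok[0]["entity_group"])
```

In the request code, this would wrap response.json() so that an error object is reported clearly instead of being printed as if it were a list of entities.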

In this method, I used the hosted version of the model. I sent a POST request containing the text to mask. The model returned each detected PII item with its class (entity_group), a confidence score, and its start and end character indexes in the input. Note that the word field is lowercased because the base model is uncased. While this method worked well, my project required an offline approach. Although the authors provide a Python module for offline use along with some documentation, it didn’t work for me, so I had to look for an alternative. After some searching, I found one. First, I installed the Transformers and PyTorch libraries.

pip install transformers torch

After installing the libraries, I downloaded the model using the transformers library's pipeline() function, passing the task and the model name as parameters. A Hugging Face task names the kind of problem a model solves. Here it is "token-classification", which assigns a label to each token in the text. pipeline() first checks for the model in the local cache. If the model is not found there, it downloads it so that later runs can use the cached copy. On Windows, the downloaded model is saved by default in the directory "C:\Users\<USERPROFILE>\.cache\huggingface\hub".
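To check where the model ended up on your machine, the cache location can be approximated with a few lines. This is a simplified sketch, not the library's own resolution logic: the helper name default_hub_cache is hypothetical, and it only considers the HF_HUB_CACHE and HF_HOME environment variables before falling back to the default under the home directory.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def default_hub_cache(env: Optional[Mapping[str, str]] = None) -> Path:
    """Best-effort guess at the Hugging Face hub cache directory."""
    env = os.environ if env is None else env
    # HF_HUB_CACHE, when set, points directly at the hub cache.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    # HF_HOME, when set, is the parent directory; the hub cache sits in "hub".
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]) / "hub"
    # Otherwise fall back to ~/.cache/huggingface/hub.
    return Path.home() / ".cache" / "huggingface" / "hub"


print(default_hub_cache())
```

If the printed directory contains a folder for the model after the first run, the download has been cached.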

from transformers import pipeline


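MODEL_TAG = "Isotonic/distilbert_finetuned_ai4privacy_v2"
DEVICE = -1  # -1 runs on the CPU; 0 would select the first GPU

# Downloads the model on the first call, then loads it from the local cache
model = pipeline("token-classification", model=MODEL_TAG, tokenizer=MODEL_TAG, device=DEVICE)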

Next, I provided text to the model. It detected PII classes in the text and returned the detected results, similar to the previous method.

unmasked_text = "My name is Sarah Jessica Parker but you can call me Jessica"

detected_pii_result = model(unmasked_text, aggregation_strategy="simple")
print(*detected_pii_result, sep="\n")

# Output
# {'entity_group': 'FIRSTNAME', 'score': 0.9936829805374146, 'word': 'sarah', 'start': 11, 'end': 16}
# {'entity_group': 'MIDDLENAME', 'score': 0.9710789322853088, 'word': 'jessica', 'start': 17, 'end': 24}
# {'entity_group': 'MIDDLENAME', 'score': 0.9435796737670898, 'word': 'parker', 'start': 25, 'end': 31}
# {'entity_group': 'FIRSTNAME', 'score': 0.7463927268981934, 'word': 'jessica', 'start': 52, 'end': 59}
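Each detection comes with a score, so one option is to keep only the detections above a chosen confidence. The sketch below is not part of my project code. The helper name filter_by_score and the 0.8 threshold are assumptions for illustration, and the sample list copies the output shown above with rounded scores.

```python
from typing import Any, Dict, List


def filter_by_score(entities: List[Dict[str, Any]], min_score: float = 0.8) -> List[Dict[str, Any]]:
    """Keep only the detections whose score is at least min_score."""
    return [ent for ent in entities if ent["score"] >= min_score]


# Sample detections copied from the output above (scores rounded).
sample = [
    {"entity_group": "FIRSTNAME", "score": 0.9937, "word": "sarah", "start": 11, "end": 16},
    {"entity_group": "MIDDLENAME", "score": 0.9711, "word": "jessica", "start": 17, "end": 24},
    {"entity_group": "MIDDLENAME", "score": 0.9436, "word": "parker", "start": 25, "end": 31},
    {"entity_group": "FIRSTNAME", "score": 0.7464, "word": "jessica", "start": 52, "end": 59},
]

kept = filter_by_score(sample, min_score=0.8)
print([ent["word"] for ent in kept])
# ['sarah', 'jessica', 'parker']
```

A higher threshold reduces false positives but can let real PII through, as the dropped second "jessica" shows. For masking, a low threshold (or none at all) is usually the safer choice.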

I wanted to replace each piece of PII with a mask. First, I needed to map each detected value to its class. To do this, I wrote a function that builds a dictionary mapping detected words to entity groups. It slices the original text with the start and end indexes, so the keys keep their original casing.

def create_entity_map(model_output, text):
    entity_map = {}
    for token in model_output:
        start = token["start"]
        end = token["end"]
        # Slice the original text so the key keeps its original casing
        entity = text[start:end]
        # A repeated word keeps the class of its last occurrence
        entity_map[entity] = token["entity_group"]
    return entity_map

create_entity_map(detected_pii_result, unmasked_text)
# Output
# {'Sarah': 'FIRSTNAME', 'Jessica': 'FIRSTNAME', 'Parker': 'MIDDLENAME'}

Next, I needed to replace each piece of PII with its class name. To do this, I wrote another function that replaces the words in the text with the masks from the map. Because str.replace() replaces every occurrence of a word, both occurrences of "Jessica" receive the same label.

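def replace_entities(text, entity_map):
    for word in entity_map:
        if word in text:
            # Replace every occurrence of the word with its class name in brackets
            text = text.replace(word, f"[{entity_map[word]}]")
    return text

entity_map = create_entity_map(detected_pii_result, unmasked_text)
print(replace_entities(unmasked_text, entity_map))

# Output
# My name is [FIRSTNAME] [FIRSTNAME] [MIDDLENAME] but you can call me [FIRSTNAME]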
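The word-based replacement gives every occurrence of a word the same label, and it would also replace a detected word that appears inside a longer word. An alternative, which is not the approach I used in my project, is to replace by the start and end offsets that the model returns. The sketch below assumes the output format shown earlier. The name mask_by_offsets is hypothetical.

```python
from typing import Any, Dict, List


def mask_by_offsets(text: str, entities: List[Dict[str, Any]]) -> str:
    """Replace each detected span with its class name, using character offsets."""
    # Work from the end of the text so that earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        mask = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + mask + text[ent["end"]:]
    return text


sample_text = "My name is Sarah Jessica Parker but you can call me Jessica"
# Offsets and classes copied from the model output shown earlier.
sample_entities = [
    {"entity_group": "FIRSTNAME", "start": 11, "end": 16},
    {"entity_group": "MIDDLENAME", "start": 17, "end": 24},
    {"entity_group": "MIDDLENAME", "start": 25, "end": 31},
    {"entity_group": "FIRSTNAME", "start": 52, "end": 59},
]

masked = mask_by_offsets(sample_text, sample_entities)
print(masked)
# My name is [FIRSTNAME] [MIDDLENAME] [MIDDLENAME] but you can call me [FIRSTNAME]
```

Here each occurrence keeps the class the model assigned to it, so the first "Jessica" becomes [MIDDLENAME] and the second becomes [FIRSTNAME].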

This is how I used the distilbert_finetuned_ai4privacy_v2 model to mask PII in my project. The model showed good accuracy in my tests, and its contributors continue to fine-tune it. Below is the final code, slightly reformatted.

from typing import List, Optional
from transformers import pipeline, Pipeline


def load_model(model_tag: str, use_gpu: bool = False) -> Optional[Pipeline]:
    device = 0 if use_gpu else -1
    try:
        model = pipeline("token-classification", model=model_tag, tokenizer=model_tag, device=device)
        return model
    except Exception as e:
        print(f"Error loading Model: \n\n{e}")
        return None


def create_entity_map(model_output: List[dict], text: str) -> dict:
    entity_map = {}
    for token in model_output:
        start = token["start"]
        end = token["end"]
        entity = text[start: end]
        entity_map[entity] = token["entity_group"]
    return entity_map


def replace_entities(text: str, entity_map: dict) -> str:
    for word in entity_map:
        if word in text:
            text = text.replace(word, f"[{entity_map[word]}]")
    return text


def mask_pii(input_sentence: str, anonymizer: Pipeline) -> Optional[str]:
    # Detect PII, grouping adjacent tokens that belong to the same entity
    output = anonymizer(input_sentence, aggregation_strategy="simple")
    if isinstance(output, list):
        entity_map = create_entity_map(output, input_sentence)
        return replace_entities(input_sentence, entity_map)
    # Anything other than a list of entities is unexpected
    print("Output is not in the expected format")
    return None


# Example usage:
anonymizer_model = load_model("Isotonic/distilbert_finetuned_ai4privacy_v2")
if anonymizer_model:
    masked_text = mask_pii("My name is Sarah Jessica Parker but you can call me Jessica", anonymizer_model)
    print(masked_text)

That's all for now. I hope this article will help you. See you in another article. Happy Coding!

You can connect with me on https://hirushafernando.com/
