Experimenting with Automatic PII Detection on the Hub using Presidio

At Hugging Face, we have noticed a concerning trend in machine learning (ML) datasets hosted on our Hub: undocumented private information about individuals. This poses some unique challenges for ML practitioners.
In this blog post, we'll explore the types of datasets that contain a form of private information known as Personally Identifiable Information (PII), the problems they present, and a new feature we're experimenting with on the Dataset Hub to help address these challenges.



Types of Datasets with PII

We noticed two types of datasets that contain PII:

  1. Annotated PII datasets: Datasets like PII-Masking-300k by Ai4Privacy are specifically designed to train PII detection models, which are used to detect and mask PII. For example, these models can help with online content moderation or provide anonymized databases.
  2. Pre-training datasets: These are large-scale datasets, often terabytes in size, that are typically obtained through web crawls. While these datasets are generally filtered to remove certain types of PII, small amounts of sensitive information can still slip through the cracks due to the sheer volume of data and the imperfections of PII detection models.



The Challenges of PII in ML Datasets

The presence of PII in ML datasets can create several challenges for practitioners.
First and foremost, it raises privacy concerns, as the data can be used to infer sensitive details about individuals.
Moreover, PII can degrade the performance of ML models if it is not properly handled.
For example, if a model is trained on a dataset containing PII, it may learn to associate certain PII with specific outcomes, leading to biased predictions or to the model reproducing PII from the training set.



A New Experiment on the Dataset Hub: Presidio Reports

To help address these challenges, we're experimenting with a new feature on the Dataset Hub that uses Presidio, an open-source, state-of-the-art PII detection tool.
Presidio relies on detection patterns and machine learning models to identify PII.
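To give a feel for the pattern-based half of that approach, here is a minimal sketch of regex-driven PII detection in plain Python. The patterns and entity names below are illustrative placeholders, not Presidio's actual recognizers (which also combine context words and NER model scores):

```python
import re

# Hypothetical detection patterns, loosely modeled on the kind of
# regex recognizers that tools like Presidio combine with ML models.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]*\w"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan(text):
    """Return (entity_type, matched_text) pairs found in `text`."""
    findings = []
    for entity, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((entity, match.group()))
    return findings

print(scan("Contact jane.doe@example.com, SSN 078-05-1120."))
# → [('EMAIL_ADDRESS', 'jane.doe@example.com'), ('US_SSN', '078-05-1120')]
```

Pure regexes like these are fast but brittle, which is why Presidio layers NER models and context-aware scoring on top of them.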

With this new feature, users will be able to see a report that estimates the presence of PII in a dataset.
This information can be useful for ML practitioners, helping them make informed decisions before training a model.
For example, if the report indicates that a dataset contains sensitive PII, practitioners may choose to further filter the dataset using tools like Presidio.

Dataset owners can also benefit from this feature by using the reports to validate their PII filtering processes before releasing a dataset.



An Example of a Presidio Report

Let’s take a look at an example of a Presidio report for this pre-training dataset:

[Image: Presidio report]

In this case, Presidio has detected small amounts of emails and sensitive PII in the dataset.



Conclusion

The presence of PII in ML datasets is an evolving challenge for the ML community.
At Hugging Face, we’re committed to transparency and helping practitioners navigate these challenges.
By experimenting with new features like Presidio reports on the Dataset Hub, we hope to empower users to make informed decisions and build more robust and ethical ML models.

We also thank the CNIL for their help with GDPR compliance.
Their guidance has been invaluable in navigating the complexities of AI and private data issues.
Take a look at their updated AI how-to sheets here.

Stay tuned for more updates on this exciting development!


