Michael Anthony PRO
MikeDoes
AI & ML interests
Privacy, Large Language Models, Explainable AI
Recent Activity
posted an update about 3 hours ago
State-of-the-art AI doesn't start with a model. It starts with the data.
Achieving near-perfect accuracy for PII & PHI anonymization is one of the toughest challenges in NLP. A model is only as good as the data it learns from, and providing this foundational layer is central to our mission. The ai4privacy/pii-masking-400k dataset was built for this exact purpose: to serve as a robust, large-scale, open-source training ground for building high-precision privacy tools.
To see the direct impact of this data-first approach, look at the ner_deid_aipii model for Healthcare NLP by John Snow Labs. Trained on our 400,000 labeled examples, the model achieved incredible performance:
100% F1-score on EMAIL detection.
99% F1-score on PHONE detection.
97% F1-score on NAME detection.
This is the result of combining a cutting-edge architecture with a comprehensive, high-quality dataset. We provide the open-source foundation so developers can build better, safer solutions.
Explore the dataset that helps power these next-generation privacy tools: https://huggingface.co/datasets/ai4privacy/pii-masking-400k
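If you want to poke around the data yourself, here is a minimal sketch using the Hugging Face datasets library. It deliberately prints the available splits and columns rather than assuming a schema, so check the dataset card for the exact field names before building on it:

```python
# Minimal sketch: load ai4privacy/pii-masking-400k and inspect one record.
# Split and column names are not hard-coded on purpose -- check the dataset
# card for the exact schema.
from datasets import load_dataset

ds = load_dataset("ai4privacy/pii-masking-400k")
print(ds)                              # available splits and their column names

first_split = next(iter(ds.values()))  # grab whichever split comes first
print(first_split[0])                  # one labeled example with its PII annotations
```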
🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/
#DataPrivacy #AI #OpenSource #Anonymization #MachineLearning #HealthcareAI #Ai4Privacy
reacted to their post with 🚀 5 days ago
posted an update 5 days ago
Can you teach a giant like Google's Gemini to protect user privacy? A new step-by-step guide shows that the answer is a resounding "yes."
While powerful, large language models aren't specialized for privacy tasks. This tutorial by Analytics Vidhya walks through how to fine-tune Gemini into a dedicated tool for PII anonymization.
To teach the model this critical skill, the author needed a robust dataset with thousands of clear 'before' and 'after' examples.
We're thrilled they chose the Ai4Privacy pii-masking-200k dataset for this task. Our data provided the high-quality paired examples of masked and unmasked text needed to train Gemini to identify and hide sensitive information accurately.
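As a rough illustration (not the tutorial's exact code), the fine-tuning data boils down to simple before/after pairs. The column names unmasked_text and masked_text below are assumptions, so verify them against the dataset card:

```python
# Rough sketch, not the Analytics Vidhya tutorial's code: shape pii-masking-200k
# rows into "before"/"after" pairs for a supervised fine-tuning run. The column
# names "unmasked_text" and "masked_text" are assumptions -- verify them first.
import json
from datasets import load_dataset

ds = load_dataset("ai4privacy/pii-masking-200k", split="train")

with open("pii_pairs.jsonl", "w", encoding="utf-8") as f:
    for row in ds.select(range(1_000)):          # small subset for a quick run
        pair = {
            "text_input": row["unmasked_text"],  # original text containing PII
            "output": row["masked_text"],        # same text with PII placeholders
        }
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```

The resulting JSONL can then be handed to whatever tuning workflow you prefer; the tutorial linked below walks through the Gemini-specific steps.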
This is a perfect example of how the community can use open-source data to add a crucial layer of safety to the world's most powerful models. Great work!
🔗 Check out the full tutorial here: https://www.analyticsvidhya.com/blog/2024/03/guide-to-fine-tuning-gemini-for-masking-pii-data/
🚀 Stay updated on the latest in privacy-preserving AI—follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/
#DataPrivacy #AI #LLM #FineTuning #Anonymization #GoogleGemini #Ai4Privacy (World's largest open privacy masking dataset)