---
title: Anonyspark
emoji: π
colorFrom: green
colorTo: gray
sdk: static
pinned: false
---
# anonyspark

`anonyspark` is a lightweight Python package for schema-driven **data masking and anonymization** in **PySpark DataFrames**. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce **data privacy**, **PII redaction**, and **regulatory compliance** (e.g., HIPAA, GDPR).

---

## Motivation

In enterprise data pipelines, personally identifiable information (PII) and sensitive fields are often left exposed in logs, training data, or staging zones. `anonyspark` solves this by enabling **deterministic and schema-aware masking** of such fields **directly in Spark**, without leaving the distributed environment.

---
## Key Features

- **Schema-driven masking** based on column types or names
- Supports **regex**, **nulling**, **hashing**, or **custom UDF-based** masking
- Designed for **PySpark DataFrames**, not pandas
- Lightweight, dependency-free, and easy to integrate
- CLI-ready for pipeline integration (coming soon)
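
To make the masking strategies above concrete, here is a minimal sketch of schema-driven masking. It is not the `anonyspark` API: `hash_value`, `mask_email`, `null_value`, `STRATEGIES`, `MASK_SCHEMA`, `mask_row`, and `mask_dataframe` are hypothetical names chosen for illustration, and the salt is a placeholder. The helpers are plain Python so they can be unit-tested without a cluster, and `mask_dataframe` wraps them as PySpark UDFs. For large jobs, Spark's built-in `sha2` and `regexp_replace` functions are usually faster than Python UDFs.

```python
import hashlib
import re

# NOTE: every name below is hypothetical and NOT part of the anonyspark API.


def hash_value(value, salt="demo-salt"):
    """Hashing: deterministic SHA-256 digest, so equal inputs map to equal outputs."""
    if value is None:
        return None
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def mask_email(value):
    """Regex masking: hide the local part of an email address, keep the domain."""
    if value is None:
        return None
    return re.sub(r"^[^@]+", "***", value)


def null_value(value):
    """Nulling: discard the value entirely."""
    return None


# Strategy name -> masking function.
STRATEGIES = {"hash": hash_value, "regex_email": mask_email, "null": null_value}

# Illustrative masking schema: column name -> strategy name.
MASK_SCHEMA = {"user_id": "hash", "email": "regex_email", "ssn": "null"}


def mask_row(row, schema=MASK_SCHEMA):
    """Apply the schema to a single record given as a dict (useful for tests)."""
    return {
        key: STRATEGIES[schema[key]](value) if key in schema else value
        for key, value in row.items()
    }


def mask_dataframe(df, schema=MASK_SCHEMA):
    """Apply the same schema to a PySpark DataFrame using Python UDFs.

    Masked columns come back as strings (or nulls); other columns are untouched.
    """
    # Imported lazily so the pure-Python helpers above work without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    for column, strategy in schema.items():
        if column in df.columns:
            masker = F.udf(STRATEGIES[strategy], StringType())
            df = df.withColumn(column, masker(F.col(column)))
    return df
```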
---

## Use Cases

- Mask PII fields in ETL pipelines before storage or ML training
- Anonymize user data before model sharing or analytics
- Simulate production-like data in dev/test environments
- Help comply with HIPAA, GDPR, and internal audit policies

---
## Installation

```bash
pip install anonyspark
```

PyPI link: https://pypi.org/project/anonyspark/

License: MIT License