	---
	title: Anonyspark
	emoji: 📈
	colorFrom: green
	colorTo: gray
	sdk: static
	pinned: false
	---

	# anonyspark

	`anonyspark` is a lightweight Python package for schema-driven data masking and anonymization in PySpark DataFrames. Designed for ML engineers, data analysts, and compliance teams working with sensitive data in big data environments, it helps enforce data privacy, PII redaction, and regulatory compliance (e.g., HIPAA, GDPR).

	---

	## Motivation

	In enterprise data pipelines, personally identifiable information (PII) and sensitive fields are often left exposed in logs, training data, or staging zones. `anonyspark` solves this by enabling deterministic and schema-aware masking of such fields directly in Spark, without leaving the distributed environment.
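
Deterministic masking matters because the same input always maps to the same token, so masked tables can still be joined on the masked column. The snippet below is a minimal sketch of that property in plain Python, not `anonyspark`'s own implementation; `pseudonymize`, the key, and the sample records are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Keyed, deterministic token: same value and key give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would come from a secret store.
key = b"example-secret"

# Hypothetical records from two tables that share an email column.
users = [{"user_id": "u1", "email": "ann@example.com"}]
orders = [{"order_id": 7, "email": "ann@example.com"}]

masked_users = [{**u, "email": pseudonymize(u["email"], key)} for u in users]
masked_orders = [{**o, "email": pseudonymize(o["email"], key)} for o in orders]

# The raw email is gone, but both tables carry the same token,
# so a join on "email" still matches the same rows.
```

Using a keyed hash (HMAC) rather than a bare hash is one design choice among several: it makes tokens harder to reverse by hashing guessed values, at the cost of having to manage the key.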

	---

	## Key Features

	- Schema-driven masking based on column types or names
	- Supports regex, nulling, hashing, or custom UDF-based masking
	- Designed for PySpark DataFrames, not pandas
	- Lightweight, dependency-free, and easy to integrate
	- CLI for pipeline integration (coming soon)
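
The masking strategies listed above can be sketched with standard PySpark calls. This is an illustration under stated assumptions, not `anonyspark`'s actual API: the rule names, `MASKING_SCHEMA`, `mask_record`, and `mask_dataframe` are all hypothetical. The rules are plain Python functions, so they can be checked locally and then applied in Spark as UDFs.

```python
import hashlib
import re


# Hypothetical masking rules; the names are illustrative only.
def hash_value(value):
    """Hashing: deterministic SHA-256 hex digest of the value."""
    if value is None:
        return None
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def redact_digits(value):
    """Regex masking: replace every digit with '*'."""
    if value is None:
        return None
    return re.sub(r"\d", "*", str(value))


def null_out(value):
    """Nulling: drop the value entirely."""
    return None


RULES = {"hash": hash_value, "regex": redact_digits, "null": null_out}

# Hypothetical schema: column name -> rule name.
MASKING_SCHEMA = {"email": "hash", "phone": "regex", "ssn": "null"}


def mask_record(record, schema=MASKING_SCHEMA):
    """Apply the schema to one dict-shaped record (useful for local checks)."""
    return {
        name: RULES[schema[name]](value) if name in schema else value
        for name, value in record.items()
    }


def mask_dataframe(df, schema=MASKING_SCHEMA):
    """Apply the same rules to a PySpark DataFrame through UDFs."""
    # Imported lazily so the rules above can be tested without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    for column, rule in schema.items():
        if column in df.columns:  # skip columns the DataFrame does not have
            masker = F.udf(RULES[rule], StringType())
            df = df.withColumn(column, masker(F.col(column)))
    return df
```

With an active `SparkSession`, `mask_dataframe(df)` would return `df` with `email` hashed, the digits in `phone` replaced by `*`, and `ssn` set to null; every masked column comes back as a string column. Python UDFs are used here only to mirror the "custom UDF-based masking" feature. Built-in functions such as `pyspark.sql.functions.sha2` and `regexp_replace` generally run faster.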

	---

	## Use Cases

	- Mask PII fields in ETL pipelines before storage or ML training
	- Anonymize user data before model sharing or analytics
	- Simulate production-like data in dev/test environments
	- Help comply with HIPAA, GDPR, and internal audit policies

	---

	## Installation

	```bash
	pip install anonyspark
	```

	PyPI: https://pypi.org/project/anonyspark/

	License: MIT