Spaces: C10X/Dataset-Quality-Scorer

C10X committed 16 days ago

Commit c7ea9a1 (verified) · 1 parent: 400d8ff

Update readme.md

Files changed (1)

readme.md +72 -73

@@ -1,74 +1,73 @@
----
-title: Dataset Quality Scorer
-emoji: 🎯
-colorFrom: purple
-colorTo: blue
-sdk: gradio
-sdk_version: 4.20.0
-app_file: app.py
-pinned: true
-license: apache-2.0
-models:
-  - openbmb/Ultra-FineWeb-classifier
----
-# Dataset Quality Scorer 🎯
-Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
-## Features
-- 📊 **Fast Quality Scoring**: Process thousands of samples quickly using FastText
-- 🤗 **Hub Integration**: Direct search and load from Hugging Face datasets
-- 📈 **Visual Analytics**: Quality distribution plots and detailed statistics
-- ☁️ **One-Click Upload**: Share your scored datasets on Hugging Face Hub
-- 🔒 **Private Repos**: Option to create private scored datasets
-- 📱 **Mobile Friendly**: Responsive design works on all devices
-## How It Works
-1. **Select Dataset**: Search and select any text dataset from Hugging Face Hub
-2. **Configure**: Choose split, text column, and sample size
-3. **Score**: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
-4. **Analyze**: View distribution plots and quality statistics
-5. **Share**: Upload scored dataset to your Hugging Face account
-## Quality Score Interpretation
-- 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
-- 🟡 **Medium Quality (0.5-0.8)**: Acceptable quality with some issues
-- 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
-## Model Information
-This space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a FastText model trained to assess text quality based on the FineWeb dataset standards.
-## API Usage
-You can also use this scorer programmatically:
-```python
-from datasets import load_dataset
-import requests
-# Load and score your dataset
-dataset = load_dataset("your-dataset")
-# ... scoring logic
-```
-## Limitations
-- Maximum 100,000 samples per run
-- Text-only datasets supported
-- English language optimized
-- First run downloads ~350MB model
-## Privacy & Security
-- Login required only for uploading to Hub
-- Datasets are processed locally in the Space
-- No data is stored permanently
-## Credits
 Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.

+---
+title: Dataset_Quality_Scorer
+emoji: 🎯
+colorFrom: purple
+colorTo: blue
+sdk: gradio
+sdk_version: 4.20.0
+app_file: app.py
+pinned: true
+license: apache-2.0
+models:
+  - openbmb/Ultra-FineWeb-classifier
+---
+Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
+## Features
+- 📊 **Fast Quality Scoring**: Process thousands of samples quickly using FastText
+- 🤗 **Hub Integration**: Direct search and load from Hugging Face datasets
+- 📈 **Visual Analytics**: Quality distribution plots and detailed statistics
+- ☁️ **One-Click Upload**: Share your scored datasets on Hugging Face Hub
+- 🔒 **Private Repos**: Option to create private scored datasets
+- 📱 **Mobile Friendly**: Responsive design works on all devices
+## How It Works
+1. **Select Dataset**: Search and select any text dataset from Hugging Face Hub
+2. **Configure**: Choose split, text column, and sample size
+3. **Score**: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
+4. **Analyze**: View distribution plots and quality statistics
+5. **Share**: Upload scored dataset to your Hugging Face account
+## Quality Score Interpretation
+- 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
+- 🟡 **Medium Quality (0.5 to <0.8)**: Acceptable quality with some issues
+- 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
+## Model Information
+This Space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a fastText model trained to assess text quality against the FineWeb dataset standards.
+## API Usage
+You can also use this scorer programmatically:
+```python
+from datasets import load_dataset
+import requests
+# Load and score your dataset
+dataset = load_dataset("your-dataset")
+# ... scoring logic
+```
+## Limitations
+- Maximum 100,000 samples per run
+- Text-only datasets supported
+- English language optimized
+- First run downloads ~350MB model
+## Privacy & Security
+- Login required only for uploading to Hub
+- Datasets are processed locally in the Space
+- No data is stored permanently
+## Credits
 Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.
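
The score bands in the README's "Quality Score Interpretation" section can be sketched as a small helper. This is an illustration only, not code from `app.py`: the names `quality_band` and `summarize_bands` are hypothetical, and the sketch assumes the medium band is half-open (0.5 inclusive, 0.8 exclusive), so that a score of exactly 0.8 counts as high quality.

```python
from collections import Counter

# Band edges taken from the README's "Quality Score Interpretation" list.
HIGH_THRESHOLD = 0.8
MEDIUM_THRESHOLD = 0.5


def quality_band(score: float) -> str:
    """Map a 0-1 quality score to a band name (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def summarize_bands(scores: list[float]) -> dict[str, int]:
    """Count how many scores fall into each band."""
    counts = Counter(quality_band(s) for s in scores)
    return {band: counts.get(band, 0) for band in ("high", "medium", "low")}


# Example with made-up scores, not real classifier output.
print(summarize_bands([0.95, 0.8, 0.62, 0.5, 0.31]))
# → {'high': 2, 'medium': 2, 'low': 1}
```

Keeping the thresholds as named constants makes it straightforward to adjust the bands if the Space defines its boundaries differently.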