Space: C10X/Dataset-Quality-Scorer

C10X committed 15 days ago

Commit ec4ab54 (verified) · 1 parent: c7ea9a1

Delete readme.md

Files changed (1)

readme.md +0 -73

readme.md DELETED

@@ -1,73 +0,0 @@
----
-title: Dataset_Quality_Scorer
-emoji: 🎯
-colorFrom: purple
-colorTo: blue
-sdk: gradio
-sdk_version: 4.20.0
-app_file: app.py
-pinned: true
-license: apache-2.0
-models:
-  - openbmb/Ultra-FineWeb-classifier
----
-Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
-## Features
-- 📊 **Fast Quality Scoring**: Process thousands of samples quickly using FastText
-- 🤗 **Hub Integration**: Direct search and load from Hugging Face datasets
-- 📈 **Visual Analytics**: Quality distribution plots and detailed statistics
-- ☁️ **One-Click Upload**: Share your scored datasets on Hugging Face Hub
-- 🔒 **Private Repos**: Option to create private scored datasets
-- 📱 **Mobile Friendly**: Responsive design works on all devices
-## How It Works
-1. **Select Dataset**: Search and select any text dataset from Hugging Face Hub
-2. **Configure**: Choose split, text column, and sample size
-3. **Score**: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
-4. **Analyze**: View distribution plots and quality statistics
-5. **Share**: Upload scored dataset to your Hugging Face account
-## Quality Score Interpretation
-- 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
-- 🟡 **Medium Quality (0.5-0.8)**: Acceptable quality with some issues
-- 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
-## Model Information
-This space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a FastText model trained to assess text quality based on the FineWeb dataset standards.
-## API Usage
-You can also use this scorer programmatically:
-```python
-from datasets import load_dataset
-import requests
-# Load and score your dataset
-dataset = load_dataset("your-dataset")
-# ... scoring logic
-```
-## Limitations
-- Maximum 100,000 samples per run
-- Text-only datasets supported
-- English language optimized
-- First run downloads ~350MB model
-## Privacy & Security
-- Login required only for uploading to Hub
-- Datasets are processed locally in the Space
-- No data is stored permanently
-## Credits
-Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.
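
Reviewer note: the removed "Quality Score Interpretation" section describes three score bands, and the removed "API Usage" snippet leaves the scoring logic elided. The following is a minimal sketch of how those bands could be applied to a list of scores, assuming scores are floats in the 0-1 range as the README states. The names `quality_bucket` and `summarize_scores` are hypothetical and do not come from the Space's `app.py`; the sketch does not call the Ultra-FineWeb classifier itself.

```python
from collections import Counter
from typing import Dict, Iterable, Union

# Thresholds taken from the removed README text:
# high is >= 0.8, medium is 0.5 up to (but not including) 0.8, low is < 0.5.
HIGH_THRESHOLD = 0.8
MEDIUM_THRESHOLD = 0.5


def quality_bucket(score: float) -> str:
    """Map a 0-1 quality score to 'high', 'medium', or 'low' (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score!r}")
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def summarize_scores(scores: Iterable[float]) -> Dict[str, Union[int, float]]:
    """Count scores per band and compute the mean (hypothetical helper)."""
    values = list(scores)
    counts = Counter(quality_bucket(s) for s in values)
    mean = sum(values) / len(values) if values else 0.0
    return {
        "count": len(values),
        "mean": mean,
        "high": counts["high"],
        "medium": counts["medium"],
        "low": counts["low"],
    }


if __name__ == "__main__":
    # Example scores are made up for illustration only.
    print(summarize_scores([0.9, 0.6, 0.1]))
```

A score of exactly 0.8 falls in the high band here, which follows the "≥0.8" wording in the removed text; the README does not say which band a score of exactly 0.5 belongs to beyond the "0.5-0.8" range, so treating it as medium is an assumption.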