Update readme.md
readme.md
@@ -1,74 +1,73 @@
----
-title:
-emoji: 🎯
-colorFrom: purple
-colorTo: blue
-sdk: gradio
-sdk_version: 4.20.0
-app_file: app.py
-pinned: true
-license: apache-2.0
-models:
-- openbmb/Ultra-FineWeb-classifier
----
-
-
-
-
-
-
-
--
--
--
--
--
-
-
-
-
-
-
-
-
-
-
-
-
--
--
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
--
--
--
-
-
-
-
--
--
-
-
-
-
+---
+title: Dataset_Quality_Scorer
+emoji: 🎯
+colorFrom: purple
+colorTo: blue
+sdk: gradio
+sdk_version: 4.20.0
+app_file: app.py
+pinned: true
+license: apache-2.0
+models:
+- openbmb/Ultra-FineWeb-classifier
+---
+
+
+Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
+
+## Features
+
+- 🚀 **Fast Quality Scoring**: Process thousands of samples quickly using FastText
+- 🤗 **Hub Integration**: Direct search and load from Hugging Face datasets
+- 📊 **Visual Analytics**: Quality distribution plots and detailed statistics
+- ☁️ **One-Click Upload**: Share your scored datasets on Hugging Face Hub
+- 🔒 **Private Repos**: Option to create private scored datasets
+- 📱 **Mobile Friendly**: Responsive design works on all devices
+
+## How It Works
+
+1. **Select Dataset**: Search and select any text dataset from Hugging Face Hub
+2. **Configure**: Choose split, text column, and sample size
+3. **Score**: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
+4. **Analyze**: View distribution plots and quality statistics
+5. **Share**: Upload scored dataset to your Hugging Face account
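
Review note: the configure and score steps listed above can be sketched as a small function. This is a minimal sketch under stated assumptions, not the Space's actual `app.py`: `score_rows`, `MAX_SAMPLES`, `toy_scorer`, and the `quality_score` field name are hypothetical, and the `scorer` argument stands in for whatever wraps the Ultra-FineWeb classifier. Rows are plain dictionaries so the example runs without the Hub or the model.

```python
from typing import Callable, Dict, Iterable, List

# Per-run cap quoted in the README's Limitations section.
MAX_SAMPLES = 100_000


def score_rows(
    rows: Iterable[Dict[str, object]],
    text_column: str,
    sample_size: int,
    scorer: Callable[[str], float],
) -> List[Dict[str, object]]:
    """Score the first `sample_size` rows and attach a 0-1 quality score."""
    limit = max(0, min(sample_size, MAX_SAMPLES))
    scored: List[Dict[str, object]] = []
    for index, row in enumerate(rows):
        if index >= limit:
            break
        raw = float(scorer(str(row[text_column])))
        # Clamp to the 0-1 scale the README describes.
        scored.append({**row, "quality_score": min(1.0, max(0.0, raw))})
    return scored


def toy_scorer(text: str) -> float:
    """Hypothetical stand-in for the real classifier: longer text scores higher."""
    return len(text.split()) / 10


sample = [{"text": "a b c d e"}, {"text": "x"}, {"text": "never scored"}]
print(score_rows(sample, "text", sample_size=2, scorer=toy_scorer))
```

Passing the scorer as a callable keeps the loop independent of how the model is loaded, which makes it easy to test with a stub like `toy_scorer`.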
+
+## Quality Score Interpretation
+
+- 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
+- 🟡 **Medium Quality (0.5-0.8)**: Acceptable quality with some issues
+- 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
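
Review note: the three bands above translate directly into a threshold check. The sketch below is illustrative only; `quality_bucket` and `bucket_counts` are hypothetical names. The README lists 0.8 in both the high and medium ranges, so the sketch assumes a score of exactly 0.8 counts as high, following the "≥0.8" wording.

```python
from collections import Counter
from typing import Dict, Sequence

HIGH_THRESHOLD = 0.8    # "High Quality (>=0.8)"
MEDIUM_THRESHOLD = 0.5  # "Medium Quality (0.5-0.8)"


def quality_bucket(score: float) -> str:
    """Map a 0-1 quality score to the README's three bands."""
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def bucket_counts(scores: Sequence[float]) -> Dict[str, int]:
    """Count how many scores fall into each band, including empty bands."""
    counts = Counter(quality_bucket(score) for score in scores)
    return {name: counts.get(name, 0) for name in ("high", "medium", "low")}


print(bucket_counts([0.93, 0.8, 0.64, 0.5, 0.12]))
```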
+
+## Model Information
+
+This space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a FastText model trained to assess text quality based on the FineWeb dataset standards.
+
+## API Usage
+
+You can also use this scorer programmatically:
+
+```python
+from datasets import load_dataset
+import requests
+
+# Load and score your dataset
+dataset = load_dataset("your-dataset")
+# ... scoring logic
+```
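
Review note: the snippet above leaves the scoring logic elided, and this note does not try to fill it in. As a hedged illustration of one piece only, the sketch below turns a fastText-style prediction (a sequence of labels plus a matching sequence of probabilities, which is the shape fastText's Python `predict` returns for a single string) into a single 0-1 score. The label name `__label__pos`, the binary-classifier assumption, and the function name `quality_from_prediction` are assumptions; check the model card before relying on them.

```python
from typing import Sequence

# Assumed name of the "good quality" label; verify against the model card.
POSITIVE_LABEL = "__label__pos"


def quality_from_prediction(
    labels: Sequence[str],
    probabilities: Sequence[float],
    positive_label: str = POSITIVE_LABEL,
) -> float:
    """Return the probability of the positive label as a 0-1 quality score."""
    for label, probability in zip(labels, probabilities):
        if label == positive_label:
            return float(probability)
    if len(probabilities) == 0:
        return 0.0
    # Positive label missing from the returned labels: for a binary
    # classifier (an assumption here) it is the complement of the top label.
    return 1.0 - float(probabilities[0])


print(quality_from_prediction(("__label__pos",), (0.91,)))
print(quality_from_prediction(("__label__neg",), (0.75,)))
```

With the real model, the inputs would come from a call such as `model.predict(text)`; any text preprocessing the classifier expects is outside the scope of this sketch.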
+
+## Limitations
+
+- Maximum 100,000 samples per run
+- Text-only datasets supported
+- English language optimized
+- First run downloads ~350MB model
+
+## Privacy & Security
+
+- Login required only for uploading to Hub
+- Datasets are processed locally in the Space
+- No data is stored permanently
+
+## Credits
+
 Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.