C10X committed on
Commit c7ea9a1 · verified · 1 Parent(s): 400d8ff

Update readme.md

Files changed (1)
  1. readme.md +72 -73
readme.md CHANGED
@@ -1,74 +1,73 @@
- ---
- title: Dataset Quality Scorer
- emoji: 🎯
- colorFrom: purple
- colorTo: blue
- sdk: gradio
- sdk_version: 4.20.0
- app_file: app.py
- pinned: true
- license: apache-2.0
- models:
- - openbmb/Ultra-FineWeb-classifier
- ---
-
- # Dataset Quality Scorer 🎯
-
- Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
-
- ## Features
-
- - 📊 **Fast Quality Scoring**: Process thousands of samples quickly using FastText
- - 🤗 **Hub Integration**: Direct search and load from Hugging Face datasets
- - 📈 **Visual Analytics**: Quality distribution plots and detailed statistics
- - ☁️ **One-Click Upload**: Share your scored datasets on Hugging Face Hub
- - 🔒 **Private Repos**: Option to create private scored datasets
- - 📱 **Mobile Friendly**: Responsive design works on all devices
-
- ## How It Works
-
- 1. **Select Dataset**: Search and select any text dataset from Hugging Face Hub
- 2. **Configure**: Choose split, text column, and sample size
- 3. **Score**: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
- 4. **Analyze**: View distribution plots and quality statistics
- 5. **Share**: Upload scored dataset to your Hugging Face account
-
- ## Quality Score Interpretation
-
- - 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
- - 🟡 **Medium Quality (0.5-0.8)**: Acceptable quality with some issues
- - 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
-
- ## Model Information
-
- This space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a FastText model trained to assess text quality based on the FineWeb dataset standards.
-
- ## API Usage
-
- You can also use this scorer programmatically:
-
- ```python
- from datasets import load_dataset
- import requests
-
- # Load and score your dataset
- dataset = load_dataset("your-dataset")
- # ... scoring logic
- ```
-
- ## Limitations
-
- - Maximum 100,000 samples per run
- - Text-only datasets supported
- - English language optimized
- - First run downloads ~350MB model
-
- ## Privacy & Security
-
- - Login required only for uploading to Hub
- - Datasets are processed locally in the Space
- - No data is stored permanently
-
- ## Credits
-
+ ---
+ title: Dataset_Quality_Scorer
+ emoji: 🎯
+ colorFrom: purple
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 4.20.0
+ app_file: app.py
+ pinned: true
+ license: apache-2.0
+ models:
+ - openbmb/Ultra-FineWeb-classifier
+ ---
+
+
+ Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
+
+ ## Features
+
+ - 📊 **Fast Quality Scoring**: Process thousands of samples quickly using FastText
+ - 🤗 **Hub Integration**: Direct search and load from Hugging Face datasets
+ - 📈 **Visual Analytics**: Quality distribution plots and detailed statistics
+ - ☁️ **One-Click Upload**: Share your scored datasets on Hugging Face Hub
+ - 🔒 **Private Repos**: Option to create private scored datasets
+ - 📱 **Mobile Friendly**: Responsive design works on all devices
+
+ ## How It Works
+
+ 1. **Select Dataset**: Search and select any text dataset from Hugging Face Hub
+ 2. **Configure**: Choose split, text column, and sample size
+ 3. **Score**: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
+ 4. **Analyze**: View distribution plots and quality statistics
+ 5. **Share**: Upload scored dataset to your Hugging Face account
+
+ ## Quality Score Interpretation
+
+ - 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
+ - 🟡 **Medium Quality (0.5-0.8)**: Acceptable quality with some issues
+ - 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
+
+ ## Model Information
+
+ This space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a FastText model trained to assess text quality based on the FineWeb dataset standards.
+
+ ## API Usage
+
+ You can also use this scorer programmatically:
+
+ ```python
+ from datasets import load_dataset
+ import requests
+
+ # Load and score your dataset
+ dataset = load_dataset("your-dataset")
+ # ... scoring logic
+ ```
+
+ ## Limitations
+
+ - Maximum 100,000 samples per run
+ - Text-only datasets supported
+ - English language optimized
+ - First run downloads ~350MB model
+
+ ## Privacy & Security
+
+ - Login required only for uploading to Hub
+ - Datasets are processed locally in the Space
+ - No data is stored permanently
+
+ ## Credits
+
  Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.
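
The score bands in the README's "Quality Score Interpretation" section can be sketched as a small helper. This is an illustrative sketch only, not code from the Space's `app.py`: the names `quality_bucket` and `summarize` and the example scores are hypothetical, and only the 0.8 and 0.5 cut-offs come from the README text.

```python
from collections import Counter

# Cut-offs taken from the README: >=0.8 is high, 0.5-0.8 is medium, <0.5 is low.
HIGH_THRESHOLD = 0.8
MEDIUM_THRESHOLD = 0.5


def quality_bucket(score: float) -> str:
    """Map a 0-1 quality score to the README's high/medium/low band (hypothetical helper)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= HIGH_THRESHOLD:
        return "high"
    if score >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def summarize(scores):
    """Count how many scores fall into each band, always reporting all three bands."""
    counts = Counter(quality_bucket(s) for s in scores)
    return {label: counts.get(label, 0) for label in ("high", "medium", "low")}


if __name__ == "__main__":
    example_scores = [0.93, 0.8, 0.61, 0.5, 0.12]  # made-up values for illustration
    print(summarize(example_scores))
```

The boundary values follow the README's wording: a score of exactly 0.8 counts as high and exactly 0.5 counts as medium.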