C10X committed · Commit ec4ab54 · verified · 1 Parent(s): c7ea9a1

Delete readme.md

Files changed (1)
  1. readme.md +0 -73
readme.md DELETED
@@ -1,73 +0,0 @@
- ---
- title: Dataset_Quality_Scorer
- emoji: 🎯
- colorFrom: purple
- colorTo: blue
- sdk: gradio
- sdk_version: 4.20.0
- app_file: app.py
- pinned: true
- license: apache-2.0
- models:
- - openbmb/Ultra-FineWeb-classifier
- ---
-
-
- Score your text datasets using the Ultra-FineWeb classifier for quality assessment.
-
- ## Features
-
- - 📊 **Fast Quality Scoring**: Process thousands of samples quickly using FastText
- - 🤗 **Hub Integration**: Direct search and load from Hugging Face datasets
- - 📈 **Visual Analytics**: Quality distribution plots and detailed statistics
- - ☁️ **One-Click Upload**: Share your scored datasets on Hugging Face Hub
- - 🔒 **Private Repos**: Option to create private scored datasets
- - 📱 **Mobile Friendly**: Responsive design works on all devices
-
- ## How It Works
-
- 1. **Select Dataset**: Search and select any text dataset from Hugging Face Hub
- 2. **Configure**: Choose split, text column, and sample size
- 3. **Score**: The Ultra-FineWeb classifier scores each text (0-1 quality scale)
- 4. **Analyze**: View distribution plots and quality statistics
- 5. **Share**: Upload scored dataset to your Hugging Face account
-
- ## Quality Score Interpretation
-
- - 🟢 **High Quality (≥0.8)**: Well-written, coherent, informative text
- - 🟡 **Medium Quality (0.5-0.8)**: Acceptable quality with some issues
- - 🔴 **Low Quality (<0.5)**: Poor quality, may contain errors or low coherence
-
- ## Model Information
-
- This space uses the [Ultra-FineWeb classifier](https://huggingface.co/openbmb/Ultra-FineWeb-classifier), a FastText model trained to assess text quality based on the FineWeb dataset standards.
-
- ## API Usage
-
- You can also use this scorer programmatically:
-
- ```python
- from datasets import load_dataset
- import requests
-
- # Load and score your dataset
- dataset = load_dataset("your-dataset")
- # ... scoring logic
- ```
-
- ## Limitations
-
- - Maximum 100,000 samples per run
- - Text-only datasets supported
- - English language optimized
- - First run downloads ~350MB model
-
- ## Privacy & Security
-
- - Login required only for uploading to Hub
- - Datasets are processed locally in the Space
- - No data is stored permanently
-
- ## Credits
-
- Built by the C10X team using the Ultra-FineWeb classifier from OpenBMB.
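
For reference, the score bands listed under "Quality Score Interpretation" in the deleted README can be sketched as a small helper. This is a minimal illustration, not code from the Space: the function name `quality_band` and the band labels are hypothetical, and only the cut-offs (0.8 and 0.5) and the 0-1 scale come from the README text.

```python
def quality_band(score: float) -> str:
    """Map a 0-1 quality score to the band described in the README.

    Hypothetical helper: the name and return labels are assumptions;
    the thresholds follow the README (>=0.8 high, 0.5-0.8 medium, <0.5 low).
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score!r}")
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Example scores chosen purely for illustration.
    for s in (0.92, 0.8, 0.65, 0.5, 0.12):
        print(f"{s:.2f} -> {quality_band(s)}")
```

Because the README gives 0.5-0.8 as the medium range and ≥0.8 as high, this sketch assigns a score of exactly 0.8 to the high band and exactly 0.5 to the medium band.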