Add license to metadata and fix broken link in summary
This PR improves the model card by:
- Adding the `apache-2.0` license to the YAML metadata.
- Fixing a broken link in the "Model Summary" section, ensuring it correctly points to the paper.
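The two fixes above can be sanity-checked mechanically. The following is a minimal sketch, not part of the PR itself: `front_matter` and `check_card` are hypothetical helper names, the front-matter parser only handles flat `key: value` lines (it is not a full YAML parser), and "broken link" is assumed to mean an empty Markdown link target of the form `]()`.

```python
import re


def front_matter(text):
    """Return flat `key: value` pairs from a leading `---` block (hypothetical helper)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as no front matter


# Assumption: a "broken link" is a Markdown link whose target is empty, i.e. `]()`.
EMPTY_LINK = re.compile(r"\]\(\s*\)")


def check_card(text):
    """List the problems this PR addresses that are still present in `text`."""
    problems = []
    if "license" not in front_matter(text):
        problems.append("missing license")
    if EMPTY_LINK.search(text):
        problems.append("empty link target")
    return problems
```

Running `check_card` on the README text before and after the change should report both problems for the old version and none for the new one.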
README.md
CHANGED
@@ -1,8 +1,11 @@
 ---
-pipeline_tag: text-classification
 library_name: fasttext
+pipeline_tag: text-classification
+license: apache-2.0
 ---
 
+# Predictive Data Selection: The Data That Predicts Is the Data That Teaches
+
 <p align="center">
 📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a>    |    🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a>    |    🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a>    |    📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
 <br>
@@ -10,8 +13,7 @@ library_name: fasttext
 
 
 ## Model Summary
-This is a fastText-based binary classifier for identifying high-quality data in the pretraining corpus introduced in paper:
-](). And this is also the classifier we used to build [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
+This is a fastText-based binary classifier for identifying high-quality data in the pretraining corpus introduced in paper: [Predictive Data Selection: The Data That Predicts Is the Data That Teaches](https://arxiv.org/abs/2503.00808). And this is also the classifier we used to build [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
 The positive label name and negative label name are "__label__1" and "__label__0" respectively.
 
 ## How to use
@@ -47,7 +49,7 @@ dist_executor.run()
 
 ## Training
 For more training details, you can refer to the paper and the training code is available on GitHub
-[PreSelect](https://github.com/hkust-nlp/
+[PreSelect](https://github.com/hkust-nlp/PreSelect).
 
 ## Citation
 If you find this work helpful, please kindly cite as: