SivilTaram committed
Commit 065c030 · verified · 1 Parent(s): 6e415fc

Update README.md

Files changed (1):
  1. README.md +8 -7
README.md CHANGED
@@ -36,13 +36,14 @@ Read more details about Sailor2 at https://sea-sailor.github.io/blog/sailor2/.
 <b><font size="+1">📚 Sailor2 Pre-training Dataset </font></b>
 </summary>
 
-- [Sailor2-pretrain-data-stage1](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage1): 450B high-quality tokens for model training
-- [Sailor2-pretrain-data-stage2](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage2): 60B extra high-quality tokens for model annealing
-- [sea-commoncrawl](https://huggingface.co/datasets/sailor2/sea-commoncrawl): Cleaned and deduplicated CommonCrawl
-- [sea-internet](https://huggingface.co/datasets/sailor2/sea-internet): Cleaned multilingual data from the Internet Archive
-- [sea-pdf-text](https://huggingface.co/datasets/sailor2/sea-pdf-text): Cleaned PDF data
-- [sea-synthetic](https://huggingface.co/datasets/sailor2/sea-synthetic): Translation dataset from Cosmopedia across multiple languages
-- [sea-commoncrawl-high-quality](https://huggingface.co/datasets/sailor2/sea-commoncrawl-high-quality): Extra cleaned and deduplicated CommonCrawl
+- [Sailor2-pretrain-data-stage1](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage1): A comprehensive dataset of 450B high-quality tokens for continual pre-training, covering English (from [ProX](https://huggingface.co/datasets/gair-prox/FineWeb-pro)), Chinese (from [Chinese-Fineweb-Edu](https://huggingface.co/datasets/opencsg/chinese-fineweb-edu)), Vietnamese, Indonesian, Thai, Malay, Burmese, Tagalog, and Khmer, organized by **chunks**
+- [Sailor2-pretrain-data-stage2](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage2): An additional 60B tokens of exceptionally high-quality data for model annealing, covering the above languages plus Cebuano, Lao, Javanese, Waray, Sundanese, and Ilocano, organized by **chunks**
+- [community-dataset](https://huggingface.co/datasets/sailor2/community-dataset): Clean South-East Asian datasets contributed by community members, including Indonesian, Thai, and Vietnamese content in fields such as news, finance, law, books, poetry, social media, and TED Talks, organized by **source**
+- [sea-commoncrawl](https://huggingface.co/datasets/sailor2/sea-commoncrawl): Clean South-East Asia-related web corpora from 89 CommonCrawl snapshots, organized by **languages**
+- [sea-internet](https://huggingface.co/datasets/sailor2/sea-internet): Clean multilingual data from the Internet Archive, cleaned and deduplicated from the dataset released with [A New Massive Multilingual Dataset for High-Performance Language Technologies](https://arxiv.org/abs/2403.14009), organized by **languages**
+- [sea-pdf-text](https://huggingface.co/datasets/sailor2/sea-pdf-text): Clean PDF data; the PDF links are sourced from partner information, organized by **languages**
+- [sea-synthetic](https://huggingface.co/datasets/sailor2/sea-synthetic): Translations of Cosmopedia into multiple languages, used to retrieve the high-quality tokens for stage 2, organized by **languages**
+- [sea-commoncrawl-high-quality](https://huggingface.co/datasets/sailor2/sea-commoncrawl-high-quality): The high-quality CommonCrawl subset used in stage 2 of Sailor2 pre-training, organized by **languages**
 
 
 </details>
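The datasets listed in the diff can be pulled from the Hub with the Hugging Face `datasets` library. A minimal sketch, assuming the default `train` split exists for the chosen repo and using streaming so the multi-hundred-billion-token corpora are not downloaded in full (the repo id is taken from the list above; the helper name is hypothetical):

```python
def preview_sailor2(repo_id: str, n: int = 3) -> list:
    """Stream the first n records of a Sailor2 pre-training dataset."""
    # Imported lazily so the helper only requires `datasets` when called.
    from datasets import load_dataset

    # streaming=True iterates over the Hub files lazily instead of
    # downloading the entire corpus to local disk first.
    ds = load_dataset(repo_id, split="train", streaming=True)
    return [record for record, _ in zip(ds, range(n))]

# Example usage with a repo id from the list above:
# preview_sailor2("sailor2/sea-commoncrawl")
```

Streaming is the safer default here: stage 1 alone is 450B tokens, so a non-streaming `load_dataset` call would try to materialize the whole corpus locally.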