dreamerdeo committed on
Commit 24ab78d · verified · 1 Parent(s): 7437b0e

Update README.md

Files changed (1)
  1. README.md +35 -12
README.md CHANGED
@@ -36,14 +36,13 @@ Read more details about Sailor2 at https://sailorllm.github.io/blog/sailor2.
  <b><font size="+1">📚 Sailor2 Pre-training Dataset </font></b>
  </summary>
 
- - [Sailor2-pretrain-data-stage1](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage1)
- - [Sailor2-pretrain-data-stage2](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage2)
- - [sea-commoncrawl](https://huggingface.co/datasets/sailor2/sea-commoncrawl)
- - [sea-internet](https://huggingface.co/datasets/sailor2/sea-internet)
- - [sea-commoncrawl](https://huggingface.co/datasets/sailor2/sea-commoncrawl)
- - [sea-pdf-text](https://huggingface.co/datasets/sailor2/sea-pdf-text)
- - [sea-syntheitc](https://huggingface.co/datasets/sailor2/sea-syntheitc)
- - [sea-commoncrawl-high-quality](https://huggingface.co/datasets/sailor2/sea-commoncrawl-high-quality)
+ - [Sailor2-pretrain-data-stage1](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage1): 500B high quality data for model training
+ - [Sailor2-pretrain-data-stage2](https://huggingface.co/datasets/sailor2/sailor2-pretrain-data-stage2): 50B extra high quality data for model annealing
+ - [sea-commoncrawl](https://huggingface.co/datasets/sailor2/sea-commoncrawl): Cleaned and deduplicated commoncrawl
+ - [sea-internet](https://huggingface.co/datasets/sailor2/sea-internet): Cleaned multilingual data from Internet Archive
+ - [sea-pdf-text](https://huggingface.co/datasets/sailor2/sea-pdf-text): Cleaned pdf data
+ - [sea-syntheitc](https://huggingface.co/datasets/sailor2/sea-syntheitc): Translation dataset from Cosmopedia across multiple languages
+ - [sea-commoncrawl-high-quality](https://huggingface.co/datasets/sailor2/sea-commoncrawl-high-quality): extra cleaned and deduplicated commoncrawl
 
  </details>
 
@@ -54,11 +53,35 @@ Read more details about Sailor2 at https://sailorllm.github.io/blog/sailor2.
  <b><font size="+1">📑 Sailor2 Post-training Dataset </font></b>
  </summary>
 
- - [sailor2-sft-stage1](https://huggingface.co/datasets/sailor2/sailor2-sft-stage1)
- - [sailor2-sft-stage2](https://huggingface.co/datasets/sailor2/sailor2-sft-stage2)
- - [sea-ultrafeedback](https://huggingface.co/datasets/sailor2/sea-ultrafeedback)
- - [sea-wildbench](https://huggingface.co/datasets/sailor2/sea-wildbench)
+ - [sailor2-sft-stage1](https://huggingface.co/datasets/sailor2/sailor2-sft-stage1): Medium-Quality Instruction tuning dataset, supports English, Chinese and 15 SEA languages.
+ - [sailor2-sft-stage2](https://huggingface.co/datasets/sailor2/sailor2-sft-stage2): High-Quality Instruction tuning dataset, supports English, Chinese and 15 SEA languages.
+ - [sea-ultrafeedback](https://huggingface.co/datasets/sailor2/sea-ultrafeedback): Preference optimization dataset, supports English, Chinese and 17 SEA languages.
+
+ </details>
+
+ ---
+
+ <details>
+ <summary>
+ <b><font size="+1">🧐 Sailor2 Evaluation Dataset </font></b>
+ </summary>
 
+ - [sea-wildbench](https://huggingface.co/datasets/sailor2/sea-wildbench): Chat model evaluation, supports 8 SEA languages.
+
  </details>
 
  ---
+
+ <details>
+ <summary>
+ <b><font size="+1">💻 Sailor2 Codebase </font></b>
+ </summary>
+
+ - [SailCraft Code](https://github.com/sail-sg/sailcraft): Data cleaning
+ - [Regmix Code](https://github.com/sail-sg/regmix): Data mixture
+ - [SailCompass Code](https://huggingface.co/datasets/sailor2/sailor2-sft-stage1): Few-shot evaluation
+ - [Megatron Code](https://github.com/sail-sg/Megatron-Sailor2): Pretraining-training
+ - [OAT Code](https://github.com/sail-sg/oat): Post-training
+
+ </details>
+