SLM pretrained from scratch
A suite of high-quality Chinese datasets for pretraining, fine-tuning, or preference alignment, along with the models trained on these datasets.
- opencsg/Fineweb-Edu-Chinese-V2.1 • 958M • 22.2k • 40
- OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training • Paper • 2501.08197 • 8
- opencsg/chinese-fineweb-edu-v2 • 188M • 1.34k • 64
- opencsg/chinese-fineweb-edu • 84.6M • 5.66k • 104
CodeLlama fine-tuned by OpenCSG
synthetic datasets
StarCoder fine-tuned by OpenCSG