codelion committed on
Commit ea63110 · verified · 1 Parent(s): 5045572

Fix dataset composition percentages and token counts

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -60,9 +60,9 @@ This model demonstrates the effectiveness of careful dataset composition for eff
 
 The model was trained on **1 billion tokens** with the following composition:
 
-- **40%** - FinePDFs (400M tokens): High-quality PDF content
+- **50%** - FinePDFs (500M tokens): High-quality PDF content
 - **30%** - DCLM Baseline (300M tokens): Filtered web content
-- **30%** - FineWeb-Edu (300M tokens): Educational web content
+- **20%** - FineWeb-Edu (200M tokens): Educational web content
 
 This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.
 
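
The arithmetic behind this fix can be checked mechanically. The sketch below is not part of the repository: `check_composition` and the constant names are hypothetical, and the figures are copied from the README lines in this diff. It verifies that the percentages sum to 100, that each token count matches its share of the 1 billion total, and that the list agrees with the ratio named in the prose.

```python
# Illustrative check only: the names below are hypothetical and not part of the repository.
TOTAL_TOKENS = 1_000_000_000  # "1 billion tokens", as stated in the README

# dataset -> (share in percent, stated token count), values as of this commit
COMPOSITION = {
    "FinePDFs": (50, 500_000_000),
    "DCLM Baseline": (30, 300_000_000),
    "FineWeb-Edu": (20, 200_000_000),
}


def check_composition(composition, total_tokens, stated_ratio):
    """Return a list of inconsistencies; an empty list means the figures agree."""
    problems = []

    # The shares must cover the whole training budget.
    pct_sum = sum(pct for pct, _ in composition.values())
    if pct_sum != 100:
        problems.append(f"percentages sum to {pct_sum}, expected 100")

    # Each listed token count must match its percentage of the total.
    for name, (pct, tokens) in composition.items():
        expected = total_tokens * pct // 100
        if tokens != expected:
            problems.append(f"{name}: {tokens} tokens listed, {pct}% implies {expected}")

    # The prose names the ratio separately, so compare it with the list order.
    ratio = "-".join(str(pct) for pct, _ in composition.values())
    if ratio != stated_ratio:
        problems.append(f"list gives {ratio}, prose states {stated_ratio}")

    return problems


if __name__ == "__main__":
    print(check_composition(COMPOSITION, TOTAL_TOKENS, "50-30-20"))
```

Run against the pre-commit figures (40%, 30%, 30%), the same check reports a single problem: the list was internally consistent but disagreed with the 50-30-20 ratio stated in the following sentence, which is the mismatch this commit resolves.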