Fix dataset composition percentages and token counts
README.md
CHANGED
@@ -60,9 +60,9 @@ This model demonstrates the effectiveness of careful dataset composition for eff

 The model was trained on **1 billion tokens** with the following composition:

-- **
+- **50%** - FinePDFs (500M tokens): High-quality PDF content
 - **30%** - DCLM Baseline (300M tokens): Filtered web content
-- **
+- **20%** - FineWeb-Edu (200M tokens): Educational web content

 This 50-30-20 mixing ratio was identified through systematic experimentation as optimal for balanced performance across multiple domains.
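
The corrected figures can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the model's training code: the helper name `token_budget` and the constant names are hypothetical, and only the 1 billion token total and the 50-30-20 split come from the README.

```python
# Hypothetical helper for checking the README figures; not the author's code.
TOTAL_TOKENS = 1_000_000_000  # 1 billion tokens, as stated in the README

# Mixing ratio from the README, in percent.
MIX_PERCENT = {
    "FinePDFs": 50,
    "DCLM Baseline": 30,
    "FineWeb-Edu": 20,
}


def token_budget(total_tokens, mix_percent):
    """Split a total token count across datasets by integer percentage."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("mixing percentages must sum to 100")
    return {name: total_tokens * pct // 100 for name, pct in mix_percent.items()}


budget = token_budget(TOTAL_TOKENS, MIX_PERCENT)
for name, tokens in budget.items():
    print(f"{name}: {tokens // 1_000_000}M tokens")
```

Running it prints 500M, 300M, and 200M tokens for FinePDFs, DCLM Baseline, and FineWeb-Edu respectively, matching the per-dataset counts in the updated README.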