Update ue8m0 config & description

Files changed: README.md (+8 -1), config.json (+2 -1)

README.md
CHANGED
```diff
@@ -52,7 +52,9 @@ DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode.
 
 - **Higher thinking efficiency**: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.
 
-DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens.
+DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens.
+
+Additionally, DeepSeek-V3.1 is trained using the **UE8M0 FP8 scale data format on both model weights and activations** to ensure compatibility with microscaling data formats. Please refer to [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) for more details.
 
 ## Model Downloads
 
```
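For context on the new paragraph: UE8M0 is the exponent-only 8-bit scale encoding from the OCP microscaling (MX) family, so every scale factor is an exact power of two. Below is a minimal sketch of that encoding, assuming FP8 E4M3 tensors (max finite magnitude 448); the helper names and constants are illustrative, not DeepGEMM's actual API.

```python
import math

E8M0_BIAS = 127       # an encoded byte e represents the scale 2**(e - 127)
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def encode_ue8m0(scale: float) -> int:
    """Round a positive scale up to the nearest power of two and encode it.

    Rounding up keeps amax / scale inside FP8 range after quantization.
    """
    exp = math.ceil(math.log2(scale))
    return max(0, min(254, exp + E8M0_BIAS))  # 255 is reserved (NaN)

def decode_ue8m0(byte: int) -> float:
    return 2.0 ** (byte - E8M0_BIAS)

# Per-block scale for a block whose largest absolute value is `amax`:
amax = 13.7
byte = encode_ue8m0(amax / FP8_E4M3_MAX)
print(decode_ue8m0(byte))  # 0.03125 = 2**-5, smallest power of two >= amax/448
```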
```diff
@@ -196,6 +198,11 @@ tokenizer.apply_chat_template(messages, tokenize=False, thinking=False, add_generation_prompt=True)
 
 The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running this model locally.
 
+**Usage Recommendations:**
+
+1. **The `mlp.gate.e_score_correction_bias` parameters should be loaded and computed in FP32 precision.**
+2. **Ensure that FP8 model weights and activations are formatted using the UE8M0 scale format.**
+
 ## License
 
 This repository and the model weights are licensed under the [MIT License](LICENSE).
```
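Usage recommendation 1 can be applied when loading the checkpoint. A minimal sketch, assuming a PyTorch module and the parameter suffix quoted in the README; the helper name is hypothetical:

```python
import torch

def promote_gate_bias_to_fp32(model: torch.nn.Module) -> None:
    """Force the MoE router bias onto FP32, per the README recommendation.

    The suffix below matches the parameter name quoted in the README; the
    rest of the model may keep its original (e.g. FP8/BF16) dtypes.
    """
    for name, param in model.named_parameters():
        if name.endswith("mlp.gate.e_score_correction_bias"):
            param.data = param.data.float()  # load and compute in FP32
```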
config.json
CHANGED

```diff
@@ -41,7 +41,8 @@
     "weight_block_size": [
       128,
       128
-    ]
+    ],
+    "scale_fmt": "ue8m0"
   },
   "rms_norm_eps": 1e-06,
   "rope_scaling": {
```
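A quick way to verify that a downloaded copy carries the new field is to inspect the config directly. A small sketch; the enclosing `quantization_config` key is assumed from DeepSeek-V3-style configs, since the diff shows only the inner fields:

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

quant = cfg["quantization_config"]  # assumed enclosing key
assert quant["weight_block_size"] == [128, 128]
assert quant.get("scale_fmt") == "ue8m0", "weights/activations must use UE8M0 scales"
```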