msr2000 committed
Commit d3d4eaf · 1 Parent(s): c230953

Update ue8m0 config & description

Files changed (2):
  1. README.md +8 -1
  2. config.json +2 -1
README.md CHANGED
@@ -52,7 +52,9 @@ DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode.
 
 - **Higher thinking efficiency**: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.
 
-DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.
+DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens.
+
+Additionally, DeepSeek-V3.1 is trained using the **UE8M0 FP8 scale data format on both model weights and activations** to ensure compatibility with microscaling data formats. Please refer to [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) for more details.
 
 ## Model Downloads
 
@@ -196,6 +198,11 @@ tokenizer.apply_chat_template(messages, tokenize=False, thinking=False, add_generation_prompt=True)
 
 The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running this model locally.
 
+**Usage Recommendations:**
+
+1. **The `mlp.gate.e_score_correction_bias` parameters should be loaded and computed in FP32 precision.**
+2. **Ensure that FP8 model weights and activations are formatted using the UE8M0 scale format.**
+
 ## License
 
 This repository and the model weights are licensed under the [MIT License](LICENSE).
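
For readers unfamiliar with the format this commit documents: UE8M0 is an exponent-only scale encoding in which each block scale is a power of two and only its 8-bit biased exponent is stored, in the spirit of the OCP microscaling (MX) formats. The sketch below shows how such a scale might be chosen for one 128x128 FP8 (e4m3) block; the helper names, the e4m3 constants, and the rounding choice are assumptions for illustration, not DeepGEMM's or this repository's actual implementation.

```python
import math

import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite e4m3 value (assumed target FP8 format)
E8M0_BIAS = 127       # ue8m0 stores only a biased 8-bit exponent, no mantissa


def ue8m0_scale(block: np.ndarray) -> tuple[int, float]:
    """Pick a power-of-two scale for one block and encode its exponent as ue8m0."""
    amax = float(np.max(np.abs(block)))
    if amax == 0.0:
        exponent = 0
    else:
        # Round the ideal scale amax / FP8_E4M3_MAX up to the next power of two
        # so the scaled block is guaranteed to fit in the fp8 range.
        exponent = math.ceil(math.log2(amax / FP8_E4M3_MAX))
    encoded = exponent + E8M0_BIAS  # the uint8 that would be stored
    return encoded, 2.0 ** exponent


def quantize_block(block: np.ndarray) -> tuple[np.ndarray, int]:
    """Scale a 128x128 block for fp8 storage, returning scaled values and the ue8m0 code."""
    code, scale = ue8m0_scale(block)
    scaled = np.clip(block / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return scaled, code  # `scaled` would then be cast to fp8 (e4m3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.standard_normal((128, 128)).astype(np.float32)
    scaled, code = quantize_block(block)
    print("ue8m0 exponent code:", code, "max |scaled|:", float(np.abs(scaled).max()))
```

Rounding the scale up to the next power of two keeps the scaled block within the e4m3 range, which is the property the exponent-only encoding is meant to preserve.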
config.json CHANGED
@@ -41,7 +41,8 @@
     "weight_block_size": [
       128,
       128
-    ]
+    ],
+    "scale_fmt": "ue8m0"
   },
   "rms_norm_eps": 1e-06,
   "rope_scaling": {