msr2000 committed
Commit d3d4eaf · 1 Parent(s): c230953

Update ue8m0 config & description

Files changed (2):
  1. README.md +8 -1
  2. config.json +2 -1
README.md CHANGED
@@ -52,7 +52,9 @@ DeepSeek-V3.1 is a hybrid model that supports both thinking mode and non-thinking mode.
 
 - **Higher thinking efficiency**: DeepSeek-V3.1-Think achieves comparable answer quality to DeepSeek-R1-0528, while responding more quickly.
 
-DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens. Additionally, DeepSeek-V3.1 is trained using the UE8M0 FP8 scale data format to ensure compatibility with microscaling data formats.
+DeepSeek-V3.1 is post-trained on top of DeepSeek-V3.1-Base, which is built upon the original V3 base checkpoint through a two-phase long context extension approach, following the methodology outlined in the original DeepSeek-V3 report. We have expanded our dataset by collecting additional long documents and substantially extending both training phases. The 32K extension phase has been increased 10-fold to 630B tokens, while the 128K extension phase has been extended by 3.3x to 209B tokens.
+
+Additionally, DeepSeek-V3.1 is trained using the **UE8M0 FP8 scale data format on both model weights and activations** to ensure compatibility with microscaling data formats. Please refer to [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM) for more details.
 
 ## Model Downloads
 
@@ -196,6 +198,11 @@ tokenizer.apply_chat_template(messages, tokenize=False, thinking=False, add_generation_prompt=True)
 
 The model structure of DeepSeek-V3.1 is the same as DeepSeek-V3. Please visit the [DeepSeek-V3](https://github.com/deepseek-ai/DeepSeek-V3) repo for more information about running this model locally.
 
+**Usage Recommendations:**
+
+1. **The `mlp.gate.e_score_correction_bias` parameters should be loaded and computed in FP32 precision.**
+2. **Ensure that FP8 model weights and activations are formatted using the UE8M0 scale format.**
+
 ## License
 
 This repository and the model weights are licensed under the [MIT License](LICENSE).
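
For readers unfamiliar with the format this commit documents: UE8M0 is an exponent-only scale encoding in which each block scale is a power of two and only its 8-bit biased exponent is stored, in the spirit of the OCP microscaling (MX) formats. The sketch below shows how such a scale might be chosen for one 128x128 FP8 (e4m3) block; the helper names, the e4m3 constants, and the rounding choice are assumptions for illustration, not DeepGEMM's or this repository's actual implementation.

```python
import math

import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite e4m3 value (assumed target FP8 format)
E8M0_BIAS = 127       # ue8m0 stores only a biased 8-bit exponent, no mantissa


def ue8m0_scale(block: np.ndarray) -> tuple[int, float]:
    """Pick a power-of-two scale for one block and encode its exponent as ue8m0."""
    amax = float(np.max(np.abs(block)))
    if amax == 0.0:
        exponent = 0
    else:
        # Round the ideal scale amax / FP8_E4M3_MAX up to the next power of two
        # so the scaled block is guaranteed to fit in the fp8 range.
        exponent = math.ceil(math.log2(amax / FP8_E4M3_MAX))
    encoded = exponent + E8M0_BIAS  # the uint8 that would be stored
    return encoded, 2.0 ** exponent


def quantize_block(block: np.ndarray) -> tuple[np.ndarray, int]:
    """Scale a 128x128 block for fp8 storage, returning scaled values and the ue8m0 code."""
    code, scale = ue8m0_scale(block)
    scaled = np.clip(block / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return scaled, code  # `scaled` would then be cast to fp8 (e4m3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    block = rng.standard_normal((128, 128)).astype(np.float32)
    scaled, code = quantize_block(block)
    print("ue8m0 exponent code:", code, "max |scaled|:", float(np.abs(scaled).max()))
```

Rounding the scale up to the next power of two keeps the scaled block within the e4m3 range, which is the property the exponent-only encoding is meant to preserve.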
config.json CHANGED
@@ -41,7 +41,8 @@
     "weight_block_size": [
       128,
       128
-    ]
+    ],
+    "scale_fmt": "ue8m0"
   },
   "rms_norm_eps": 1e-06,
   "rope_scaling": {