Upload README.md with huggingface_hub
README.md
CHANGED
@@ -1,6 +1,6 @@
 ---
 base_model:
-- tencent/Hunyuan-4B-
+- tencent/Hunyuan-4B-Pretrain
 library_name: transformers
 ---
 
@@ -14,7 +14,7 @@ library_name: transformers
 
 <p align="center">
 🤗 <a href="https://huggingface.co/tencent/"><b>HuggingFace</b></a> |
-🤖 <a href="https://modelscope.cn/models/Tencent-Hunyuan/
+🤖 <a href="https://modelscope.cn/models/Tencent-Hunyuan/"><b>ModelScope</b></a> |
 🪡 <a href="https://github.com/Tencent/AngelSlim/tree/main"><b>AngelSlim</b></a>
 </p>
 
@@ -25,10 +25,10 @@ library_name: transformers
 </p>
 
 <p align="center">
-<a href="https://github.com/Tencent-Hunyuan/
-<a href="https://cnb.cool/tencent/hunyuan/
-<a href="https://github.com/Tencent-Hunyuan/Hunyuan-
-<a href="https://raw.githubusercontent.com/Tencent-Hunyuan/Hunyuan-A13B/main/assets/1751881231452.jpg"><b>WeChat</b></a> |
+<a href="https://github.com/Tencent-Hunyuan/"><b>GITHUB</b></a> |
+<a href="https://cnb.cool/tencent/hunyuan/"><b>cnb.cool</b></a> |
+<a href="https://github.com/Tencent-Hunyuan/Hunyuan-4B/blob/main/LICENSE"><b>LICENSE</b></a> |
+<a href="https://raw.githubusercontent.com/Tencent-Hunyuan/Hunyuan-A13B/main/assets/1751881231452.jpg"><b>WeChat</b></a> |
 <a href="https://discord.gg/bsPcMEtV7v"><b>Discord</b></a>
 </p>
 
@@ -52,7 +52,7 @@ We have released a series of Hunyuan dense models, comprising both pre-trained a
 
 ## Benchmark
 
-Note: The following benchmarks are evaluated by TRT-LLM-backend on several **base models**.
+Note: The following benchmarks are evaluated with the TRT-LLM backend on several **base models**.
 
 | Model | Hunyuan-0.5B-Pretrain | Hunyuan-1.8B-Pretrain | Hunyuan-4B-Pretrain | Hunyuan-7B-Pretrain |
 |:------------------:|:---------------:|:--------------:|:-------------:|:---------------:|
@@ -90,7 +90,7 @@ First, please install transformers. We will merge it into the main branch later.
 ```SHELL
 pip install git+https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca
 ```
-Our model defaults to using slow-thinking reasoning, and there are two ways to disable CoT reasoning.
+Our model defaults to slow-thinking reasoning; there are two ways to disable CoT reasoning.
 1. Pass **"enable_thinking=False"** when calling apply_chat_template.
 2. Adding **"/no_think"** before the prompt will force the model not to perform CoT reasoning. Similarly, adding **"/think"** before the prompt will force the model to perform CoT reasoning.
 
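For quick reference, here is a minimal sketch of the two switches described in this hunk, reusing the `apply_chat_template` call from the README's own quickstart; the model ID and prompt are illustrative.

```python
# Minimal sketch of the two CoT switches described above; the model ID and
# prompt are illustrative, not prescribed by the README.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tencent/Hunyuan-7B-Instruct")
messages = [{"role": "user", "content": "Why is seawater salty?"}]

# Option 1: disable thinking via the chat-template flag.
no_cot = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_tensors="pt", enable_thinking=False,
)

# Option 2: prefix the prompt with /no_think (or /think to force CoT).
messages[0]["content"] = "/no_think " + messages[0]["content"]
no_cot_prefixed = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
)
```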
@@ -113,7 +113,7 @@ messages = [
 tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
     enable_thinking=True  # Toggle thinking mode (default: True)
 )
-
+
 outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
 
 output_text = tokenizer.decode(outputs[0])
@@ -274,7 +274,7 @@ We use FP8-static quantization, FP8 quantization adopts 8-bit floating point for
 ### Int4 Quantization
 We use the GPTQ and AWQ algorithms to achieve W4A16 quantization.
 
-GPTQ processes the model weights layer by layer, uses a small amount of calibration data to minimize the reconfiguration error of the quantized weights, and adjusts the weights layer by layer by the optimization process of approximating the Hessian inverse matrix. The process eliminates the need to retrain the model and requires only a small amount of calibration data to quantize the weights, improving inference efficiency and lowering the deployment threshold.
+GPTQ processes the model weights layer by layer, using a small amount of calibration data to minimize the reconstruction error of the quantized weights via an optimization process that approximates the inverse Hessian matrix. This eliminates the need to retrain the model and requires only a small amount of calibration data, improving inference efficiency and lowering the deployment threshold.
 AWQ uses a small amount of calibration data (no training required) to estimate the magnitude of the activation values. For each weight channel, a scaling coefficient s is then computed to expand the numerical range of important weights, allowing more information to be retained during quantization.
 
 You can quantize the model yourself with [AngelSlim](https://github.com/tencent/AngelSlim), or directly download and use our pre-quantized open-source models ([LINK](https://huggingface.co/)).
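To make the AWQ description in this hunk concrete, here is a toy NumPy sketch of the scaling idea: channels deemed important by their activation magnitude are scaled up before 4-bit rounding so they lose less precision, and the scale is folded back afterwards. This illustrates the principle only; it is not AngelSlim's implementation.

```python
# Toy NumPy sketch of the AWQ-style scaling idea described above; an
# illustration of the principle, not AngelSlim's actual implementation.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))              # weights of one linear layer
act_scale = np.abs(rng.normal(size=8))   # per-channel activation magnitudes
                                         # measured on calibration data

s = act_scale ** 0.5                     # per-channel scaling coefficient s

def int4_round_trip(w):
    """Symmetric round-to-nearest 4-bit quantization, then dequantization."""
    step = np.abs(w).max() / 7           # symmetric int4 range is [-7, 7]
    return np.clip(np.round(w / step), -7, 7) * step

W_plain = int4_round_trip(W)             # naive quantization
W_awq = int4_round_trip(W * s) / s       # scale up, quantize, fold s back

# Compare activation-weighted reconstruction error of the two schemes.
print("plain:", np.abs((W_plain - W) * act_scale).mean())
print("awq  :", np.abs((W_awq - W) * act_scale).mean())
```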
@@ -296,19 +296,19 @@ This subsection describes the Benchmark metrics for the Hunyuan quantitative mod
 
 For deployment, you can use frameworks such as **TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.
 
-image: https://hub.docker.com/r/hunyuaninfer/hunyuan-7B/tags
+image: https://hub.docker.com/r/hunyuaninfer/hunyuan-7B/tags
 
 
 ### TensorRT-LLM
 
-#### Docker Image
+#### Docker Image
 
 We provide a pre-built Docker image based on the latest version of TensorRT-LLM.
 
 We use tencent/Hunyuan-7B-Instruct as an example.
 - To get started:
 
-https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
+https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
 
 ```
 docker pull hunyuaninfer/hunyuan-7B:hunyuan-moe-7B-trtllm
@@ -359,14 +359,14 @@ trtllm-serve \
 Please use vLLM version v0.10.0 or higher for inference.
 
 We use tencent/Hunyuan-7B-Instruct as an example.
-- Download Model file:
+- Download the model files:
 - Hugging Face: downloaded automatically by vLLM.
 - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-7B-Instruct`
-
+
 - Model downloaded from Hugging Face:
 ```shell
 export MODEL_PATH=tencent/Hunyuan-7B-Instruct
-```
+```
 
 - Model downloaded from ModelScope:
 ```shell
@@ -386,7 +386,7 @@ python3 -m vllm.entrypoints.openai.api_server \
     --quantization experts_int8 \
     --served-model-name hunyuan \
     2>&1 | tee log_server.txt
-```
+```
 - After the service script runs successfully, run the request script:
 ```shell
 curl http://0.0.0.0:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
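The same request can also be made with the OpenAI Python client instead of curl. A minimal sketch, assuming the server launched by the script above is listening on port 8000 with served model name `hunyuan`:

```python
# Minimal sketch using the OpenAI Python client against the server started
# above; base_url and model name follow the vLLM launch script in this hunk.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="hunyuan",  # matches --served-model-name hunyuan
    messages=[{"role": "user", "content": "Why is seawater salty?"}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```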
@@ -474,7 +474,7 @@ python3 -m vllm.entrypoints.openai.api_server \
 
 ### SGLang
 
-#### Docker Image
+#### Docker Image
 
 We also provide a pre-built Docker image based on the latest version of SGLang.
 
@@ -504,4 +504,4 @@ docker run --entrypoint="python3" --gpus all \
 
 ## Contact Us
 
-If you would like to leave a message for our R&D and product teams, Welcome to contact our open-source team . You can also contact us via email ([email protected]).
+If you would like to leave a message for our R&D and product teams, you are welcome to contact our open-source team. You can also reach us via email ([email protected]).