manaestras committed (verified)
Commit cb0c252 · 1 Parent(s): ff254ae

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +22 -20
README.md CHANGED
@@ -1,3 +1,8 @@
+ ---
+ base_model:
+ - tencent/Hunyuan-4B-Pretrain
+ library_name: transformers
+ ---
 
 
 
@@ -9,7 +14,7 @@
 
 <p align="center">
 🤗&nbsp;<a href="https://huggingface.co/tencent/"><b>HuggingFace</b></a>&nbsp;|&nbsp;
- 🤖&nbsp;<a href="https://modelscope.cn/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct"><b>ModelScope</b></a>&nbsp;|&nbsp;
+ 🤖&nbsp;<a href="https://modelscope.cn/models/Tencent-Hunyuan/"><b>ModelScope</b></a>&nbsp;|&nbsp;
 🪡&nbsp;<a href="https://github.com/Tencent/AngelSlim/tree/main"><b>AngelSlim</b></a>
 </p>
 
@@ -20,15 +25,13 @@
 </p>
 
 <p align="center">
- <a href="https://github.com/Tencent-Hunyuan/Hunyuan-7B"><b>GITHUB</b></a> |
- <a href="https://cnb.cool/tencent/hunyuan/Hunyuan-7B"><b>cnb.cool</b></a> |
- <a href="https://github.com/Tencent-Hunyuan/Hunyuan-7B/blob/main/LICENSE"><b>LICENSE</b></a> |
- <a href="https://raw.githubusercontent.com/Tencent-Hunyuan/Hunyuan-A13B/main/assets/1751881231452.jpg"><b>WeChat</b></a> |
+ <a href="https://github.com/Tencent-Hunyuan/"><b>GITHUB</b></a> |
+ <a href="https://cnb.cool/tencent/hunyuan/"><b>cnb.cool</b></a> |
+ <a href="https://github.com/Tencent-Hunyuan/Hunyuan-0.5B/blob/main/LICENSE"><b>LICENSE</b></a> |
+ <a href="https://raw.githubusercontent.com/Tencent-Hunyuan/Hunyuan-A13B/main/assets/1751881231452.jpg"><b>WeChat</b></a> |
 <a href="https://discord.gg/bsPcMEtV7v"><b>Discord</b></a>
 </p>
 
-
-
 ## Model Introduction
 
 Hunyuan is Tencent's open-source efficient large language model series, designed for versatile deployment across diverse computational environments. From edge devices to high-concurrency production systems, these models deliver optimal performance with advanced quantization support and ultra-long context capabilities.
@@ -47,10 +50,9 @@ We have released a series of Hunyuan dense models, comprising both pre-trained a
 <br>
 
 
-
 ## Benchmark
 
- Note: The following benchmarks were evaluated with the TRT-LLM backend on several **base models**.
+ Note: The following benchmarks were evaluated with the TRT-LLM backend on several **base models**.
 
 | Model | Hunyuan-0.5B-Pretrain | Hunyuan-1.8B-Pretrain | Hunyuan-4B-Pretrain | Hunyuan-7B-Pretrain |
 |:------------------:|:---------------:|:--------------:|:-------------:|:---------------:|
@@ -88,7 +90,7 @@ First, please install transformers. We will merge it into the main branch later.
 ```SHELL
 pip install git+https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca
 ```
- Our model defaults to slow-thinking reasoning; there are two ways to disable CoT reasoning.
+ Our model defaults to slow-thinking reasoning; there are two ways to disable CoT reasoning.
 1. Pass **"enable_thinking=False"** when calling apply_chat_template.
 2. Adding **"/no_think"** before the prompt will force the model not to perform CoT reasoning. Similarly, adding **"/think"** before the prompt will force the model to perform CoT reasoning.
 
@@ -111,7 +113,7 @@ messages = [
 tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
     enable_thinking=True  # Toggle thinking mode (default: True)
 )
-
+
 outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=2048)
 
 output_text = tokenizer.decode(outputs[0])
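The snippet above runs with thinking enabled. As a minimal sketch of the two switches described earlier (reusing the `tokenizer` and `model` objects from that snippet, with an illustrative prompt), disabling CoT reasoning looks like this:

```python
# Minimal sketch: the two ways to disable CoT reasoning described above.
# Assumes `tokenizer` and `model` are the objects loaded in the snippet above;
# the prompt text is purely illustrative.

# Option 1: disable thinking via the chat-template kwarg.
messages = [{"role": "user", "content": "Write a short summary of the benefits of regular exercise."}]
tokenized_chat = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=False,  # skip the slow-thinking / CoT block
)
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

# Option 2: prefix the prompt with /no_think (use /think to force CoT instead).
messages = [{"role": "user", "content": "/no_think Write a short summary of the benefits of regular exercise."}]
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
```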
@@ -272,7 +274,7 @@ We use FP8-static quantization, FP8 quantization adopts 8-bit floating point for
 ### Int4 Quantization
 We use the GPTQ and AWQ algorithms to achieve W4A16 quantization.
 
- GPTQ processes the model weights layer by layer, using a small amount of calibration data to minimize the reconstruction error of the quantized weights; the weights are adjusted layer by layer through an optimization procedure that approximates the inverse Hessian. The process eliminates the need to retrain the model, requires only a small amount of calibration data to quantize the weights, improves inference efficiency, and lowers the deployment threshold.
+ GPTQ processes the model weights layer by layer, using a small amount of calibration data to minimize the reconstruction error of the quantized weights; the weights are adjusted layer by layer through an optimization procedure that approximates the inverse Hessian. The process eliminates the need to retrain the model, requires only a small amount of calibration data to quantize the weights, improves inference efficiency, and lowers the deployment threshold.
 AWQ uses a small amount of calibration data (no training required) to measure the statistics of the activation magnitudes. For each weight channel, a scaling coefficient s is computed to expand the numerical range of important weights so that more information is retained during quantization.
 
 You can quantize the model with [AngelSlim](https://github.com/tencent/AngelSlim), or directly download and use our pre-quantized open-source models: [LINK](https://huggingface.co/).
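AngelSlim is the quantization tool the README documents. Purely as an illustrative alternative, the same layer-by-layer, calibration-driven W4A16 GPTQ flow can be sketched with transformers' `GPTQConfig`; this assumes the `optimum` and `auto-gptq` packages are installed and uses `tencent/Hunyuan-7B-Instruct` only as an example model id:

```python
# Illustrative sketch only: W4A16 GPTQ quantization via transformers' GPTQConfig
# (requires optimum and auto-gptq). Not the AngelSlim workflow described above,
# but it mirrors the calibration-based, layer-by-layer idea.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "tencent/Hunyuan-7B-Instruct"  # example id, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A small calibration dataset drives the per-layer reconstruction-error minimization.
gptq_config = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Save the int4 weights alongside the tokenizer for later inference.
quantized_model.save_pretrained("Hunyuan-7B-Instruct-GPTQ-Int4")
tokenizer.save_pretrained("Hunyuan-7B-Instruct-GPTQ-Int4")
```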
@@ -294,19 +296,19 @@ This subsection describes the Benchmark metrics for the Hunyuan quantitative mod
 
 For deployment, you can use frameworks such as **TensorRT-LLM**, **vLLM**, or **SGLang** to serve the model and create an OpenAI-compatible API endpoint.
 
- image: https://hub.docker.com/r/hunyuaninfer/hunyuan-7B/tags
+ image: https://hub.docker.com/r/hunyuaninfer/hunyuan-7B/tags
 
 
 ### TensorRT-LLM
 
- #### Docker Image
+ #### Docker Image
 
 We provide a pre-built Docker image based on the latest version of TensorRT-LLM.
 
 We use tencent/Hunyuan-7B-Instruct as an example.
 - To get started:
 
- https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
+ https://hub.docker.com/r/hunyuaninfer/hunyuan-large/tags
 
 ```
 docker pull hunyuaninfer/hunyuan-7B:hunyuan-moe-7B-trtllm
@@ -357,14 +359,14 @@ trtllm-serve \
 Please use vLLM version v0.10.0 or higher for inference.
 
 We use tencent/Hunyuan-7B-Instruct as an example.
- - Download the model files:
+ - Download the model files:
   - Hugging Face: downloaded automatically by vLLM.
   - ModelScope: `modelscope download --model Tencent-Hunyuan/Hunyuan-7B-Instruct`
-
+
 - Model downloaded from Hugging Face:
 ```shell
 export MODEL_PATH=tencent/Hunyuan-7B-Instruct
- ```
+ ```
 
 - Model downloaded from ModelScope:
 ```shell
@@ -384,7 +386,7 @@ python3 -m vllm.entrypoints.openai.api_server \
 --quantization experts_int8 \
 --served-model-name hunyuan \
 2>&1 | tee log_server.txt
- ```
+ ```
 - After the service script has started successfully, run the request script:
 ```shell
 curl http://0.0.0.0:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
@@ -472,7 +474,7 @@ python3 -m vllm.entrypoints.openai.api_server \
 
 ### SGLang
 
- #### Docker Image
+ #### Docker Image
 
 We also provide a pre-built Docker image based on the latest version of SGLang.
 
 
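Once any of these servers is up, the OpenAI-compatible endpoint can also be queried with the standard `openai` Python client. A minimal sketch, assuming the settings used in the commands above (port 8000, `--served-model-name hunyuan`) and a placeholder API key:

```python
# Minimal sketch: querying the OpenAI-compatible endpoint served by vLLM,
# SGLang, or trtllm-serve. Assumes port 8000 and --served-model-name hunyuan
# as in the commands above; the API key is a placeholder, since a local server
# does not check it unless configured to.
from openai import OpenAI

client = OpenAI(base_url="http://0.0.0.0:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="hunyuan",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```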