---
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- code
- codeqwen
- chat
- qwen
- qwen-coder
- fp8
- llm-compressor
- compressed-tensors
- vllm
base_model:
- Qwen/Qwen2.5-Coder-14B-Instruct
---
## Model Overview
- **Model Architecture:** Qwen2ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 11/28/2024
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [Qwen/Qwen2.5-Coder-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [Qwen/Qwen2.5-Coder-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-14B-Instruct) to the FP8 data type.
This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
Only the weights and activations of the linear operators within transformer blocks are quantized.

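As a rough back-of-envelope check of that claim, the sketch below estimates the weight footprint at 16-bit versus 8-bit precision. The 14.7B parameter count is approximate, and unquantized layers (embeddings, `lm_head`) are ignored.

```python
# Back-of-envelope estimate of the weight-memory saving (rough sketch;
# the parameter count is approximate and embeddings/lm_head stay at 16 bits).
num_params = 14.7e9

bf16_gb = num_params * 2 / 1e9   # 16-bit weights: 2 bytes per parameter
fp8_gb = num_params * 1 / 1e9    # 8-bit weights: 1 byte per parameter

print(f"BF16 weights: ~{bf16_gb:.0f} GB")
print(f"FP8 weights:  ~{fp8_gb:.0f} GB (~{100 * (1 - fp8_gb / bf16_gb):.0f}% smaller)")
```
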
## Deployment

### Use with vLLM

1. Initialize vLLM server:
```
vllm serve RedHatAI/Qwen2.5-Coder-14B-Instruct-FP8-dynamic
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Qwen2.5-Coder-14B-Instruct-FP8-dynamic"

# chat.completions.create expects a flat list of message dicts.
messages = [
    {"role": "user", "content": "Write a quick sort algorithm."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```

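vLLM can also run the model offline through its Python API. The snippet below is a minimal sketch rather than part of the original instructions: it assumes a local GPU with vLLM installed, builds the prompt with the tokenizer's chat template, and uses illustrative sampling parameters.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "RedHatAI/Qwen2.5-Coder-14B-Instruct-FP8-dynamic"

# Build the prompt using the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [{"role": "user", "content": "Write a quick sort algorithm."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

# Illustrative sampling parameters; adjust for your use case.
sampling_params = SamplingParams(temperature=0.2, max_tokens=512)

llm = LLM(model=model_id)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
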
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "Qwen/Qwen2.5-Coder-14B-Instruct"
model_name = model_stub.split("/")[-1]

model = AutoModelForCausalLM.from_pretrained(model_stub, dtype="auto")

tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme
recipe = QuantizationModifier(
    ignore=["lm_head"],
    targets="Linear",
    scheme="FP8_dynamic",
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-FP8-dynamic"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
</details>
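
As a quick sanity check (not part of the original recipe), one can confirm that the saved checkpoint carries a compressed-tensors quantization config; the path below assumes the `save_path` produced by the snippet above.

```python
from transformers import AutoConfig

# Inspect the saved checkpoint's quantization metadata (sketch; path assumed).
config = AutoConfig.from_pretrained("Qwen2.5-Coder-14B-Instruct-FP8-dynamic")
print(getattr(config, "quantization_config", None))
```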