---
license: apache-2.0
tags:
- qwen
- qwen2
- fp8
- quantization
- llm-compressor
- vllm
- code-generation
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-Coder-32B-Instruct
---

# Qwen2.5-Coder-32B-Instruct-FP8-dynamic

This is a version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) quantized to FP8 (weights and dynamic activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

This format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).
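
If you are unsure whether a GPU meets this requirement, a quick check with PyTorch (assuming a CUDA-enabled `torch` install) looks like this:

```python
# Optional sanity check: native FP8 kernels need compute capability >= 8.9.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"GPU: {torch.cuda.get_device_name()} (compute capability {major}.{minor})")
print("Native FP8 support:", (major, minor) >= (8, 9))
```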

## Model Description

Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the `lm_head` layer kept in its original precision to maintain output quality.
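
As a quick optional check (not part of the original conversion), the quantization metadata that llm-compressor writes into the checkpoint's `config.json` can be inspected to confirm the FP8 scheme and that `lm_head` is ignored; the exact keys depend on the llm-compressor / compressed-tensors version used:

```python
# Sketch: print the quantization_config stored in this quantized checkpoint.
# Assumes config.json contains a quantization_config entry, as llm-compressor /
# compressed-tensors checkpoints typically do.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    trust_remote_code=True,
)
print(config.quantization_config)  # scheme, targets, and ignored modules (lm_head)
```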

## Quantization with llm-compressor

The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme. No calibration dataset is required, since activation scales are computed dynamically (per token) at inference time.

The following script was used for conversion:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# --- 1. Set the model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"

# --- 2. Load model and tokenizer using Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# --- 3. Define the FP8 dynamic quantization recipe ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")

# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt}
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)

input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(f"Generated Response:\n{response}")
print("==========================================")


# --- 5. Save the quantized model and the tokenizer ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)

print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)

print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
```


## Inference Example

This model can be loaded and run with `transformers` or, for optimized FP8 inference, with [vLLM](https://github.com/vllm-project/vllm/).

### Using `transformers` (functional check, not FP8-optimized)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID, 
    device_map="auto", 
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True
)

prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt}
]

# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the input and move to the device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024, 
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)
```

### Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs.

Prerequisites:
- A recent version of vLLM that supports compressed-tensors checkpoints.
- A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer).
- Docker and the NVIDIA Container Toolkit installed.

Running with Docker (recommended):

The following command starts a vLLM OpenAI-compatible server with this quantized model:
```bash
# 1. Set your Hugging Face Token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
# Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 \
    -e HF_TOKEN="$HF_TOKEN" \
    vllm/vllm-openai:latest \
    --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
    --tokenizer-mode auto \
    --load-format auto \
    --trust-remote-code \
    --max-model-len 4096 # Optional: Adjust based on your VRAM
```

Once running, the server exposes an OpenAI-compatible API at `http://localhost:8000/v1/`. You can use any OpenAI client library (e.g., the `openai` package for Python) or `curl` to send requests.
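
For example, a minimal request with the `openai` Python client (a sketch that assumes the Docker command above is serving on localhost:8000; the API key is a dummy value because the server does not require authentication unless one was configured at startup):

```python
# Minimal chat completion request against the local vLLM OpenAI-compatible server.
from openai import OpenAI

# Dummy key; the server accepts any key unless one was set when it was started.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```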

## Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)
For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct