---
license: apache-2.0
tags:
- qwen
- qwen2
- fp8
- quantization
- llm-compressor
- vllm
- code-generation
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-Coder-32B-Instruct
---
# Qwen2.5-Coder-32B-Instruct-FP8-dynamic
This is a version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) quantized to FP8 (weights and dynamic activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).
This model format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).
## Model Description
Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. In this version, the weights are quantized to FP8 statically (per channel) and the activations dynamically (per token), while the `lm_head` layer is kept in its original precision to preserve output quality.
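As a rough intuition for the "dynamic" part: at inference time each token's activation row receives its own scale, derived from its absolute maximum and mapped into the FP8 E4M3 range (finite maximum 448). The sketch below is purely illustrative plain PyTorch, not the llm-compressor internals:
```python
import torch

# Illustrative only: simulate dynamic per-token FP8 (E4M3) activation quantization.
x = torch.randn(4, 8)                          # fake activations: 4 tokens, 8 features
amax = x.abs().amax(dim=-1, keepdim=True)      # per-token absolute maximum
scale = amax / 448.0                           # 448 is the finite max of float8_e4m3fn
x_fp8 = (x / scale).to(torch.float8_e4m3fn)    # quantize each token with its own scale
x_deq = x_fp8.to(torch.float32) * scale        # dequantize to inspect the rounding error
print("max abs error:", (x - x_deq).abs().max().item())
```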
## Quantization with llm-compressor
The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme.
No calibration dataset is required for this scheme: weight scales are computed directly from the weights, and activation scales are computed per token at runtime.
The following script was used for conversion:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
import os
# --- 1. Set the new Model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"
# --- 2. Load model and tokenizer using Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)
# --- 3. The quantization recipe remains the same ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")
# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt}
]
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(f"Generated Response:\n{response}")
print("==========================================")
# --- 5. Save the quantized model and the tokenizer correctly ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)
print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)
print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
```
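After saving, you can sanity-check that the quantization metadata was written as expected. A quick, hedged look at the saved `config.json` (field names follow the compressed-tensors format and may vary slightly across versions):
```python
import json
import os

SAVE_DIR = "Qwen2.5-Coder-32B-Instruct-FP8-Dynamic"  # same SAVE_DIR as in the script above

# The saved config.json should contain a compressed-tensors `quantization_config`
# whose ignore list includes "lm_head".
with open(os.path.join(SAVE_DIR, "config.json")) as f:
    cfg = json.load(f)

qcfg = cfg.get("quantization_config", {})
print("quant_method:", qcfg.get("quant_method"))
print("ignored layers:", qcfg.get("ignore"))
```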
## Inference Example
This model can be loaded and run with `transformers`, or for optimized FP8 inference, with [vLLM](https://github.com/vllm-project/vllm/).
### Using `transformers` (for functional checking, not FP8 optimized)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"
# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True
)
prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt}
]
# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Tokenize the input and move to the device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)
# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)
# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)
```
### Using vLLM (for optimized FP8 inference)
This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs.
**Prerequisites:**
- A recent version of vLLM with support for the compressed-tensors format.
- A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer; a quick check is sketched after this list).
- Docker and the NVIDIA Container Toolkit installed.
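If you are unsure whether your GPU qualifies, here is a minimal check with PyTorch (assuming a local Python environment with CUDA-enabled `torch`):
```python
import torch

# FP8 (E4M3) kernels require compute capability >= 8.9 (Ada Lovelace and newer).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability {major}.{minor} -> FP8 supported: {(major, minor) >= (8, 9)}")
```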
**Running with Docker (recommended):**
The following command starts a vLLM OpenAI-compatible server with this quantized model:
```bash
# 1. Set your Hugging Face Token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
# 2. Run the vLLM Docker container.
# Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
  --tokenizer-mode auto \
  --load-format auto \
  --trust-remote-code \
  --max-model-len 4096  # Optional: adjust based on your VRAM
```
Once running, the server exposes an OpenAI-compatible API at `http://localhost:8000/v1`. You can use any OpenAI client library (e.g., the `openai` Python package) or `curl` to send requests.
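For example, with the `openai` Python package (v1 client API); the model name must match the `--model` argument passed to vLLM, and the API key can be any placeholder unless the server enforces one:
```python
from openai import OpenAI

# Point the client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```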
## Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)
For more details on the base model, its capabilities, and licensing, please refer to the original model card: [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)