---
license: apache-2.0
tags:
- qwen
- qwen2
- fp8
- quantization
- llm-compressor
- vllm
- code-generation
pipeline_tag: text-generation
base_model:
- Qwen/Qwen2.5-Coder-32B-Instruct
---

# Qwen2.5-Coder-32B-Instruct-FP8-dynamic

This is a version of [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct) quantized to FP8 (weights and dynamic activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).

This format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability >= 8.9 (Ada Lovelace, Hopper, Blackwell, or newer).
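
If you are unsure whether a GPU meets this requirement, a quick check with PyTorch (assuming a CUDA-enabled `torch` install) looks like this:

```python
# Optional sanity check: native FP8 kernels need compute capability >= 8.9.
import torch

major, minor = torch.cuda.get_device_capability()
print(f"GPU: {torch.cuda.get_device_name()} (compute capability {major}.{minor})")
print("Native FP8 support:", (major, minor) >= (8, 9))
```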

## Model Description

Qwen2.5-Coder-32B-Instruct is a state-of-the-art large language model from Alibaba Cloud, specialized for coding tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the `lm_head` layer kept in its original precision to maintain output quality.
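
As a quick optional check (not part of the original conversion), the quantization metadata that llm-compressor writes into the checkpoint's `config.json` can be inspected to confirm the FP8 scheme and that `lm_head` is ignored; the exact keys depend on the llm-compressor / compressed-tensors version used:

```python
# Sketch: print the quantization_config stored in this quantized checkpoint.
# Assumes config.json contains a quantization_config entry, as llm-compressor /
# compressed-tensors checkpoints typically do.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    trust_remote_code=True,
)
print(config.quantization_config)  # scheme, targets, and ignored modules (lm_head)
```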

## Quantization with llm-compressor

The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme. No calibration dataset is required, since activation scales are computed dynamically (per token) at inference time.

The following script was used for conversion:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# --- 1. Set the model ID ---
MODEL_ID = "Qwen/Qwen2.5-Coder-32B-Instruct"

# --- 2. Load model and tokenizer using Auto classes ---
print(f"Loading model: {MODEL_ID}...")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
)
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# --- 3. Define the FP8 dynamic quantization recipe ---
print("Configuring FP8 quantization recipe...")
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply quantization. This step can take some time.
print("Applying one-shot quantization...")
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
print("Quantization complete.")

# --- 4. Confirm generation with the Qwen chat template ---
print("\n========== SAMPLE GENERATION ==============")
prompt = "Write a Python function for a quicksort algorithm. Include comments to explain the logic."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
    {"role": "user", "content": prompt}
]

input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

output_ids = model.generate(
    **model_inputs,
    max_new_tokens=256,
)

input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print(f"Generated Response:\n{response}")
print("==========================================")


# --- 5. Save the quantized model and the tokenizer ---
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
print(f"\nSaving quantized model to {SAVE_DIR}...")
model.save_pretrained(SAVE_DIR)

print(f"Saving tokenizer to {SAVE_DIR}...")
tokenizer.save_pretrained(SAVE_DIR)

print(f"\nModel and tokenizer saved successfully to '{SAVE_DIR}'")
```


## Inference Example

This model can be loaded and run with `transformers` or, for optimized FP8 inference, with [vLLM](https://github.com/vllm-project/vllm/).

### Using `transformers` (functional check, not FP8-optimized)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_REPO_ID = "textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic"

# For Qwen models, it is recommended to use trust_remote_code=True
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO_ID, 
    device_map="auto", 
    torch_dtype="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_REPO_ID,
    trust_remote_code=True
)

prompt = "Write a complete and efficient implementation of the merge sort algorithm in Rust."
messages = [
    {"role": "system", "content": "You are a helpful assistant specialized in writing high-quality Rust code."},
    {"role": "user", "content": prompt}
]

# Apply the chat template to format the prompt correctly
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the input and move to the device
model_inputs = tokenizer([input_text], return_tensors="pt").to(model.device)

# Generate output
output_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024, 
    do_sample=True,
    temperature=0.6,
    top_p=0.9
)

# Decode only the newly generated tokens
input_token_len = model_inputs.input_ids.shape[1]
generated_tokens = output_ids[0, input_token_len:]
response = tokenizer.decode(generated_tokens, skip_special_tokens=True)

print("--- Prompt ---")
print(prompt)
print("\n--- Qwen Response ---")
print(response)
```

### Using vLLM (for optimized FP8 inference)

This model, quantized to FP8 with llm-compressor, is designed for efficient inference with vLLM on newer NVIDIA GPUs.

Prerequisites:
- A recent version of vLLM that supports compressed-tensors checkpoints.
- A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer).
- Docker and the NVIDIA Container Toolkit installed.

Running with Docker (recommended):

The following command starts a vLLM OpenAI-compatible server with this quantized model:
```bash
# 1. Set your Hugging Face Token (optional, but recommended)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
# Replace 'vllm/vllm-openai:latest' with a recent official build.
sudo docker run --gpus all \
    -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
    -p 8000:8000 \
    -e HF_TOKEN="$HF_TOKEN" \
    vllm/vllm-openai:latest \
    --model textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic \
    --tokenizer-mode auto \
    --load-format auto \
    --trust-remote-code \
    --max-model-len 4096 # Optional: Adjust based on your VRAM
```

Once running, the server exposes an OpenAI-compatible API at `http://localhost:8000/v1/`. You can use any OpenAI client library (e.g., the `openai` package for Python) or `curl` to send requests.
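
For example, a minimal request with the `openai` Python client (a sketch that assumes the Docker command above is serving on localhost:8000; the API key is a dummy value because the server does not require authentication unless one was configured at startup):

```python
# Minimal chat completion request against the local vLLM OpenAI-compatible server.
from openai import OpenAI

# Dummy key; the server accepts any key unless one was set when it was started.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="textgeflecht/Qwen2.5-Coder-32B-Instruct-FP8-dynamic",
    messages=[
        {"role": "system", "content": "You are a helpful assistant specialized in writing code."},
        {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```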

## Original Model Card (Qwen/Qwen2.5-Coder-32B-Instruct)
For more details on the base model, its capabilities, and licensing, please refer to the original model card: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct