nm-research committed dbf5e15 (verified) · parent: 9e294a1

Create README.md

Files changed (1): README.md added (+293 lines)

---
tags:
- w4a16
- int4
- vllm
- audio
license: apache-2.0
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
language:
- en
base_model: openai/whisper-large-v3
library_name: transformers
---

# whisper-large-v3-quantized.w4a16

## Model Overview
- **Model Architecture:** whisper-large-v3
  - **Input:** Audio-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** FP16
- **Release Date:** 04/16/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3).

### Model Optimizations

This model was obtained by quantizing the weights of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) to the INT4 data type, ready for inference with vLLM >= 0.5.2.
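
For a quick look at how the compression is recorded, the quantization settings can be read back from the checkpoint's `config.json`. This is a minimal sketch, assuming the repository id above and that the checkpoint stores a compressed-tensors `quantization_config` entry (typical for llm-compressor outputs):

```python
from transformers import AutoConfig

# Load only the configuration of the quantized checkpoint.
config = AutoConfig.from_pretrained("neuralmagic/whisper-large-v3-quantized.w4a16")

# The compressed-tensors quantization settings are expected under this key
# (assumption: present in config.json, as llm-compressor normally writes it).
print(config.quantization_config)
```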

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/whisper-large-v3-quantized.w4a16",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
)

# prepare inputs
inputs = {  # Test explicit encoder/decoder prompt
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
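
For reference, the sketch below shows one way to query a locally served copy of this model through the OpenAI-compatible API. It assumes the server was started with `vllm serve neuralmagic/whisper-large-v3-quantized.w4a16` on the default port, that your vLLM version exposes the `/v1/audio/transcriptions` route for Whisper-style models, and that `sample.wav` is a placeholder for a local audio file; check the vLLM documentation for the exact capabilities of your version.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
# (default vLLM OpenAI-compatible endpoint and dummy API key).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Transcribe a local audio file. "sample.wav" is a placeholder path.
with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="neuralmagic/whisper-large-v3-quantized.w4a16",
        file=audio_file,
    )

print(transcription.text)
```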

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```bash
python quantize.py --model_path openai/whisper-large-v3 --save_dir output_dir --calib_size 3072 --dampening_frac 0.01 --actorder weight
```

```python
import torch
import argparse
from datasets import load_dataset
from transformers import WhisperProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration
import os
from compressed_tensors.quantization import QuantizationArgs, QuantizationType, QuantizationStrategy, ActivationOrdering, QuantizationScheme
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str)
parser.add_argument('--calib_size', type=int, default=256)
parser.add_argument('--dampening_frac', type=float, default=0.1)
parser.add_argument('--observer', type=str, default="minmax")
parser.add_argument('--actorder', type=str, default="dynamic")
parser.add_argument('--group_size', type=int, default=128)
parser.add_argument('--save_dir', type=str, required=True)

args = parser.parse_args()
model_id = args.model_path

model = TraceableWhisperForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(model_id)

# Configure the processor for the dataset task.
processor.tokenizer.set_prefix_tokens(language="en", task="transcribe")

# Select calibration dataset.
DATASET_ID = "MLCommons/peoples_speech"
DATASET_SUBSET = "test"
DATASET_SPLIT = "test"

# Select the number of samples for calibration. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = args.calib_size
MAX_SEQUENCE_LENGTH = 2048
dampening_frac = args.dampening_frac
actorder_arg = args.actorder
group_size = args.group_size

# Load dataset and preprocess.
ds = load_dataset(
    DATASET_ID,
    DATASET_SUBSET,
    split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
    trust_remote_code=True,
)

def preprocess(example):
    return {
        "array": example["audio"]["array"],
        "sampling_rate": example["audio"]["sampling_rate"],
        "text": " " + example["text"].capitalize(),
    }

ds = ds.map(preprocess, remove_columns=ds.column_names)

# Process inputs.
def process(sample):
    inputs = processor(
        audio=sample["array"],
        sampling_rate=sample["sampling_rate"],
        text=sample["text"],
        add_special_tokens=True,
        return_tensors="pt",
    )

    inputs["input_features"] = inputs["input_features"].to(dtype=model.dtype)
    inputs["decoder_input_ids"] = inputs["labels"]
    del inputs["labels"]

    return inputs

ds = ds.map(process, remove_columns=ds.column_names)

# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}

ignore = ["lm_head"]

# Recipe
recipe = GPTQModifier(
    targets="Linear",
    config_groups={
        "config_group": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type=QuantizationType.INT,
                strategy=QuantizationStrategy.GROUP,
                group_size=group_size,
                symmetric=True,
                dynamic=False,
                actorder=getattr(ActivationOrdering, actorder_arg.upper()),
            ),
        ),
    },
    sequential_targets=["WhisperEncoderLayer", "WhisperDecoderLayer"],
    ignore=["re:.*lm_head"],
    update_size=NUM_CALIBRATION_SAMPLES,
    dampening_frac=dampening_frac,
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
)

# Save the compressed model to disk.
save_name = f"{model_id.split('/')[-1]}-quantized.w4a16"
save_path = os.path.join(args.save_dir, save_name)
print("Saving model:", save_path)
model.save_pretrained(save_path, save_compressed=True)
processor.save_pretrained(save_path)
```
</details>
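
After the script finishes, it can be useful to confirm that the compressed checkpoint reloads cleanly. The snippet below is a minimal sketch, assuming the `output_dir` save path from the command above and a `transformers` installation with `compressed-tensors` support for W4A16 checkpoints:

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Path produced by the creation script above (assumption: --save_dir output_dir).
save_path = "output_dir/whisper-large-v3-quantized.w4a16"

# Reload the compressed checkpoint and its processor.
model = WhisperForConditionalGeneration.from_pretrained(save_path, device_map="auto")
processor = WhisperProcessor.from_pretrained(save_path)

# Inspect the quantization settings that were saved with the model.
print(model.config.quantization_config)
```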

## Evaluation

The model was evaluated on the [LibriSpeech](https://huggingface.co/datasets/lmms-lab/librispeech) and [Fleurs](https://huggingface.co/datasets/lmms-lab/fleurs) datasets using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), via the following commands:

<details>
<summary>Evaluation Commands</summary>

LibriSpeech:
```bash
lmms-eval \
    --model=whisper_vllm \
    --model_args="pretrained=neuralmagic-ent/whisper-large-v3-quantized.w4a16" \
    --batch_size 64 \
    --output_path <output_file_path> \
    --tasks librispeech
```

Fleurs:
```bash
lmms-eval \
    --model=whisper_vllm \
    --model_args="pretrained=neuralmagic-ent/whisper-large-v3-quantized.w4a16" \
    --batch_size 64 \
    --output_path <output_file_path> \
    --tasks fleurs
```
</details>

<table>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>Split</th>
      <th>BF16</th>
      <th>W4A16</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2"><b>LibriSpeech (WER)</b></td>
      <td>test-clean</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>test-other</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td rowspan="3"><b>Fleurs (X→en, BLEU)</b></td>
      <td>cmn_hans_cn</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>en</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>yue_hant_hk</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>
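
For context on the last column: recovery is commonly reported as the ratio between the quantized and baseline scores, oriented so that 100% means no degradation. The exact convention is not stated in this card, so the helper below is only a sketch of that common practice, illustrated with hypothetical numbers.

```python
def recovery(baseline: float, quantized: float, lower_is_better: bool) -> float:
    """Percent recovery of the quantized model relative to the BF16 baseline.

    Assumption: for lower-is-better metrics (WER) the ratio is baseline/quantized;
    for higher-is-better metrics (BLEU) it is quantized/baseline.
    """
    ratio = baseline / quantized if lower_is_better else quantized / baseline
    return 100.0 * ratio

# Hypothetical values, for illustration only.
print(f"WER  recovery: {recovery(2.00, 2.10, lower_is_better=True):.2f}%")
print(f"BLEU recovery: {recovery(20.0, 19.5, lower_is_better=False):.2f}%")
```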