Update README.md
README.md CHANGED
````diff
@@ -186,19 +186,6 @@ and use a token with write access, from https://huggingface.co/settings/tokens
 
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
-Need to install lm-eval from source:
-https://github.com/EleutherAI/lm-evaluation-harness#install
-
-
-## baseline
-```Shell
-lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
-```
-
-## float8 dynamic activation and float8 weight quantization (float8dq)
-```Shell
-lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
-```
 
 | Benchmark                        | Phi-4-mini-instruct | Phi-4-mini-instruct-float8dq |
 |----------------------------------|----------------|-------------------------------|
````
````diff
@@ -222,6 +209,25 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 | mathqa (0-shot) | 42.31 | 42.51 |
 | **Overall** | **55.35** | **55.11** |
 
+<details>
+<summary> Reproduce Model Quality Results </summary>
+
+Need to install lm-eval from source:
+https://github.com/EleutherAI/lm-evaluation-harness#install
+
+
+## baseline
+```Shell
+lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
+```
+
+## float8 dynamic activation and float8 weight quantization (float8dq)
+```Shell
+lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
+```
+</details>
+
+
 # Peak Memory Usage
 
 
````
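For reference, the hellaswag evaluation in the collapsed section above can also be driven from Python through lm-eval's `simple_evaluate` entry point. This is a minimal sketch, not part of the README diff; it assumes lm-eval is installed from source as noted above and that a CUDA device is available.

```Python
# Minimal sketch (assumption: lm-eval installed from source, CUDA available).
# Mirrors the CLI commands above via lm-eval's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-float8dq",  # or microsoft/Phi-4-mini-instruct for the baseline
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
# Per-task metrics (acc, acc_norm, ...) for the hellaswag task.
print(results["results"]["hellaswag"])
```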
````diff
@@ -233,7 +239,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 | Peak Memory (GB) | 8.91 | 5.70 (36% reduction) |
 
 
-
+<details>
+<summary> Reproduce Peak Memory Usage Results </summary>
 
 We can use the following code to get a sense of peak memory usage during inference:
 
````
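The measurement script itself is elided by the diff (only its last two lines appear in the next hunk). As a rough, self-contained sketch of the same idea, with assumed model-loading details rather than the README's exact code:

```Python
# Rough sketch (assumed details, not the README's exact snippet): track peak CUDA
# memory while generating with the quantized checkpoint. Loading the float8dq
# checkpoint is assumed to require torchao to be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-float8dq"  # or microsoft/Phi-4-mini-instruct for the baseline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")

torch.cuda.reset_peak_memory_stats()

inputs = tokenizer("What are the benefits of float8 quantization?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Same reporting as the tail of the README's snippet shown in the next hunk.
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```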
````diff
@@ -278,6 +285,8 @@ mem = torch.cuda.max_memory_reserved() / 1e9
 print(f"Peak Memory Usage: {mem:.02f} GB")
 ```
 
+</details>
+
 # Model Performance
 
 ## Results (H100 machine)
````
````diff
@@ -291,6 +300,9 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 
 Note that the latency results (benchmark_latency) are in seconds and the serving results (benchmark_serving) are in requests per second.
 
+<details>
+<summary> Reproduce Model Performance Results </summary>
+
 ## Setup
 Get vllm source code:
 ```Shell
````
````diff
@@ -351,6 +363,7 @@ Client:
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
 ```
 
+</details>
 
 
 # Disclaimer
````
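The serving benchmark above assumes the vLLM server from the (elided) Server step is already running. A quick way to confirm the endpoint is up before benchmarking is to send a single request through vLLM's OpenAI-compatible API; the sketch below assumes the default local port 8000 and the `openai` Python client, neither of which is specified in the README.

```Python
# Sanity check (assumptions: server already launched for this model, default
# OpenAI-compatible endpoint at http://localhost:8000/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-float8dq",
    messages=[{"role": "user", "content": "Give a one-sentence summary of float8 quantization."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```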