jerryzh168 committed (verified)
Commit fec8eb6 · Parent: 7e6fb4f

Update README.md

Files changed (1)
  1. README.md +27 -14
README.md CHANGED
@@ -186,19 +186,6 @@ and use a token with write access, from https://huggingface.co/settings/tokens
 
  # Model Quality
  We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
- Need to install lm-eval from source:
- https://github.com/EleutherAI/lm-evaluation-harness#install
-
-
- ## baseline
- ```Shell
- lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
- ```
-
- ## float8 dynamic activation and float8 weight quantization (float8dq)
- ```Shell
- lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
- ```
 
  | Benchmark | | |
  |----------------------------------|----------------|-------------------------------|
@@ -222,6 +209,25 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
  | mathqa (0-shot) | 42.31 | 42.51 |
  | **Overall** | **55.35** | **55.11** |
 
+ <details>
+ <summary> Reproduce Model Quality Results </summary>
+
+ Need to install lm-eval from source:
+ https://github.com/EleutherAI/lm-evaluation-harness#install
+
+
+ ## baseline
+ ```Shell
+ lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
+ ```
+
+ ## float8 dynamic activation and float8 weight quantization (float8dq)
+ ```Shell
+ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
+ ```
+ </details>
+
+
  # Peak Memory Usage
 
 
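A side note on the lm_eval commands moved into the `<details>` block above: each one evaluates a single task (hellaswag). lm-eval also accepts a comma-separated task list, so several rows of the quality table can be reproduced in one run. A minimal sketch; which task set actually produced the table is an assumption:

```Shell
# Hypothetical multi-task run; task names beyond hellaswag are assumed, not taken from the README.
lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag,mathqa --device cuda:0 --batch_size 8
```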
@@ -233,7 +239,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
  | Peak Memory (GB) | 8.91 | 5.70 (36% reduction) |
 
 
- ## Benchmark Peak Memory
+ <details>
+ <summary> Reproduce Peak Memory Usage Results </summary>
 
  We can use the following code to get a sense of peak memory usage during inference:
 
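The snippet that follows this line in the README is not changed by the commit, so only its last two lines appear in the next hunk. For orientation, here is a minimal sketch of that kind of measurement, assuming a transformers-based load of the quantized checkpoint; the prompt, dtype, and generation settings are illustrative, not the README's exact code:

```Python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id; swap in microsoft/Phi-4-mini-instruct to measure the baseline.
model_id = "pytorch/Phi-4-mini-instruct-float8dq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda:0")

torch.cuda.reset_peak_memory_stats()  # count peak memory from here on

inputs = tokenizer("What are we having for dinner?", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=128)

mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```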
@@ -278,6 +285,8 @@ mem = torch.cuda.max_memory_reserved() / 1e9
  print(f"Peak Memory Usage: {mem:.02f} GB")
  ```
 
+ </details>
+
  # Model Performance
 
  ## Results (H100 machine)
@@ -291,6 +300,9 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 
  Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
 
+ <details>
+ <summary> Reproduce Model Performance Results </summary>
+
  ## Setup
  Get vllm source code:
  ```Shell
@@ -351,6 +363,7 @@ Client:
  python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
  ```
 
+ </details>
 
 
  # Disclaimer
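The serving command above is the only benchmark invocation visible in these hunks; the latency figures mentioned earlier come from vLLM's benchmark_latency script run from the same source checkout. A hedged sketch of that invocation; the flag values are illustrative, so check `python benchmarks/benchmark_latency.py --help` in your checkout:

```Shell
# Illustrative latency benchmark (reports seconds per iteration); values are assumptions, not from the README.
python benchmarks/benchmark_latency.py --model pytorch/Phi-4-mini-instruct-float8dq --input-len 256 --output-len 256 --batch-size 1
```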
 