Update README.md
README.md CHANGED
````diff
@@ -186,19 +186,6 @@ and use a token with write access, from https://huggingface.co/settings/tokens
 
 # Model Quality
 We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
-Need to install lm-eval from source:
-https://github.com/EleutherAI/lm-evaluation-harness#install
-
-
-## baseline
-```Shell
-lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
-```
-
-## float8 dynamic activation and float8 weight quantization (float8dq)
-```Shell
-lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
-```
 
 | Benchmark                        | Phi-4-mini-instruct | Phi-4-mini-instruct-float8dq |
 |----------------------------------|----------------|-------------------------------|
````
````diff
@@ -222,6 +209,25 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 | mathqa (0-shot) | 42.31 | 42.51 |
 | **Overall** | **55.35** | **55.11** |
 
+<details>
+<summary> Reproduce Model Quality Results </summary>
+
+Need to install lm-eval from source:
+https://github.com/EleutherAI/lm-evaluation-harness#install
+
+
+## baseline
+```Shell
+lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
+```
+
+## float8 dynamic activation and float8 weight quantization (float8dq)
+```Shell
+lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq --tasks hellaswag --device cuda:0 --batch_size 8
+```
+</details>
+
+
 # Peak Memory Usage
 
 
````
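For reference, the hellaswag evaluation in the collapsed section above can also be driven from Python through lm-eval's `simple_evaluate` entry point. This is a minimal sketch, not part of the README diff; it assumes lm-eval is installed from source as noted above and that a CUDA device is available.

```Python
# Minimal sketch (assumption: lm-eval installed from source, CUDA available).
# Mirrors the CLI commands above via lm-eval's Python API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pytorch/Phi-4-mini-instruct-float8dq",  # or microsoft/Phi-4-mini-instruct for the baseline
    tasks=["hellaswag"],
    device="cuda:0",
    batch_size=8,
)
# Per-task metrics (acc, acc_norm, ...) for the hellaswag task.
print(results["results"]["hellaswag"])
```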
````diff
@@ -233,7 +239,8 @@ lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-float8dq
 | Peak Memory (GB) | 8.91 | 5.70 (36% reduction) |
 
 
-
+<details>
+<summary> Reproduce Peak Memory Usage Results </summary>
 
 We can use the following code to get a sense of peak memory usage during inference:
 
````
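The measurement script itself is elided by the diff (only its last two lines appear in the next hunk). As a rough, self-contained sketch of the same idea, with assumed model-loading details rather than the README's exact code:

```Python
# Rough sketch (assumed details, not the README's exact snippet): track peak CUDA
# memory while generating with the quantized checkpoint. Loading the float8dq
# checkpoint is assumed to require torchao to be installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Phi-4-mini-instruct-float8dq"  # or microsoft/Phi-4-mini-instruct for the baseline
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="cuda:0")

torch.cuda.reset_peak_memory_stats()

inputs = tokenizer("What are the benefits of float8 quantization?", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Same reporting as the tail of the README's snippet shown in the next hunk.
mem = torch.cuda.max_memory_reserved() / 1e9
print(f"Peak Memory Usage: {mem:.02f} GB")
```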
````diff
@@ -278,6 +285,8 @@ mem = torch.cuda.max_memory_reserved() / 1e9
 print(f"Peak Memory Usage: {mem:.02f} GB")
 ```
 
+</details>
+
 # Model Performance
 
 ## Results (H100 machine)
````
````diff
@@ -291,6 +300,9 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 
 Note that the latency results (benchmark_latency) are in seconds and the serving results (benchmark_serving) are in requests per second.
 
+<details>
+<summary> Reproduce Model Performance Results </summary>
+
 ## Setup
 Get vllm source code:
 ```Shell
````
````diff
@@ -351,6 +363,7 @@ Client:
 python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-float8dq --num-prompts 1
 ```
 
+</details>
 
 
 # Disclaimer
````
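The serving benchmark above assumes the vLLM server from the (elided) Server step is already running. A quick way to confirm the endpoint is up before benchmarking is to send a single request through vLLM's OpenAI-compatible API; the sketch below assumes the default local port 8000 and the `openai` Python client, neither of which is specified in the README.

```Python
# Sanity check (assumptions: server already launched for this model, default
# OpenAI-compatible endpoint at http://localhost:8000/v1).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="pytorch/Phi-4-mini-instruct-float8dq",
    messages=[{"role": "user", "content": "Give a one-sentence summary of float8 quantization."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```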