Update README.md

README.md

@@ -17,7 +17,7 @@ base_model:
 pipeline_tag: text-generation
 ---
-[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team. Use it directly, or serve using [vLLM](https://docs.vllm.ai/en/latest/) with 36% VRAM reduction, 15-20% speedup and little to no accuracy impact on H100.
+[Phi4-mini](https://huggingface.co/microsoft/Phi-4-mini-instruct) model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) float8 dynamic activation and float8 weight quantization (per row granularity), by PyTorch team. Use it directly, or serve using [vLLM](https://docs.vllm.ai/en/latest/) with 36% VRAM reduction, 1.15x-1.2x speedup and little to no accuracy impact on H100.
 # Inference with vLLM
 Install vllm nightly to get some recent changes:
@@ -281,13 +281,13 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 # Model Performance
 ## Results (H100 machine)
-| Benchmark                        |                |                               |
-|----------------------------------|----------------|-------------------------------|
-|                                  | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq  |
-| latency (batch_size=1)           | 1.64s          | 1.41s (16% speedup)           |
-| latency (batch_size=128)         | 3.1s           | 2.72s (14% speedup)           |
-| serving (num_prompts=1)          | 1.35 req/s     | 1.57 req/s (16% speedup)      |
-| serving (num_prompts=1000)       | 66.68 req/s    | 80.53 req/s (21% speedup)     |
+| Benchmark                        |                |                                 |
+|----------------------------------|----------------|---------------------------------|
+|                                  | Phi-4 mini-Ins | Phi-4-mini-instruct-float8dq    |
+| latency (batch_size=1)           | 1.64s          | 1.41s (1.16x speedup)           |
+| latency (batch_size=128)         | 3.1s           | 2.72s (1.14x speedup)           |
+| serving (num_prompts=1)          | 1.35 req/s     | 1.57 req/s (1.16x speedup)      |
+| serving (num_prompts=1000)       | 66.68 req/s    | 80.53 req/s (1.21x speedup)     |
 Note the result of latency (benchmark_latency) is in seconds, and serving (benchmark_serving) is in number of requests per second.
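
For reference, the speedup factors in the updated table can be reproduced from the raw numbers. This is a minimal sketch, not part of the README or the benchmark scripts. The helper names (`speedup_from_latency`, `speedup_from_throughput`) are hypothetical, and it assumes that latency rows use baseline / quantized (lower is better) while serving rows use quantized / baseline (higher is better).

```python
# Hypothetical helpers for checking the table; not part of the README itself.

def speedup_from_latency(baseline_s: float, quantized_s: float) -> float:
    """Latency is lower-is-better, so speedup = baseline / quantized."""
    return baseline_s / quantized_s


def speedup_from_throughput(baseline_rps: float, quantized_rps: float) -> float:
    """Throughput is higher-is-better, so speedup = quantized / baseline."""
    return quantized_rps / baseline_rps


# Raw values copied from the results table (H100 machine).
speedups = {
    "latency (batch_size=1)": speedup_from_latency(1.64, 1.41),
    "latency (batch_size=128)": speedup_from_latency(3.1, 2.72),
    "serving (num_prompts=1)": speedup_from_throughput(1.35, 1.57),
    "serving (num_prompts=1000)": speedup_from_throughput(66.68, 80.53),
}

for name, factor in speedups.items():
    # Print both notations: the multiplicative factor (new table) and the
    # percentage gain (old table), i.e. (factor - 1) * 100.
    print(f"{name}: {factor:.2f}x speedup ({(factor - 1) * 100:.0f}%)")
```

Under these assumptions, the rounded factors match the `+` rows of the table, and the percentage form matches the `-` rows, so the change alters only the notation and not the measurements.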