---
license: llama3.1
---

## Introduction
This is a vLLM-compatible FP8 post-training quantized (PTQ) model based on [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
For the detailed quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).

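As a rough illustration of what FP8 PTQ does to the weights, the sketch below shows per-tensor static scaling to the e4m3 format. This is an illustrative assumption only, not the Quark API; the actual scheme used for this checkpoint (per-tensor vs. per-channel scales, activation scaling, etc.) is the one described in the Quark documentation linked above.

```python
# Illustrative per-tensor FP8 (e4m3) weight quantization -- NOT the Quark quantizer API.
import torch

def quantize_fp8_per_tensor(w: torch.Tensor):
    """Quantize a weight tensor to float8_e4m3 with a single dequantization scale."""
    fp8_max = torch.finfo(torch.float8_e4m3fn).max            # 448.0 for e4m3
    scale = w.abs().max().clamp(min=1e-12).float() / fp8_max  # per-tensor scale
    w_fp8 = (w.float() / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_fp8, scale

w = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8, scale = quantize_fp8_per_tensor(w)
w_approx = w_fp8.float() * scale                 # dequantized reconstruction
print((w.float() - w_approx).abs().max())        # worst-case quantization error
```
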
## Quickstart

To run this FP8 model on the vLLM framework:

### Model Preparation
1. Build the ROCm vLLM Docker image from this [dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm) and launch a vLLM Docker container.

```sh
docker build -f Dockerfile.rocm -t vllm_test .
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest
```

2. Clone the baseline [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) model.
3. Clone this [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm) (one way to fetch both repositories is sketched after this list).
4. Copy llama.safetensors and llama.json from the [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm) into the snapshot directory of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) with the commands below. Your snapshot commit (8c22764a7e3675c50d4c7c9a4edb474456022b16 here) may differ.
```sh
cp llama.json ~/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/.
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/.
```

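Steps 2 and 3 do not prescribe a download method. One possible way (an assumption, not part of this card) is to clone both repositories with git-lfs, as sketched below. The paths used in step 4 follow the Hugging Face Hub cache layout (models--ORG--NAME/snapshots/COMMIT), so adjust the destination paths to wherever your copy of the baseline model actually lives.

```sh
# Hypothetical fetch of the base model and the fp8 checkpoint. The gated
# meta-llama repository requires accepting its license and authenticating
# with your Hugging Face credentials.
git lfs install
git clone https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct
git clone https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm
```
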
### Running fp8 model

```sh
# single GPU
python run_vllm_fp8.py

# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py
```

```python
# run_vllm_fp8.py
from vllm import LLM, SamplingParams
prompt = "Write me an essay about bear and knight"

model_name = "/workspace/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/"
tp = 1  # single GPU
tp = 8  # 8 GPUs -- keep only the tp value that matches the launch command above

model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192,
            trust_remote_code=True, dtype="float16", quantization="fp8",
            quantized_weights_path="/llama.safetensors")
sampling_params = SamplingParams(
    top_k=1,
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```
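
`LLM.generate` returns a list of vLLM `RequestOutput` objects, so `print(result)` above dumps the full objects. To print only the generated text for each prompt, index into the outputs, for example:

```python
# Print just the generated completion text instead of the whole RequestOutput repr.
for request_output in result:
    print(request_output.outputs[0].text)
```
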
### Running fp16 model (For comparison)

```sh
# single GPU
python run_vllm_fp16.py

# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py
```

```python
# run_vllm_fp16.py
from vllm import LLM, SamplingParams
prompt = "Write me an essay about bear and knight"

model_name = "/workspace/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/"
tp = 1  # single GPU
tp = 8  # 8 GPUs -- keep only the tp value that matches the launch command above
model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192,
            trust_remote_code=True, dtype="bfloat16")
sampling_params = SamplingParams(
    top_k=1,
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```

#### License
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.