---
license: llama3.1
---
## Introduction
This is a vLLM-compatible fp8 PTQ (post-training quantization) model based on [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
For the detailed quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).
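
As background, the core idea of fp8 PTQ is to map each high-precision tensor to 8-bit floating point using a scale chosen after training. The sketch below illustrates simple per-tensor E4M3 quantization in PyTorch; it is an illustration only, not the exact Quark 0.2.0 recipe, so defer to the documentation linked above for the actual scheme.

```python
# Illustrative per-tensor FP8 (E4M3) post-training quantization.
# Not the AMD Quark implementation; for intuition only.
import torch

def quantize_fp8_per_tensor(w: torch.Tensor):
    fp8_max = torch.finfo(torch.float8_e4m3fn).max    # 448.0 for E4M3
    scale = w.abs().max().clamp(min=1e-12) / fp8_max  # per-tensor scale
    w_fp8 = (w / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return w_fp8, scale  # dequantize as w_fp8.float() * scale

w = torch.randn(1024, 1024)
w_fp8, scale = quantize_fp8_per_tensor(w)
print(w_fp8.dtype, float(scale))
```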
## Quickstart

To run this fp8 model on the vLLM framework, follow the steps below.

### Model Preparation
1. Build the ROCm vLLM Docker image using this [Dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm), then launch a vLLM Docker container:

```sh
docker build -f Dockerfile.rocm -t vllm_test .
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest
```
2. Clone the baseline [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
3. Clone this [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm) and, inside the cloned fp8 model folder, run the command below to merge the split llama-*.safetensors shards into a single llama.safetensors (a conceptual sketch of the merge follows these steps):

```sh
python merge.py
```
4. Once the merged llama.safetensors has been created, move it together with llama.json into the local snapshot directory of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) using the commands below. Note that your snapshot commit hash may differ from 069992c75aed59df00ec06c17177e76c63296a26.
```sh
cp llama.json ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
```
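For reference, the merge in step 3 conceptually loads every `llama-*.safetensors` shard and re-saves all tensors as one file. The snippet below is a hypothetical reconstruction using the `safetensors` library; the shipped merge.py is authoritative, and this naive version holds every tensor in host memory at once.

```python
# Hypothetical sketch of merge.py: combine split llama-*.safetensors
# shards into a single llama.safetensors. Assumes the shards hold
# disjoint tensor names and that everything fits in host memory.
import glob
from safetensors.torch import load_file, save_file

merged = {}
for shard in sorted(glob.glob("llama-*.safetensors")):
    merged.update(load_file(shard))  # read every tensor in this shard
save_file(merged, "llama.safetensors")
```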
### Running the fp8 model

```sh
# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py
```
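Before launching, it can help to confirm that all eight GPUs are visible inside the container; on ROCm builds of PyTorch the HIP devices are exposed through the `torch.cuda` API. This quick check is an optional addition, not part of the original instructions.

```python
# Optional sanity check: expect 8 devices before running torchrun with
# --nproc_per_node=8 (ROCm PyTorch reports HIP GPUs via torch.cuda).
import torch

print(torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```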
```python
# run_vllm_fp8.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about a bear and a knight"

# Baseline snapshot directory that now also holds llama.json and llama.safetensors.
model_name = "models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
tp = 8  # tensor parallelism across 8 GPUs

model = LLM(
    model=model_name,
    tensor_parallel_size=tp,
    max_model_len=8192,
    trust_remote_code=True,
    dtype="float16",
    quantization="fp8",
    quantized_weights_path="/llama.safetensors",
)
sampling_params = SamplingParams(
    top_k=1,  # top_k is an integer; 1 makes decoding greedy
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```
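`LLM.generate` returns a list of `RequestOutput` objects, so `print(result)` dumps request metadata along with the text. To print only the completion (this applies equally to the fp16 script below):

```python
# Print only the generated text for each request in the batch.
for request_output in result:
    print(request_output.outputs[0].text)
```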
### Running the fp16 model (for comparison)

```sh
# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py
```
```python
# run_vllm_fp16.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about a bear and a knight"

model_name = "models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
tp = 8  # tensor parallelism across 8 GPUs

model = LLM(
    model=model_name,
    tensor_parallel_size=tp,
    max_model_len=8192,
    trust_remote_code=True,
    dtype="bfloat16",
)
sampling_params = SamplingParams(
    top_k=1,  # top_k is an integer; 1 makes decoding greedy
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
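```

To put a rough number on the fp8 vs. fp16 comparison, either script can time the `generate` call. The snippet below is a simple sketch (not from the original card) and measures end-to-end throughput only:

```python
# Rough end-to-end throughput measurement around the generate() call.
import time

start = time.perf_counter()
result = model.generate(prompt, sampling_params=sampling_params)
elapsed = time.perf_counter() - start

generated = sum(len(out.outputs[0].token_ids) for out in result)
print(f"{generated} tokens in {elapsed:.1f} s ({generated / elapsed:.1f} tok/s)")
```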
## fp8 GEMM tuning
Will update soon.
## License

Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.