# Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter
This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.
## How to Deploy as an Endpoint
1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository** (see the upload sketch after this list).
   - The directory should contain `adapter_config.json`, the adapter weights (`adapter_model.safetensors` or `adapter_model.bin`, depending on your PEFT version), and the tokenizer files.
2. **Add a `handler.py` file to define the endpoint logic.**
3. **Push to the Hugging Face Hub.**
4. **Deploy as an Inference Endpoint via the Hugging Face UI.**
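A minimal upload sketch using `huggingface_hub` is shown below; the repository id `your-username/gemma-3n-e4b-it-lora` is a placeholder, so substitute your own repo:

```python
from huggingface_hub import HfApi

api = HfApi()  # assumes you are logged in via `huggingface-cli login` or HF_TOKEN

# Upload the trained adapter directory into an `adapter/` folder in the repo.
# "your-username/gemma-3n-e4b-it-lora" is a placeholder repo id.
api.upload_folder(
    repo_id="your-username/gemma-3n-e4b-it-lora",
    folder_path="adapter",
    path_in_repo="adapter",
)

# Upload the endpoint handler alongside it.
api.upload_file(
    repo_id="your-username/gemma-3n-e4b-it-lora",
    path_or_fileobj="handler.py",
    path_in_repo="handler.py",
)
```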
---
## Example `handler.py`
This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference.
```python
from typing import Any, Dict

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = "."):
        # Load the base model and tokenizer.
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)

        # Attach the LoRA adapter stored in the repository's `adapter/` directory.
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()

        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Accept either {"inputs": "<prompt>"} or a raw prompt string.
        prompt = data["inputs"] if isinstance(data, dict) else data
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```
- Replace `<BASE_MODEL_ID>` with the base model your adapter was trained on (e.g., `google/gemma-2b`).
- The endpoint accepts a JSON payload with an `inputs` field containing the prompt; see the request sketch below.
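A request from Python might look like the following sketch; the endpoint URL is a placeholder, and the token is only needed if your endpoint is not public:

```python
import os

import requests

# Placeholders: substitute the URL shown on your endpoint's page and a valid HF token.
ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
headers = {
    "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
    "Content-Type": "application/json",
}

response = requests.post(
    ENDPOINT_URL,
    headers=headers,
    json={"inputs": "Explain LoRA fine-tuning in one sentence."},
)
response.raise_for_status()
print(response.json()["generated_text"])
```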
---
## Notes
- Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch`.
- For large models, deploy on an Inference Endpoint with a GPU.
- You can customize the handler for chat formatting, streaming, and other behaviors; a chat-formatting sketch follows these notes.
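As one possible customization, the sketch below is a drop-in replacement for `EndpointHandler.__call__` from the handler above. It assumes the base model's tokenizer ships a chat template (as Gemma instruction-tuned checkpoints do) and wraps the raw prompt with `apply_chat_template` before generating:

```python
    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        prompt = data["inputs"] if isinstance(data, dict) else data

        # Wrap the raw prompt in the model's chat format before tokenizing.
        messages = [{"role": "user", "content": prompt}]
        chat_prompt = self.tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

        inputs = self.tokenizer(chat_prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```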
---
## Quickstart
1. Train your adapter with `train_gemma_unsloth.py`.
2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
3. Deploy as an Inference Endpoint.
4. Send requests to your endpoint!