|
# Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter |
|
|
|
This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint. |
|
|
|
## How to Deploy as an Endpoint |
|
|
|
1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.** |
|
|
|
- The directory should contain `adapter_config.json`, the adapter weights (`adapter_model.safetensors` or `adapter_model.bin`), and the tokenizer files saved during training.
|
|
|
2. **Add a `handler.py` file to define the endpoint logic.** |
|
|
|
3. **Push to the Hugging Face Hub** (see the upload sketch after this list).
|
|
|
4. **Deploy as an Inference Endpoint via the Hugging Face UI.** |
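Steps 1 and 3 can be scripted with the `huggingface_hub` library. Below is a minimal sketch, assuming the adapter was saved to a local `adapter/` directory; `<YOUR_USERNAME>/<YOUR_REPO>` is a placeholder for your own Hub repository:

```python
from huggingface_hub import HfApi

api = HfApi()  # uses the token from `huggingface-cli login` or the HF_TOKEN environment variable
repo_id = "<YOUR_USERNAME>/<YOUR_REPO>"  # placeholder: your Hub repository

# Create the repository if it does not exist yet
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

# Upload the trained adapter directory and the endpoint handler
api.upload_folder(folder_path="adapter", path_in_repo="adapter", repo_id=repo_id, repo_type="model")
api.upload_file(path_or_fileobj="handler.py", path_in_repo="handler.py", repo_id=repo_id, repo_type="model")
```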
|
|
|
--- |
|
|
|
## Example `handler.py` |
|
|
|
This file loads the base model and LoRA adapter, and exposes the `EndpointHandler` class (with a `__call__` method) that Inference Endpoints invokes for each request.
|
|
|
```python
from typing import Any, Dict

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = "."):
        # Base model the adapter was trained on
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"

        # Load tokenizer and base model
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)

        # Attach the LoRA adapter stored in the repo's `adapter/` directory
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()

        # Move the model to GPU if one is available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Inference Endpoints passes the request body as a dict with an "inputs" field
        prompt = data["inputs"] if isinstance(data, dict) else data
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```
|
|
|
- Replace `<BASE_MODEL_ID>` with the base model the adapter was trained on (e.g., `google/gemma-2b`).
|
- The endpoint accepts a JSON payload with an `inputs` field containing the prompt, as in the request sketch below.
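Once the endpoint is running, a request can be sent over plain HTTP. The snippet below is a sketch using `requests`; the endpoint URL is a placeholder taken from your endpoint's overview page, and the token must have access to the endpoint:

```python
import os

import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
headers = {
    "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
    "Content-Type": "application/json",
}

response = requests.post(ENDPOINT_URL, headers=headers, json={"inputs": "Tell me about LoRA adapters."})
print(response.json())  # e.g., {"generated_text": "..."}
```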
|
|
|
--- |
|
|
|
## Notes |
|
|
|
- Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch`. |
|
- For large models, use an Inference Endpoint with GPU. |
|
- You can customize the handler for chat formatting, streaming, etc. (see the chat-template sketch below).
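For chat formatting, one option is to build the prompt with the tokenizer's chat template before tokenizing. This is a minimal sketch of a helper method you could add to `EndpointHandler`, assuming the base model ships a chat template (instruction-tuned Gemma checkpoints do); `build_prompt` is a hypothetical name:

```python
def build_prompt(self, data):
    # Accept either {"inputs": "plain prompt"} or {"inputs": [{"role": "user", "content": "..."}]}
    messages = data["inputs"]
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
    # Render the conversation with the model's chat template and leave room for the reply
    return self.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```

The returned string can then be tokenized and passed to `generate` exactly as in the example handler above.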
|
|
|
--- |
|
|
|
## Quickstart |
|
|
|
1. Train your adapter with `train_gemma_unsloth.py`. |
|
2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo. |
|
3. Deploy as an Inference Endpoint. |
|
4. Send requests to your endpoint! |
|
|
|