# Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter
This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.
## How to Deploy as an Endpoint
1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.** (A script-based upload sketch follows this list.)
   - The directory should contain `adapter_config.json`, `adapter_model.bin` (or `adapter_model.safetensors`, depending on your PEFT version), and the tokenizer files.
2. **Add a `handler.py` file to define the endpoint logic.**
3. **Push to the Hugging Face Hub.**
4. **Deploy as an Inference Endpoint via the Hugging Face UI.**
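If you prefer to upload from a script instead of the web UI, here is a minimal sketch using `huggingface_hub`; the repo ID below is a placeholder, so substitute your own:

```python
# Sketch: push the trained adapter and handler to the Hub programmatically.
# "your-username/gemma-3n-e4b-it-lora" is a placeholder repo ID.
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login`

api.create_repo("your-username/gemma-3n-e4b-it-lora", exist_ok=True)

# Upload the adapter directory produced by training.
api.upload_folder(
    folder_path="adapter",
    path_in_repo="adapter",
    repo_id="your-username/gemma-3n-e4b-it-lora",
)

# Upload the endpoint handler next to it.
api.upload_file(
    path_or_fileobj="handler.py",
    path_in_repo="handler.py",
    repo_id="your-username/gemma-3n-e4b-it-lora",
)
```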
---
## Example `handler.py`
This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference.
```python
from typing import Dict, Any
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
class EndpointHandler:
    def __init__(self, path="."):
        # Load base model and tokenizer
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
        # Load LoRA adapter
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        prompt = data["inputs"] if isinstance(data, dict) else data
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```
- Replace `<BASE_MODEL_ID>` with the correct base model (e.g., `google/gemma-2b`).
- The endpoint will accept a JSON payload with an `inputs` field containing the prompt (see the request sketch below).
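For reference, a request sketch in Python; the endpoint URL and token are placeholders:

```python
# Sketch: call the deployed endpoint with the payload shape the handler expects.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # your Hugging Face access token (placeholder)

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Explain LoRA adapters in one sentence."},
)
print(response.json())  # e.g., {"generated_text": "..."}
```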
## Notes

- Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch` (a minimal example follows this list).
- For large models, use an Inference Endpoint with a GPU.
- You can customize the handler for chat formatting, streaming, etc.
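A minimal `requirements.txt` for this handler could look like the following; pin versions as needed for your setup:

```text
transformers
peft
torch
```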
## Quickstart

- Train your adapter with `train_gemma_unsloth.py`.
- Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
- Deploy as an Inference Endpoint.
- Send requests to your endpoint! (To verify things locally first, see the smoke-test sketch below.)
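Before deploying, you can smoke-test the handler locally, assuming `handler.py` and the `adapter/` directory sit in the current working directory:

```python
# Sketch: exercise the handler locally with the same payload the endpoint receives.
from handler import EndpointHandler

handler = EndpointHandler(path=".")  # path is the repo root containing adapter/
result = handler({"inputs": "Write a haiku about fine-tuning."})
print(result["generated_text"])
```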