# Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter

This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.

## How to Deploy as an Endpoint

1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.**

   - The directory should contain `adapter_config.json`, the adapter weights (`adapter_model.bin` or `adapter_model.safetensors`), and the tokenizer files.

2. **Add a `handler.py` file to define the endpoint logic.**

3. **Push to the Hugging Face Hub** (see the upload sketch after this list).

4. **Deploy as an Inference Endpoint via the Hugging Face UI.**
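
If you prefer to script steps 1-3 rather than use the web UI, a minimal sketch with `huggingface_hub` might look like the following. The repository ID and local paths are placeholders, not values taken from this project.

```python
# Sketch: push the trained adapter directory and handler.py to the Hub.
# Assumes you are already authenticated (e.g., via `huggingface-cli login`)
# and that the target repo exists; "your-username/gemma-lora-endpoint" is a placeholder.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/gemma-lora-endpoint"  # placeholder repo ID

# Upload the adapter directory produced by training.
api.upload_folder(
    folder_path="adapter",
    path_in_repo="adapter",
    repo_id=repo_id,
    repo_type="model",
)

# Upload the endpoint handler.
api.upload_file(
    path_or_fileobj="handler.py",
    path_in_repo="handler.py",
    repo_id=repo_id,
    repo_type="model",
)
```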

---

## Example `handler.py`

This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference.

```python
from typing import Dict, Any
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

class EndpointHandler:
    def __init__(self, path="."):
        # Load base model and tokenizer
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
        # Load LoRA adapter
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Accept either {"inputs": "..."} or a raw prompt string
        prompt = data["inputs"] if isinstance(data, dict) else data
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        # Generate up to 256 new tokens without tracking gradients
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        # Note: the decoded text includes the prompt as well as the completion
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```

- Replace `<BASE_MODEL_ID>` with the correct base model (e.g., `google/gemma-2b`).
- The endpoint accepts a JSON payload with an `inputs` field containing the prompt, as in the example request below.
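
Once the endpoint is up, you can call it over HTTP. A rough sketch with `requests` follows; the endpoint URL and token are placeholders you would replace with your own.

```python
# Sketch: query the deployed endpoint. ENDPOINT_URL and HF_TOKEN are placeholders.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "<your-hf-token>"  # a token with access to the endpoint

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Write a short poem about deployment."},
)
response.raise_for_status()
print(response.json()["generated_text"])
```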

## Notes

- Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch` (a minimal example follows below).
- For large models, use an Inference Endpoint with a GPU.
- You can customize the handler for chat formatting, streaming, etc.
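
For reference, a minimal `requirements.txt` for this handler might look like the snippet below; exact version pins are up to you and are not taken from this project.

```text
transformers
peft
torch
```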

## Quickstart

1. Train your adapter with `train_gemma_unsloth.py`.
2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo (you can sanity-check the handler locally first; see the sketch below).
3. Deploy as an Inference Endpoint.
4. Send requests to your endpoint!
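
Before uploading (step 2), it can help to exercise the handler locally. A rough sketch, run from the directory that contains `handler.py` and the `adapter` folder:

```python
# Sketch: test handler.py locally before deploying.
# Assumes handler.py and the adapter/ directory sit in the current working directory.
from handler import EndpointHandler

handler = EndpointHandler(path=".")
result = handler({"inputs": "Hello, world!"})
print(result["generated_text"])
```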