# Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter

This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.

## How to Deploy as an Endpoint

1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.**

   - The directory should contain `adapter_config.json`, the adapter weights (`adapter_model.bin` or `adapter_model.safetensors`), and the tokenizer files.

2. **Add a `handler.py` file to define the endpoint logic.**

3. **Push to the Hugging Face Hub** (see the upload sketch after this list).

4. **Deploy as an Inference Endpoint via the Hugging Face UI.**
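
If you prefer to script steps 1-3 rather than use the web UI, a minimal sketch with `huggingface_hub` might look like the following. The repository ID and local paths are placeholders, not values taken from this project.

```python
# Sketch: push the trained adapter directory and handler.py to the Hub.
# Assumes you are already authenticated (e.g., via `huggingface-cli login`)
# and that the target repo exists; "your-username/gemma-lora-endpoint" is a placeholder.
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/gemma-lora-endpoint"  # placeholder repo ID

# Upload the adapter directory produced by training.
api.upload_folder(
    folder_path="adapter",
    path_in_repo="adapter",
    repo_id=repo_id,
    repo_type="model",
)

# Upload the endpoint handler.
api.upload_file(
    path_or_fileobj="handler.py",
    path_in_repo="handler.py",
    repo_id=repo_id,
    repo_type="model",
)
```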

---

## Example `handler.py`

This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference.

```python
from typing import Dict, Any
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

class EndpointHandler:
    def __init__(self, path="."):
        # Load base model and tokenizer
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
        # Load LoRA adapter
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Accept either {"inputs": "..."} or a raw prompt string
        prompt = data["inputs"] if isinstance(data, dict) else data
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        # Generate up to 256 new tokens without tracking gradients
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        # Note: the decoded text includes the prompt as well as the completion
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```

- Replace `<BASE_MODEL_ID>` with the correct base model (e.g., `google/gemma-2b`).
- The endpoint accepts a JSON payload with an `inputs` field containing the prompt, as in the example request below.
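
Once the endpoint is up, you can call it over HTTP. A rough sketch with `requests` follows; the endpoint URL and token are placeholders you would replace with your own.

```python
# Sketch: query the deployed endpoint. ENDPOINT_URL and HF_TOKEN are placeholders.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "<your-hf-token>"  # a token with access to the endpoint

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Write a short poem about deployment."},
)
response.raise_for_status()
print(response.json()["generated_text"])
```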

## Notes

- Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch` (a minimal example follows below).
- For large models, use an Inference Endpoint with a GPU.
- You can customize the handler for chat formatting, streaming, etc.
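
For reference, a minimal `requirements.txt` for this handler might look like the snippet below; exact version pins are up to you and are not taken from this project.

```text
transformers
peft
torch
```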

## Quickstart

1. Train your adapter with `train_gemma_unsloth.py`.
2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo (you can sanity-check the handler locally first; see the sketch below).
3. Deploy as an Inference Endpoint.
4. Send requests to your endpoint!
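
Before uploading (step 2), it can help to exercise the handler locally. A rough sketch, run from the directory that contains `handler.py` and the `adapter` folder:

```python
# Sketch: test handler.py locally before deploying.
# Assumes handler.py and the adapter/ directory sit in the current working directory.
from handler import EndpointHandler

handler = EndpointHandler(path=".")
result = handler({"inputs": "Hello, world!"})
print(result["generated_text"])
```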