# Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter

This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.

## How to Deploy as an Endpoint

1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.**

   - The directory should contain `adapter_config.json`, the adapter weights (`adapter_model.safetensors` or `adapter_model.bin`, depending on your PEFT version), and the tokenizer files.

2. **Add a `handler.py` file to define the endpoint logic.**

3. **Push to the Hugging Face Hub** (see the upload sketch after this list).

4. **Deploy as an Inference Endpoint via the Hugging Face UI.**
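
The following is a minimal upload sketch using the `huggingface_hub` client; the repository ID is a placeholder, and it assumes you have already authenticated with `huggingface-cli login`.

```python
# Sketch: push the trained adapter and handler to the Hub.
# REPO_ID is a placeholder for your own repository.
from huggingface_hub import HfApi

REPO_ID = "<YOUR_USERNAME>/<YOUR_REPO>"

api = HfApi()
api.create_repo(REPO_ID, exist_ok=True)

# Upload the adapter directory produced by training.
api.upload_folder(folder_path="adapter", path_in_repo="adapter", repo_id=REPO_ID)

# Upload the endpoint handler.
api.upload_file(path_or_fileobj="handler.py", path_in_repo="handler.py", repo_id=REPO_ID)
```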

---

## Example `handler.py`

This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference.

```python
from typing import Any, Dict

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = "."):
        # Load the base model and tokenizer, then attach the LoRA adapter.
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(
            base_model_id,
            torch_dtype="auto",  # load in the checkpoint's native dtype to save memory
            trust_remote_code=True,
        )
        # Apply the LoRA adapter stored in `<repo>/adapter`.
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Inference Endpoints pass the parsed JSON payload; accept a bare string as well.
        prompt = data["inputs"] if isinstance(data, dict) else data
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        # Note: the decoded text includes the prompt, like the text-generation pipeline default.
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```

- Replace `<BASE_MODEL_ID>` with the correct base model (e.g., `google/gemma-2b`).
- The endpoint will accept a JSON payload with an `inputs` field containing the prompt; an example request is sketched below.
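
Here is a sketch of a client request; the endpoint URL and token are placeholders, and the response shape assumes the handler above.

```python
# Sketch: query the deployed endpoint. URL and token are placeholders.
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "<YOUR_HF_TOKEN>"

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Write a haiku about low-rank adapters."},
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```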

---

## Notes

- Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch`; a minimal example follows these notes.
- For large models, use an Inference Endpoint with GPU.
- You can customize the handler for chat formatting, streaming, etc.
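
A minimal `requirements.txt` might look like the sketch below; the version bounds are illustrative rather than tested pins.

```text
transformers>=4.41
peft>=0.11
torch>=2.1
# Optional, but often needed for device_map / low-memory loading:
accelerate>=0.30
```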

---

## Quickstart

1. Train your adapter with `train_gemma_unsloth.py`.
2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
3. Deploy as an Inference Endpoint.
4. Send requests to your endpoint!
