# Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter
This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.
## How to Deploy as an Endpoint
1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.**
   - The directory should contain `adapter_config.json`, `adapter_model.safetensors` (or `adapter_model.bin`, depending on your PEFT version), and the tokenizer files.
2. **Add a `handler.py` file to define the endpoint logic.**
3. **Push to the Hugging Face Hub** (a scripted upload sketch follows this list).
4. **Deploy as an Inference Endpoint via the Hugging Face UI.**
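The uploads in steps 1-3 can also be scripted with `huggingface_hub`. The sketch below assumes you are logged in (e.g., via `huggingface-cli login`); the repository id and local paths are placeholders.

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/gemma-lora-endpoint"  # placeholder repo id

# Create the model repo if it does not exist yet
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)

# Upload the trained adapter directory into an `adapter/` folder in the repo
api.upload_folder(folder_path="./adapter", path_in_repo="adapter", repo_id=repo_id)

# Upload the endpoint handler alongside it
api.upload_file(path_or_fileobj="handler.py", path_in_repo="handler.py", repo_id=repo_id)
```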
---
## Example `handler.py`
This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference.
```python
from typing import Any, Dict

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer


class EndpointHandler:
    def __init__(self, path: str = "."):
        # Load the base model and its tokenizer
        base_model_id = "<BASE_MODEL_ID>"  # e.g., "google/gemma-2b"
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
        base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
        # Attach the LoRA adapter stored in the repo's `adapter/` directory
        self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
        self.model.eval()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # The endpoint passes the parsed JSON payload; accept a bare string as well
        prompt = data["inputs"] if isinstance(data, dict) else data
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.model.generate(**inputs, max_new_tokens=256)
        decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
        return {"generated_text": decoded}
```
- Replace `<BASE_MODEL_ID>` with the correct base model (e.g., `google/gemma-2b`).
- The endpoint will accept a JSON payload with an `inputs` field containing the prompt.
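For example, a minimal client call could look like the sketch below; the endpoint URL and token are placeholders you copy from the Inference Endpoints UI.

```python
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_xxx"  # placeholder token with access to the endpoint

response = requests.post(
    ENDPOINT_URL,
    headers={
        "Authorization": f"Bearer {HF_TOKEN}",
        "Content-Type": "application/json",
    },
    json={"inputs": "Explain LoRA adapters in one sentence."},
)
response.raise_for_status()
print(response.json()["generated_text"])
```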
---
## Notes
- Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch`.
- For large models, choose a GPU-backed Inference Endpoint.
- You can customize the handler for chat formatting, streaming, and other behavior; a chat-formatting sketch follows these notes.
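For instance, if the base model's tokenizer ships a chat template (instruction-tuned Gemma checkpoints typically do), the prompt construction in `__call__` could be adapted roughly as follows. This is a sketch only; the message format and fallback behavior are assumptions.

```python
def build_prompt(tokenizer, user_message: str) -> str:
    # Use the tokenizer's chat template when available; fall back to the raw text
    messages = [{"role": "user", "content": user_message}]
    if getattr(tokenizer, "chat_template", None):
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    return user_message
```

Inside `EndpointHandler.__call__`, you would then tokenize `build_prompt(self.tokenizer, prompt)` instead of the raw prompt.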
---
## Quickstart
1. Train your adapter with `train_gemma_unsloth.py`.
2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
3. Deploy as an Inference Endpoint (you can smoke-test `handler.py` locally first; see the sketch after this list).
4. Send requests to your endpoint!
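Before deploying, you can smoke-test the handler locally. The sketch assumes `<BASE_MODEL_ID>` has been filled in and that `handler.py` and the `adapter/` directory sit in the current working directory.

```python
from handler import EndpointHandler

# Instantiate the handler exactly as the endpoint runtime would
handler = EndpointHandler(path=".")
print(handler({"inputs": "Hello, Gemma!"}))
```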