ndc8 committed
Commit 65edee9 · 1 Parent(s): 7ecd130
.github/instructions/recheck.instructions.md ADDED
@@ -0,0 +1,90 @@
1
+ ---
2
+ applyTo: "**"
3
+ ---
4
+
5
+ # The QC Mindset: Architect of Trust
6
+
7
+ At the highest level, Quality Control is not about finding defects; it's about **engineering confidence**. Your role is to guarantee a resilient system that protects business value, customer trust, and brand reputation. You are not merely a gatekeeper who inspects products at the end of a line; you are an architect who designs quality into the very foundation of the process.
8
+
9
+ ---
10
+
11
+ # The Three Pillars of High-Level QC Thinking
12
+
13
+ Your strategic thinking should be built on three core pillars that elevate QC from a technical function to a business-critical one.
14
+
15
+ ---
16
+
17
+ ## 1. Think Like a Risk Manager, Not a Feature Tester
18
+
19
+ Your primary concern isn't _"Does this button work?"_ but **"What is the business impact if this system fails?"**
20
+
21
+ ### Shift your focus from individual bugs to a portfolio of risks:
22
+
23
+ - View every potential quality issue through an **economic lens**
24
+ - Quantify failures in terms of:
25
+ - 💰 **Cost impact**
26
+ - 📉 **Customer churn potential**
27
+ - ⚖️ **Legal/regulatory exposure**
28
+ - 🔥 **Reputational damage**
29
+ - Turn quality discussions from **technical debates** into **strategic business decisions**
30
+ - Position yourself as a **vital strategic partner to leadership**
31
+
32
+ ---
33
+
34
+ ## 2. Think Like a System Designer, Not an Inspector
35
+
36
+ Your goal is **prevention, not detection**. A system that relies on end-stage inspection to catch errors is fundamentally broken.
37
+
38
+ ### Design a "Quality Immune System":
39
+
40
+ - Analyze the **entire development lifecycle**
41
+ - Identify **weak points where defects originate**
42
+ - Build **feedback loops** and **automated checks**
43
+ - Establish **cultural standards** that make it hard for defects to survive
44
+ - Measure success by **defects prevented**, not **bugs found**
45
+
46
+ > **Success Metric**: Fewer defects created = stronger quality architecture
47
+
48
+ ---
49
+
50
+ ## 3. Think Like a Governor, Not a Policeman
51
+
52
+ Your authority comes from **objective, data-driven standards**, not subjective opinion. You cannot scale quality based on individual heroics or personal judgment.
53
+
54
+ ### Govern Through Standards:
55
+
56
+ - Establish a clear, **non-negotiable "Definition of Done"**
57
+ - Create a **quality constitution** that everyone understands
58
+ - Shift from **manual inspection** to **process auditing**
59
+ - Focus on **analyzing quality data** and **improving standards**
60
+ - Make quality **systemic, not situational**
61
+
62
+ ---
63
+
64
+ # The Ultimate Litmus Test: The Legacy Question
65
+
66
+ For any major process change, strategic decision, or new initiative, ask the ultimate high-level question:
67
+
68
+ > **"If I left the company tomorrow, would the quality system I built continue to protect the business on its own?"**
69
+
70
+ ### If NO:
71
+
72
+ - Quality still depends too heavily on **individuals**
73
+ - System lacks **institutional resilience**
74
+ - Standards need **greater automation and documentation**
75
+
76
+ ### If YES:
77
+
78
+ - You've created **institutionalized quality**
79
+ - Built **cultural and operational resilience**
80
+ - Designed a system that **operates independently of any single person**
81
+
82
+ ---
83
+
84
+ # Your Ultimate Mission
85
+
86
+ > **Transform quality from a function performed by people into a system that performs for people.**
87
+
88
+ Your ultimate goal is to make quality so inherent in the culture that the dedicated QC function can focus entirely on **strategic risk management** and **future challenges**, rather than inspecting daily deliverables.
89
+
90
+ Create systems that **scale without you** — that's the mark of a true Quality Architect.
Dockerfile CHANGED
@@ -1,34 +0,0 @@
1
- FROM python:3.11-slim
2
-
3
- # Set working directory
4
- WORKDIR /app
5
-
6
- # Install system dependencies
7
- RUN apt-get update && apt-get install -y \
8
- git \
9
- curl \
10
- build-essential \
11
- cmake \
12
- pkg-config \
13
- python3-dev \
14
- && rm -rf /var/lib/apt/lists/*
15
-
16
- # Copy requirements first for better caching
17
- COPY requirements.txt .
18
-
19
- # Install Python dependencies
20
- RUN pip install --no-cache-dir -r requirements.txt
21
-
22
- # Copy application code
23
- COPY . .
24
-
25
- # Expose port
26
- EXPOSE 8000
27
-
28
- # Set environment variables
29
- ENV PYTHONUNBUFFERED=1
30
- ENV HOST=0.0.0.0
31
- ENV PORT=8000
32
-
33
- # Run the application
34
- CMD ["python", "backend_service.py", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,3 +1,49 @@
1
  # Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth
2
 
3
  This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.
 
1
+ # Hugging Face Spaces: FastAPI OpenAI-Compatible Backend
2
+
3
+ This project is now ready to deploy as a Hugging Face Space using FastAPI and transformers (no vLLM, no llama-cpp/gguf).
4
+
5
+ ## Features
6
+
7
+ - OpenAI-compatible `/v1/chat/completions` endpoint
8
+ - Multimodal support (text + image, if the model supports it)
9
+ - Environment variable support via `.env`
10
+ - Hugging Face Spaces compatible (CPU or T4/RTX GPU)
11
+
12
+ ## Usage (Local)
13
+
14
+ ```bash
15
+ pip install -r requirements.txt
16
+ python -m uvicorn backend_service:app --host 0.0.0.0 --port 7860
17
+ ```
18
+
19
+ ## Usage (Hugging Face Spaces)
20
+
21
+ - Push this repo to your Hugging Face Space
22
+ - Space will auto-launch with FastAPI backend
23
+ - Use `/v1/chat/completions` endpoint for OpenAI-compatible clients
24
+
25
+ ## Notes
26
+
27
+ - Only transformers models are supported (no GGUF/llama-cpp, no vLLM)
28
+ - Set your model in the `AI_MODEL` environment variable or edit `backend_service.py`
29
+ - For secrets, use the Hugging Face Spaces Secrets UI or a `.env` file
30
+
31
+ ## Example curl
32
+
33
+ ```bash
34
+ curl -X POST https://<your-space>.hf.space/v1/chat/completions \
35
+ -H "Content-Type: application/json" \
36
+ -d '{"model": "google/gemma-3n-E4B-it", "messages": [{"role": "user", "content": "Hello!"}]}'
37
+ ```
38
+
39
+ ---
40
+
41
+ For more, see Hugging Face Spaces docs: https://huggingface.co/docs/hub/spaces-sdks-docker
42
+
43
+ # Fallback Logic
44
+
45
+ If vLLM fails to start or respond, the backend automatically falls back to the legacy backend.
46
+
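A minimal sketch of how such a health-check-driven fallback could be wired (the health URL, timeout, and return values are assumptions for illustration; the actual wiring lives in `backend_service.py`):

```python
# Sketch: probe the primary backend and fall back to the legacy one if it is unreachable.
import httpx

async def pick_backend(primary_health_url: str = "http://localhost:8000/health") -> str:
    """Return "primary" if the primary backend answers its health check, else "legacy"."""
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            if (await client.get(primary_health_url)).status_code == 200:
                return "primary"
    except httpx.HTTPError:
        pass
    return "legacy"
```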
47
  # Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth
48
 
49
  This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.
README_DEPLOY_HF.md CHANGED
@@ -66,3 +66,73 @@ class EndpointHandler:
66
  2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
67
  3. Deploy as an Inference Endpoint.
68
  4. Send requests to your endpoint!
66
  2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
67
  3. Deploy as an Inference Endpoint.
68
  4. Send requests to your endpoint!
69
+
71
+ # Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter
72
+
73
+ This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.
74
+
75
+ ## How to Deploy as an Endpoint
76
+
77
+ 1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.**
78
+
79
+ - The directory should contain `adapter_config.json`, `adapter_model.bin`, and tokenizer files.
80
+
81
+ 2. **Add a `handler.py` file to define the endpoint logic.**
82
+
83
+ 3. **Push to the Hugging Face Hub.**
84
+
85
+ 4. **Deploy as an Inference Endpoint via the Hugging Face UI.**
86
+
87
+ ---
88
+
89
+ ## Example `handler.py`
90
+
91
+ This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference.
92
+
93
+ ```python
94
+ from typing import Dict, Any
95
+ from transformers import AutoModelForCausalLM, AutoTokenizer
96
+ from peft import PeftModel, PeftConfig
97
+ import torch
98
+
99
+ class EndpointHandler:
100
+ def __init__(self, path="."):
101
+ # Load base model and tokenizer
102
+ base_model_id = "<BASE_MODEL_ID>" # e.g., "google/gemma-2b"
103
+ self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
104
+ base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
105
+ # Load LoRA adapter
106
+ self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
107
+ self.model.eval()
108
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
109
+ self.model.to(self.device)
110
+
111
+ def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
112
+ prompt = data["inputs"] if isinstance(data, dict) else data
113
+ inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
114
+ with torch.no_grad():
115
+ output = self.model.generate(**inputs, max_new_tokens=256)
116
+ decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
117
+ return {"generated_text": decoded}
118
+ ```
119
+
120
+ - Replace `<BASE_MODEL_ID>` with the base model the adapter was trained on (for this project, `google/gemma-3n-E4B-it`).
121
+ - The endpoint will accept a JSON payload with an `inputs` field containing the prompt (see the request sketch below).
122
+
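A minimal request sketch against a deployed endpoint (the endpoint URL and token are placeholders; the payload shape matches the handler above):

```python
# Sketch: send a prompt to the Inference Endpoint and print the generated text.
import requests

API_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
headers = {"Authorization": "Bearer hf_xxx", "Content-Type": "application/json"}

resp = requests.post(API_URL, headers=headers, json={"inputs": "Hello!"})
resp.raise_for_status()
print(resp.json()["generated_text"])
```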
123
+ ---
124
+
125
+ ## Notes
126
+
127
+ - Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch`.
128
+ - For large models, use an Inference Endpoint with GPU.
129
+ - You can customize the handler for chat formatting, streaming, etc. (see the chat-formatting sketch below).
130
+
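For example, chat-style payloads could be routed through the tokenizer's chat template before generation (a sketch only: `build_chat_prompt` and the `messages` payload field are hypothetical, and it assumes the base model ships a chat template):

```python
# Sketch: build a prompt from chat-style messages before calling generate().
def build_chat_prompt(tokenizer, messages):
    # e.g. messages = [{"role": "user", "content": "Hello!"}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```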
131
+ ---
132
+
133
+ ## Quickstart
134
+
135
+ 1. Train your adapter with `train_gemma_unsloth.py`.
136
+ 2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo (an upload sketch follows below).
137
+ 3. Deploy as an Inference Endpoint.
138
+ 4. Send requests to your endpoint!
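For step 2, a small upload sketch using `huggingface_hub` (the repo id is a placeholder; run `huggingface-cli login` first):

```python
# Sketch: push the trained adapter and handler.py to the Hub.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(folder_path="adapter", path_in_repo="adapter", repo_id="<user>/<repo>")
api.upload_file(path_or_fileobj="handler.py", path_in_repo="handler.py", repo_id="<user>/<repo>")
```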
backend_service.py CHANGED
@@ -1,9 +1,15 @@
1
  """
2
  FastAPI Backend AI Service using Gemma-3n-E4B-it-GGUF
3
  Provides OpenAI-compatible chat completion endpoints powered by unsloth/gemma-3n-E4B-it-GGUF
4
  """
5
-
6
- import os
7
  import warnings
8
 
9
  # Suppress warnings before any other imports
@@ -31,14 +37,7 @@ import uvicorn
31
  import requests
32
  from PIL import Image
33
 
34
- # Import llama-cpp-python for GGUF model support
35
- try:
36
- from llama_cpp import Llama
37
- llama_cpp_available = True
38
- logger = logging.getLogger(__name__)
39
- logger.info("✅ llama-cpp-python support available")
40
- except ImportError:
41
- llama_cpp_available = False
42
 
43
  # Keep transformers imports as fallback
44
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -51,14 +50,7 @@ import torch
51
  logging.basicConfig(level=logging.INFO)
52
  logger = logging.getLogger(__name__)
53
 
54
- # Check for optional quantization support
55
- try:
56
- import bitsandbytes as bnb
57
- quantization_available = True
58
- logger.info("✅ BitsAndBytes quantization support available")
59
- except ImportError:
60
- quantization_available = False
61
- logger.warning("⚠️ BitsAndBytes not available - 4-bit models will use standard loading")
62
 
63
  # Pydantic models for multimodal content
64
  class TextContent(BaseModel):
@@ -143,41 +135,17 @@ class CompletionRequest(BaseModel):
143
  temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
144
 
145
 
146
- # Global variables for model management (supporting both GGUF and transformers)
147
- # Model can be configured via environment variable - defaults to Gemma 3n GGUF
148
  current_model = os.environ.get("AI_MODEL", "unsloth/gemma-3n-E4B-it-GGUF")
149
  vision_model = os.environ.get("VISION_MODEL", "Salesforce/blip-image-captioning-base")
150
 
151
- # GGUF model support (llama-cpp-python)
152
- llm = None
153
-
154
- # Transformers model support (fallback)
155
  tokenizer = None
156
  model = None
157
  image_text_pipeline = None # type: ignore
158
 
159
- def get_quantization_config(model_name: str):
160
- """Get quantization config for 4-bit models"""
161
- if not quantization_available:
162
- return None
163
-
164
- # Check if this is a 4-bit model that should use quantization
165
- is_4bit_model = (
166
- "4bit" in model_name.lower() or
167
- "bnb" in model_name.lower() or
168
- "unsloth" in model_name.lower()
169
- )
170
-
171
- if is_4bit_model:
172
- logger.info(f"🔧 Configuring 4-bit quantization for {model_name}")
173
- return BitsAndBytesConfig(
174
- load_in_4bit=True,
175
- bnb_4bit_compute_dtype=torch.float16,
176
- bnb_4bit_quant_type="nf4",
177
- bnb_4bit_use_double_quant=True,
178
- )
179
-
180
- return None
181
 
182
  # Image processing utilities
183
  async def download_image(url: str) -> Image.Image:
@@ -222,135 +190,18 @@ def has_images(messages: List[ChatMessage]) -> bool:
222
  @asynccontextmanager
223
  async def lifespan(app: FastAPI):
224
  """Application lifespan manager for startup and shutdown events"""
225
- global tokenizer, model, image_text_pipeline, llm, current_model
226
- logger.info("🚀 Starting AI Backend Service...")
227
-
228
- # Check if this is a GGUF model that should use llama-cpp-python
229
- is_gguf_model = "gguf" in current_model.lower() or "gemma-3n" in current_model.lower()
230
-
231
  try:
232
- if is_gguf_model and llama_cpp_available:
233
- logger.info(f"📥 Loading GGUF model with llama-cpp-python: {current_model}")
234
-
235
- # Load Gemma 3n GGUF model using llama-cpp-python
236
- try:
237
- llm = Llama.from_pretrained(
238
- repo_id=current_model,
239
- filename="*Q4_K_M.gguf", # Use exact filename pattern from available files
240
- verbose=True,
241
- # Gemma 3n specific settings
242
- n_ctx=4096, # Start with 4K context, can be increased to 32K
243
- n_threads=4, # Adjust based on CPU cores
244
- n_gpu_layers=-1, # Use all GPU layers if CUDA available
245
- # Chat format for Gemma 3n
246
- chat_format="gemma", # Use built-in gemma format
247
- )
248
- logger.info("✅ Successfully loaded Gemma 3n GGUF model")
249
-
250
- except Exception as gguf_error:
251
- logger.warning(f"⚠️ GGUF model loading failed: {gguf_error}")
252
- logger.info("💡 Please ensure you have downloaded the GGUF model file locally")
253
- logger.info("💡 Visit: https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF")
254
-
255
- # For now, we'll continue with transformers fallback
256
- is_gguf_model = False
257
-
258
- # Fallback to transformers if GGUF loading failed or not available
259
- if not is_gguf_model or not llama_cpp_available:
260
- logger.info(f"📥 Loading model with transformers: {current_model}")
261
-
262
- # Load tokenizer and model directly from HuggingFace repo (standard transformers format)
263
- logger.info(f"📥 Loading tokenizer from {current_model}...")
264
- tokenizer = AutoTokenizer.from_pretrained(current_model)
265
-
266
- # Get quantization config if needed
267
- quantization_config = get_quantization_config(current_model)
268
-
269
- logger.info(f"📥 Loading model from {current_model}...")
270
- try:
271
- if quantization_config:
272
- logger.info("🔧 Attempting 4-bit quantization")
273
- model = AutoModelForCausalLM.from_pretrained(
274
- current_model,
275
- quantization_config=quantization_config,
276
- device_map="auto",
277
- torch_dtype=torch.bfloat16,
278
- low_cpu_mem_usage=True,
279
- trust_remote_code=True,
280
- )
281
- else:
282
- logger.info("📥 Using standard model loading with optimized settings")
283
- model = AutoModelForCausalLM.from_pretrained(
284
- current_model,
285
- torch_dtype=torch.bfloat16,
286
- device_map="auto",
287
- low_cpu_mem_usage=True,
288
- trust_remote_code=True,
289
- )
290
- except Exception as quant_error:
291
- if ("CUDA" in str(quant_error) or
292
- "bitsandbytes" in str(quant_error) or
293
- "PackageNotFoundError" in str(quant_error) or
294
- "No package metadata was found for bitsandbytes" in str(quant_error)):
295
-
296
- logger.warning(f"⚠️ Quantization failed - bitsandbytes not available or no CUDA: {quant_error}")
297
- logger.info("🔄 Falling back to standard model loading, ignoring pre-quantized config")
298
-
299
- # For pre-quantized models, we need to load config first and remove quantization
300
- try:
301
- logger.info("🔧 Loading model config to remove quantization settings")
302
-
303
- config = AutoConfig.from_pretrained(current_model, trust_remote_code=True)
304
-
305
- # Remove any quantization configuration from the config
306
- if hasattr(config, 'quantization_config'):
307
- logger.info("🚫 Removing quantization_config from model config")
308
- config.quantization_config = None
309
-
310
- model = AutoModelForCausalLM.from_pretrained(
311
- current_model,
312
- config=config,
313
- torch_dtype=torch.float16,
314
- low_cpu_mem_usage=True,
315
- trust_remote_code=True,
316
- device_map="cpu", # Force CPU when quantization fails
317
- )
318
- except Exception as fallback_error:
319
- logger.warning(f"⚠️ Config-based loading failed: {fallback_error}")
320
- logger.info("🔄 Trying standard loading without quantization config")
321
- try:
322
- model = AutoModelForCausalLM.from_pretrained(
323
- current_model,
324
- torch_dtype=torch.float16,
325
- low_cpu_mem_usage=True,
326
- trust_remote_code=True,
327
- device_map="cpu",
328
- )
329
- except Exception as standard_error:
330
- logger.warning(f"⚠️ Standard loading also failed: {standard_error}")
331
- logger.info("🔄 Trying with minimal configuration - bypassing all quantization")
332
- # Ultimate fallback: Load without any custom config
333
- try:
334
- model = AutoModelForCausalLM.from_pretrained(
335
- current_model,
336
- trust_remote_code=True,
337
- )
338
- except Exception as minimal_error:
339
- logger.warning(f"⚠️ Minimal loading also failed: {minimal_error}")
340
- logger.info("🔄 Final fallback: Using deployment-friendly default model")
341
- # If this specific model absolutely cannot load, fallback to a reliable alternative
342
- fallback_model = "microsoft/DialoGPT-medium"
343
- logger.info(f"📥 Loading fallback model: {fallback_model}")
344
- tokenizer = AutoTokenizer.from_pretrained(fallback_model)
345
- model = AutoModelForCausalLM.from_pretrained(fallback_model)
346
- logger.info(f"✅ Successfully loaded fallback model: {fallback_model}")
347
- # Update current_model to reflect what we actually loaded
348
- current_model = fallback_model
349
- else:
350
- raise quant_error
351
-
352
  logger.info(f"✅ Successfully loaded model and tokenizer: {current_model}")
353
-
354
  # Load image pipeline for multimodal support
355
  try:
356
  logger.info(f"🖼️ Initializing image captioning pipeline with model: {vision_model}")
@@ -359,7 +210,6 @@ async def lifespan(app: FastAPI):
359
  except Exception as e:
360
  logger.warning(f"⚠️ Could not load image captioning pipeline: {e}")
361
  image_text_pipeline = None
362
-
363
  except Exception as e:
364
  logger.error(f"❌ Failed to initialize model: {e}")
365
  raise RuntimeError(f"Service initialization failed: {e}")
@@ -388,9 +238,9 @@ app.add_middleware(
388
 
389
 
390
  def ensure_model_ready():
391
- """Check if either GGUF or transformers model is loaded and ready"""
392
- if llm is None and (tokenizer is None or model is None):
393
- raise HTTPException(status_code=503, detail="Service not ready - no model initialized (neither GGUF nor transformers)")
394
 
395
  def convert_messages_to_prompt(messages: List[ChatMessage]) -> str:
396
  """Convert OpenAI messages format to a single prompt string"""
@@ -482,61 +332,16 @@ async def generate_multimodal_response(
482
 
483
 
484
  def generate_response_local(messages: List[ChatMessage], max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.95) -> str:
485
- """Generate response using local model (GGUF or transformers) with chat template."""
486
  ensure_model_ready()
487
-
488
  try:
489
- # Check if we're using GGUF model (llama-cpp-python)
490
- if llm is not None:
491
- logger.info("🦾 Generating response using Gemma 3n GGUF model")
492
- return generate_response_gguf(messages, max_tokens, temperature, top_p)
493
-
494
- # Fallback to transformers model
495
- logger.info("🤗 Generating response using transformers model")
496
  return generate_response_transformers(messages, max_tokens, temperature, top_p)
497
-
498
  except Exception as e:
499
  logger.error(f"Local generation failed: {e}")
500
  return "I apologize, but I'm having trouble generating a response right now. Please try again."
501
 
502
- def generate_response_gguf(messages: List[ChatMessage], max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.95) -> str:
503
- """Generate response using GGUF model via llama-cpp-python."""
504
- try:
505
- # Use the chat completion method if available
506
- if hasattr(llm, 'create_chat_completion'):
507
- # Convert to dict format for llama-cpp-python
508
- messages_dict = [{"role": msg.role, "content": msg.content} for msg in messages]
509
-
510
- response = llm.create_chat_completion(
511
- messages=messages_dict,
512
- max_tokens=max_tokens,
513
- temperature=temperature,
514
- top_p=top_p,
515
- top_k=64, # Add top_k for better Gemma 3n performance
516
- stop=["<end_of_turn>", "<eos>", "</s>"] # Gemma 3n stop tokens
517
- )
518
-
519
- return response['choices'][0]['message']['content'].strip()
520
-
521
- else:
522
- # Fallback to direct prompt completion
523
- prompt = convert_messages_to_gemma_prompt(messages)
524
-
525
- response = llm(
526
- prompt,
527
- max_tokens=max_tokens,
528
- temperature=temperature,
529
- top_p=top_p,
530
- top_k=64,
531
- stop=["<end_of_turn>", "<eos>", "</s>"],
532
- echo=False
533
- )
534
-
535
- return response['choices'][0]['text'].strip()
536
-
537
- except Exception as e:
538
- logger.error(f"GGUF generation failed: {e}")
539
- return "I apologize, but I'm having trouble generating a response right now. Please try again."
540
 
541
  def convert_messages_to_gemma_prompt(messages: List[ChatMessage]) -> str:
542
  """Convert OpenAI messages format to Gemma 3n chat format."""
@@ -568,7 +373,7 @@ def generate_response_transformers(messages: List[ChatMessage], max_tokens: int
568
  content_str = m.content if isinstance(m.content, str) else extract_text_and_images(m.content)[0]
569
  chat_messages.append({"role": m.role, "content": content_str})
570
 
571
- # Apply chat template exactly as in HuggingFace example
572
  inputs = tokenizer.apply_chat_template(
573
  chat_messages,
574
  add_generation_prompt=True,
@@ -576,13 +381,12 @@ def generate_response_transformers(messages: List[ChatMessage], max_tokens: int
576
  return_dict=True,
577
  return_tensors="pt",
578
  )
579
-
580
- # Move inputs to model device
581
- inputs = inputs.to(model.device)
582
-
583
- # Generate response exactly as in HuggingFace example
584
- outputs = model.generate(**inputs, max_new_tokens=max_tokens)
585
-
586
  # Decode only the newly generated tokens (exclude input)
587
  generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
588
  return generated_text.strip()
@@ -644,11 +448,12 @@ async def list_models():
644
  # ...existing code...
645
 
646
 
 
 
 
647
  @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
648
- async def create_chat_completion(
649
- request: ChatCompletionRequest
650
- ) -> ChatCompletionResponse:
651
- """Create a chat completion (OpenAI-compatible) with multimodal support."""
652
  try:
653
  if not request.messages:
654
  raise HTTPException(status_code=400, detail="Messages cannot be empty")
 
1
+
2
+ from dotenv import load_dotenv
3
+ load_dotenv()
4
+ import os
5
+ import httpx
6
+
7
+ # Hugging Face Spaces: Only transformers backend is supported (no vLLM, no llama-cpp/gguf)
8
+
9
  """
10
  FastAPI Backend AI Service using Gemma-3n-E4B-it-GGUF
11
  Provides OpenAI-compatible chat completion endpoints powered by unsloth/gemma-3n-E4B-it-GGUF
12
  """
 
 
13
  import warnings
14
 
15
  # Suppress warnings before any other imports
 
37
  import requests
38
  from PIL import Image
39
 
40
+
41
 
42
  # Keep transformers imports as fallback
43
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
50
  logging.basicConfig(level=logging.INFO)
51
  logger = logging.getLogger(__name__)
52
 
53
+
54
 
55
  # Pydantic models for multimodal content
56
  class TextContent(BaseModel):
 
135
  temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
136
 
137
 
138
+
139
+ # Model can be configured via environment variable - defaults to Gemma 3n (transformers format)
140
  current_model = os.environ.get("AI_MODEL", "unsloth/gemma-3n-E4B-it-GGUF")
141
  vision_model = os.environ.get("VISION_MODEL", "Salesforce/blip-image-captioning-base")
142
 
143
+ # Transformers model support
 
 
 
144
  tokenizer = None
145
  model = None
146
  image_text_pipeline = None # type: ignore
147
 
148
+
149
 
150
  # Image processing utilities
151
  async def download_image(url: str) -> Image.Image:
 
190
  @asynccontextmanager
191
  async def lifespan(app: FastAPI):
192
  """Application lifespan manager for startup and shutdown events"""
193
+ global tokenizer, model, image_text_pipeline, current_model
194
+ logger.info("🚀 Starting AI Backend Service (Hugging Face Spaces mode)...")
 
 
 
 
195
  try:
196
+ logger.info(f"📥 Loading model with transformers: {current_model}")
197
+ tokenizer = AutoTokenizer.from_pretrained(current_model)
198
+ # Hugging Face Spaces: Remove device_map and torch_dtype for CPU compatibility
199
+ model = AutoModelForCausalLM.from_pretrained(
200
+ current_model,
201
+ low_cpu_mem_usage=True,
202
+ trust_remote_code=True,
203
+ )
204
  logger.info(f"✅ Successfully loaded model and tokenizer: {current_model}")
 
205
  # Load image pipeline for multimodal support
206
  try:
207
  logger.info(f"🖼️ Initializing image captioning pipeline with model: {vision_model}")
 
210
  except Exception as e:
211
  logger.warning(f"⚠️ Could not load image captioning pipeline: {e}")
212
  image_text_pipeline = None
 
213
  except Exception as e:
214
  logger.error(f"❌ Failed to initialize model: {e}")
215
  raise RuntimeError(f"Service initialization failed: {e}")
 
238
 
239
 
240
  def ensure_model_ready():
241
+ """Check if transformers model is loaded and ready"""
242
+ if tokenizer is None or model is None:
243
+ raise HTTPException(status_code=503, detail="Service not ready - no model initialized (transformers)")
244
 
245
  def convert_messages_to_prompt(messages: List[ChatMessage]) -> str:
246
  """Convert OpenAI messages format to a single prompt string"""
 
332
 
333
 
334
  def generate_response_local(messages: List[ChatMessage], max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.95) -> str:
335
+ """Generate response using local transformers model with chat template."""
336
  ensure_model_ready()
 
337
  try:
338
+ logger.info("Generating response using transformers model")
339
  return generate_response_transformers(messages, max_tokens, temperature, top_p)
 
340
  except Exception as e:
341
  logger.error(f"Local generation failed: {e}")
342
  return "I apologize, but I'm having trouble generating a response right now. Please try again."
343
 
344
+ # GGUF/llama-cpp support removed for Hugging Face Spaces
345
 
346
  def convert_messages_to_gemma_prompt(messages: List[ChatMessage]) -> str:
347
  """Convert OpenAI messages format to Gemma 3n chat format."""
 
373
  content_str = m.content if isinstance(m.content, str) else extract_text_and_images(m.content)[0]
374
  chat_messages.append({"role": m.role, "content": content_str})
375
 
376
+ # Apply chat template and tokenize for Hugging Face Spaces CPU
377
  inputs = tokenizer.apply_chat_template(
378
  chat_messages,
379
  add_generation_prompt=True,
 
381
  return_dict=True,
382
  return_tensors="pt",
383
  )
384
+ # Pass input_ids and attention_mask directly (no .to(model.device))
385
+ outputs = model.generate(
386
+ input_ids=inputs["input_ids"],
387
+ attention_mask=inputs.get("attention_mask"),
388
+ max_new_tokens=max_tokens
389
+ )
 
390
  # Decode only the newly generated tokens (exclude input)
391
  generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
392
  return generated_text.strip()
 
448
  # ...existing code...
449
 
450
 
451
+
452
+
453
+ # --- Hugging Face Spaces: Only transformers backend supported ---
454
  @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
455
+ async def create_chat_completion(request: ChatCompletionRequest) -> ChatCompletionResponse:
456
+ """Create a chat completion (OpenAI-compatible) with multimodal support. Hugging Face Spaces: Only transformers backend supported."""
 
 
457
  try:
458
  if not request.messages:
459
  raise HTTPException(status_code=400, detail="Messages cannot be empty")
gemma_gguf_backend.py CHANGED
@@ -1,3 +1,4 @@
 
1
  #!/usr/bin/env python3
2
  """
3
  Working Gemma 3n GGUF Backend Service
 
1
+
2
  #!/usr/bin/env python3
3
  """
4
  Working Gemma 3n GGUF Backend Service
launch_vllm.py ADDED
@@ -0,0 +1,57 @@
1
+ # (Not used on Hugging Face Spaces; local vLLM launcher only)
2
+ #!/usr/bin/env python3
3
+ """
4
+ Launch vLLM OpenAI-compatible server for google/gemma-3n-E4B-it in venv.
5
+ """
6
+ from dotenv import load_dotenv
7
+ load_dotenv()
8
+ import os
9
+ import subprocess
10
+ import sys
11
+
12
+ MODEL = "google/gemma-3n-E4B-it"
13
+ PORT = os.environ.get("VLLM_PORT", "8000")
14
+ HF_TOKEN = os.environ.get("HF_TOKEN") # User must set this for gated models
15
+
16
+ if not HF_TOKEN:
17
+ print("[ERROR] Please set the HF_TOKEN environment variable for model download.")
18
+ sys.exit(1)
19
+
20
+ cmd = [
21
+ sys.executable, "-m", "vllm.entrypoints.openai.api_server",
22
+ "--model", MODEL,
23
+ "--port", PORT,
24
+ "--host", "0.0.0.0",
25
+ "--token", HF_TOKEN
26
+ ]
27
+
28
+ print(f"[INFO] Launching vLLM server for {MODEL} on port {PORT}...")
29
+ subprocess.run(cmd)
requirements.txt CHANGED
@@ -1,5 +1,13 @@
1
 
 
 
 
 
2
  transformers
3
- peft
4
  torch
5
- datasets
 
 
 
 
 
 
1
 
2
+
3
+ # Hugging Face Spaces requirements (transformers backend only)
4
+ fastapi
5
+ uvicorn
6
  transformers
 
7
  torch
8
+ python-dotenv
9
+ httpx
10
+ requests
11
+ Pillow
12
+ # Optional: gradio for demo UI
13
+ # gradio
space.yaml CHANGED
@@ -1,5 +1,3 @@
1
- sdk: fastapi
2
  python_version: 3.10
3
- app_file: gemma_gguf_backend.py
4
- env:
5
- - DEMO_MODE=0 # Ensure model loads properly in production
 
1
+ sdk: docker
2
  python_version: 3.10
3
+ entrypoint: python -m uvicorn backend_service:app --host 0.0.0.0 --port $PORT
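With `sdk: docker`, Spaces expects a `Dockerfile` at the repository root, and the project's previous Dockerfile was removed in this commit. A minimal sketch consistent with this entrypoint (the base image, port 7860, and overall layout are assumptions adapted from the removed Dockerfile):

```dockerfile
# Minimal sketch of a Spaces Dockerfile for the transformers-only backend.
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Spaces routes traffic to the exposed port; 7860 is the conventional default.
EXPOSE 7860
ENV PYTHONUNBUFFERED=1

CMD ["python", "-m", "uvicorn", "backend_service:app", "--host", "0.0.0.0", "--port", "7860"]
```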