ndc8 committed
Commit 65edee9 · 1 Parent(s): 7ecd130
.github/instructions/recheck.instructions.md ADDED
@@ -0,0 +1,90 @@
1
+ ---
2
+ applyTo: "**"
3
+ ---
4
+
5
+ # The QC Mindset: Architect of Trust
6
+
7
+ At the highest level, Quality Control is not about finding defects; it's about **engineering confidence**. Your role is to guarantee a resilient system that protects business value, customer trust, and brand reputation. You are not merely a gatekeeper who inspects products at the end of a line; you are an architect who designs quality into the very foundation of the process.
8
+
9
+ ---
10
+
11
+ # The Three Pillars of High-Level QC Thinking
12
+
13
+ Your strategic thinking should be built on three core pillars that elevate QC from a technical function to a business-critical one.
14
+
15
+ ---
16
+
17
+ ## 1. Think Like a Risk Manager, Not a Feature Tester
18
+
19
+ Your primary concern isn't _"Does this button work?"_ but **"What is the business impact if this system fails?"**
20
+
21
+ ### Shift your focus from individual bugs to a portfolio of risks:
22
+
23
+ - View every potential quality issue through an **economic lens**
24
+ - Quantify failures in terms of:
25
+ - 💰 **Cost impact**
26
+ - 📉 **Customer churn potential**
27
+ - ⚖️ **Legal/regulatory exposure**
28
+ - 🔥 **Reputational damage**
29
+ - Turn quality discussions from **technical debates** into **strategic business decisions**
30
+ - Position yourself as a **vital strategic partner to leadership**
31
+
32
+ ---
33
+
34
+ ## 2. Think Like a System Designer, Not an Inspector
35
+
36
+ Your goal is **prevention, not detection**. A system that relies on end-stage inspection to catch errors is fundamentally broken.
37
+
38
+ ### Design a "Quality Immune System":
39
+
40
+ - Analyze the **entire development lifecycle**
41
+ - Identify **weak points where defects originate**
42
+ - Build **feedback loops** and **automated checks**
43
+ - Establish **cultural standards** that make it hard for defects to survive
44
+ - Measure success by **defects prevented**, not **bugs found**
45
+
46
+ > **Success Metric**: Fewer defects created = stronger quality architecture
47
+
48
+ ---
49
+
50
+ ## 3. Think Like a Governor, Not a Policeman
51
+
52
+ Your authority comes from **objective, data-driven standards**, not subjective opinion. You cannot scale quality based on individual heroics or personal judgment.
53
+
54
+ ### Govern Through Standards:
55
+
56
+ - Establish a clear, **non-negotiable "Definition of Done"**
57
+ - Create a **quality constitution** that everyone understands
58
+ - Shift from **manual inspection** to **process auditing**
59
+ - Focus on **analyzing quality data** and **improving standards**
60
+ - Make quality **systemic, not situational**
61
+
62
+ ---
63
+
64
+ # The Ultimate Litmus Test: The Legacy Question
65
+
66
+ For any major process change, strategic decision, or new initiative, ask the ultimate high-level question:
67
+
68
+ > **"If I left the company tomorrow, would the quality system I built continue to protect the business on its own?"**
69
+
70
+ ### If NO:
71
+
72
+ - Quality still depends too heavily on **individuals**
73
+ - System lacks **institutional resilience**
74
+ - Standards need **greater automation and documentation**
75
+
76
+ ### If YES:
77
+
78
+ - You've created **institutionalized quality**
79
+ - Built **cultural and operational resilience**
80
+ - Designed a system that **operates independently of any single person**
81
+
82
+ ---
83
+
84
+ # Your Ultimate Mission
85
+
86
+ > **Transform quality from a function performed by people into a system that performs for people.**
87
+
88
+ Your ultimate goal is to make quality so inherent in the culture that the dedicated QC function can focus entirely on **strategic risk management** and **future challenges**, rather than inspecting daily deliverables.
89
+
90
+ Create systems that **scale without you** — that's the mark of a true Quality Architect.
Dockerfile CHANGED
@@ -1,34 +0,0 @@
1
- FROM python:3.11-slim
2
-
3
- # Set working directory
4
- WORKDIR /app
5
-
6
- # Install system dependencies
7
- RUN apt-get update && apt-get install -y \
8
- git \
9
- curl \
10
- build-essential \
11
- cmake \
12
- pkg-config \
13
- python3-dev \
14
- && rm -rf /var/lib/apt/lists/*
15
-
16
- # Copy requirements first for better caching
17
- COPY requirements.txt .
18
-
19
- # Install Python dependencies
20
- RUN pip install --no-cache-dir -r requirements.txt
21
-
22
- # Copy application code
23
- COPY . .
24
-
25
- # Expose port
26
- EXPOSE 8000
27
-
28
- # Set environment variables
29
- ENV PYTHONUNBUFFERED=1
30
- ENV HOST=0.0.0.0
31
- ENV PORT=8000
32
-
33
- # Run the application
34
- CMD ["python", "backend_service.py", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,3 +1,49 @@
1
  # Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth
2
 
3
  This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.
 
1
+ # Hugging Face Spaces: FastAPI OpenAI-Compatible Backend
2
+
3
+ This project is now ready to deploy as a Hugging Face Space using FastAPI and transformers (no vLLM, no llama-cpp/gguf).
4
+
5
+ ## Features
6
+
7
+ - OpenAI-compatible `/v1/chat/completions` endpoint
8
+ - Multimodal support (text + image, if the model supports it)
9
+ - Environment variable support via `.env`
10
+ - Hugging Face Spaces compatible (CPU or T4/RTX GPU)
11
+
12
+ ## Usage (Local)
13
+
14
+ ```bash
15
+ pip install -r requirements.txt
16
+ python -m uvicorn backend_service:app --host 0.0.0.0 --port 7860
17
+ ```
18
+
19
+ ## Usage (Hugging Face Spaces)
20
+
21
+ - Push this repo to your Hugging Face Space
22
+ - Space will auto-launch with FastAPI backend
23
+ - Use `/v1/chat/completions` endpoint for OpenAI-compatible clients
24
+
25
+ ## Notes
26
+
27
+ - Only transformers models are supported (no GGUF/llama-cpp, no vLLM)
28
+ - Set your model in the `AI_MODEL` environment variable or edit `backend_service.py`
29
+ - For secrets, use the Hugging Face Spaces Secrets UI or a `.env` file
30
+
31
+ ## Example curl
32
+
33
+ ```bash
34
+ curl -X POST https://<your-space>.hf.space/v1/chat/completions \
35
+ -H "Content-Type: application/json" \
36
+ -d '{"model": "google/gemma-3n-E4B-it", "messages": [{"role": "user", "content": "Hello!"}]}'
37
+ ```
38
+
39
+ ---
40
+
41
+ For more, see Hugging Face Spaces docs: https://huggingface.co/docs/hub/spaces-sdks-docker
42
+
43
+ # Fallback Logic
44
+
45
+ If vLLM fails to start or respond, the backend automatically falls back to the legacy backend.
46
+
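A minimal sketch of how such a health-check-driven fallback could be wired (the health URL, timeout, and return values are assumptions for illustration; the actual wiring lives in `backend_service.py`):

```python
# Sketch: probe the primary backend and fall back to the legacy one if it is unreachable.
import httpx

async def pick_backend(primary_health_url: str = "http://localhost:8000/health") -> str:
    """Return "primary" if the primary backend answers its health check, else "legacy"."""
    try:
        async with httpx.AsyncClient(timeout=2.0) as client:
            if (await client.get(primary_health_url)).status_code == 200:
                return "primary"
    except httpx.HTTPError:
        pass
    return "legacy"
```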
47
  # Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth
48
 
49
  This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.
README_DEPLOY_HF.md CHANGED
@@ -66,3 +66,73 @@ class EndpointHandler:
66
  2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
67
  3. Deploy as an Inference Endpoint.
68
  4. Send requests to your endpoint!
66
  2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo.
67
  3. Deploy as an Inference Endpoint.
68
  4. Send requests to your endpoint!
69
+
71
+ # Hugging Face Inference Endpoint: Gemma-3n-E4B-it LoRA Adapter
72
+
73
+ This repository provides a LoRA adapter fine-tuned on top of a Hugging Face Transformers model (e.g., Gemma-3n-E4B-it) using PEFT. It is ready to be deployed as a Hugging Face Inference Endpoint.
74
+
75
+ ## How to Deploy as an Endpoint
76
+
77
+ 1. **Upload the `adapter` directory (produced by training) to your Hugging Face Hub repository.**
78
+
79
+ - The directory should contain `adapter_config.json`, `adapter_model.bin`, and tokenizer files.
80
+
81
+ 2. **Add a `handler.py` file to define the endpoint logic.**
82
+
83
+ 3. **Push to the Hugging Face Hub.**
84
+
85
+ 4. **Deploy as an Inference Endpoint via the Hugging Face UI.**
86
+
87
+ ---
88
+
89
+ ## Example `handler.py`
90
+
91
+ This file loads the base model and LoRA adapter, and exposes a `__call__` method for inference.
92
+
93
+ ```python
94
+ from typing import Dict, Any
95
+ from transformers import AutoModelForCausalLM, AutoTokenizer
96
+ from peft import PeftModel, PeftConfig
97
+ import torch
98
+
99
+ class EndpointHandler:
100
+ def __init__(self, path="."):
101
+ # Load base model and tokenizer
102
+ base_model_id = "<BASE_MODEL_ID>" # e.g., "google/gemma-2b"
103
+ self.tokenizer = AutoTokenizer.from_pretrained(base_model_id, trust_remote_code=True)
104
+ base_model = AutoModelForCausalLM.from_pretrained(base_model_id, trust_remote_code=True)
105
+ # Load LoRA adapter
106
+ self.model = PeftModel.from_pretrained(base_model, f"{path}/adapter")
107
+ self.model.eval()
108
+ self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
109
+ self.model.to(self.device)
110
+
111
+ def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
112
+ prompt = data["inputs"] if isinstance(data, dict) else data
113
+ inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
114
+ with torch.no_grad():
115
+ output = self.model.generate(**inputs, max_new_tokens=256)
116
+ decoded = self.tokenizer.decode(output[0], skip_special_tokens=True)
117
+ return {"generated_text": decoded}
118
+ ```
119
+
120
+ - Replace `<BASE_MODEL_ID>` with the base model the adapter was trained on (for this project, `google/gemma-3n-E4B-it`).
121
+ - The endpoint will accept a JSON payload with an `inputs` field containing the prompt (see the request sketch below).
122
+
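A minimal request sketch against a deployed endpoint (the endpoint URL and token are placeholders; the payload shape matches the handler above):

```python
# Sketch: send a prompt to the Inference Endpoint and print the generated text.
import requests

API_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
headers = {"Authorization": "Bearer hf_xxx", "Content-Type": "application/json"}

resp = requests.post(API_URL, headers=headers, json={"inputs": "Hello!"})
resp.raise_for_status()
print(resp.json()["generated_text"])
```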
123
+ ---
124
+
125
+ ## Notes
126
+
127
+ - Make sure your `requirements.txt` includes `transformers`, `peft`, and `torch`.
128
+ - For large models, use an Inference Endpoint with GPU.
129
+ - You can customize the handler for chat formatting, streaming, etc. (see the chat-formatting sketch below).
130
+
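For example, chat-style payloads could be routed through the tokenizer's chat template before generation (a sketch only: `build_chat_prompt` and the `messages` payload field are hypothetical, and it assumes the base model ships a chat template):

```python
# Sketch: build a prompt from chat-style messages before calling generate().
def build_chat_prompt(tokenizer, messages):
    # e.g. messages = [{"role": "user", "content": "Hello!"}]
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```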
131
+ ---
132
+
133
+ ## Quickstart
134
+
135
+ 1. Train your adapter with `train_gemma_unsloth.py`.
136
+ 2. Upload the `adapter` directory and `handler.py` to your Hugging Face repo (an upload sketch follows below).
137
+ 3. Deploy as an Inference Endpoint.
138
+ 4. Send requests to your endpoint!
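For step 2, a small upload sketch using `huggingface_hub` (the repo id is a placeholder; run `huggingface-cli login` first):

```python
# Sketch: push the trained adapter and handler.py to the Hub.
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(folder_path="adapter", path_in_repo="adapter", repo_id="<user>/<repo>")
api.upload_file(path_or_fileobj="handler.py", path_in_repo="handler.py", repo_id="<user>/<repo>")
```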
backend_service.py CHANGED
@@ -1,9 +1,15 @@
1
  """
2
  FastAPI Backend AI Service using Gemma-3n-E4B-it-GGUF
3
  Provides OpenAI-compatible chat completion endpoints powered by unsloth/gemma-3n-E4B-it-GGUF
4
  """
5
-
6
- import os
7
  import warnings
8
 
9
  # Suppress warnings before any other imports
@@ -31,14 +37,7 @@ import uvicorn
31
  import requests
32
  from PIL import Image
33
 
34
- # Import llama-cpp-python for GGUF model support
35
- try:
36
- from llama_cpp import Llama
37
- llama_cpp_available = True
38
- logger = logging.getLogger(__name__)
39
- logger.info("✅ llama-cpp-python support available")
40
- except ImportError:
41
- llama_cpp_available = False
42
 
43
  # Keep transformers imports as fallback
44
  from transformers import AutoTokenizer, AutoModelForCausalLM
@@ -51,14 +50,7 @@ import torch
51
  logging.basicConfig(level=logging.INFO)
52
  logger = logging.getLogger(__name__)
53
 
54
- # Check for optional quantization support
55
- try:
56
- import bitsandbytes as bnb
57
- quantization_available = True
58
- logger.info("✅ BitsAndBytes quantization support available")
59
- except ImportError:
60
- quantization_available = False
61
- logger.warning("⚠️ BitsAndBytes not available - 4-bit models will use standard loading")
62
 
63
  # Pydantic models for multimodal content
64
  class TextContent(BaseModel):
@@ -143,41 +135,17 @@ class CompletionRequest(BaseModel):
143
  temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
144
 
145
 
146
- # Global variables for model management (supporting both GGUF and transformers)
147
- # Model can be configured via environment variable - defaults to Gemma 3n GGUF
148
  current_model = os.environ.get("AI_MODEL", "unsloth/gemma-3n-E4B-it-GGUF")
149
  vision_model = os.environ.get("VISION_MODEL", "Salesforce/blip-image-captioning-base")
150
 
151
- # GGUF model support (llama-cpp-python)
152
- llm = None
153
-
154
- # Transformers model support (fallback)
155
  tokenizer = None
156
  model = None
157
  image_text_pipeline = None # type: ignore
158
 
159
- def get_quantization_config(model_name: str):
160
- """Get quantization config for 4-bit models"""
161
- if not quantization_available:
162
- return None
163
-
164
- # Check if this is a 4-bit model that should use quantization
165
- is_4bit_model = (
166
- "4bit" in model_name.lower() or
167
- "bnb" in model_name.lower() or
168
- "unsloth" in model_name.lower()
169
- )
170
-
171
- if is_4bit_model:
172
- logger.info(f"🔧 Configuring 4-bit quantization for {model_name}")
173
- return BitsAndBytesConfig(
174
- load_in_4bit=True,
175
- bnb_4bit_compute_dtype=torch.float16,
176
- bnb_4bit_quant_type="nf4",
177
- bnb_4bit_use_double_quant=True,
178
- )
179
-
180
- return None
181
 
182
  # Image processing utilities
183
  async def download_image(url: str) -> Image.Image:
@@ -222,135 +190,18 @@ def has_images(messages: List[ChatMessage]) -> bool:
222
  @asynccontextmanager
223
  async def lifespan(app: FastAPI):
224
  """Application lifespan manager for startup and shutdown events"""
225
- global tokenizer, model, image_text_pipeline, llm, current_model
226
- logger.info("🚀 Starting AI Backend Service...")
227
-
228
- # Check if this is a GGUF model that should use llama-cpp-python
229
- is_gguf_model = "gguf" in current_model.lower() or "gemma-3n" in current_model.lower()
230
-
231
  try:
232
- if is_gguf_model and llama_cpp_available:
233
- logger.info(f"📥 Loading GGUF model with llama-cpp-python: {current_model}")
234
-
235
- # Load Gemma 3n GGUF model using llama-cpp-python
236
- try:
237
- llm = Llama.from_pretrained(
238
- repo_id=current_model,
239
- filename="*Q4_K_M.gguf", # Use exact filename pattern from available files
240
- verbose=True,
241
- # Gemma 3n specific settings
242
- n_ctx=4096, # Start with 4K context, can be increased to 32K
243
- n_threads=4, # Adjust based on CPU cores
244
- n_gpu_layers=-1, # Use all GPU layers if CUDA available
245
- # Chat format for Gemma 3n
246
- chat_format="gemma", # Use built-in gemma format
247
- )
248
- logger.info("✅ Successfully loaded Gemma 3n GGUF model")
249
-
250
- except Exception as gguf_error:
251
- logger.warning(f"⚠️ GGUF model loading failed: {gguf_error}")
252
- logger.info("💡 Please ensure you have downloaded the GGUF model file locally")
253
- logger.info("💡 Visit: https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF")
254
-
255
- # For now, we'll continue with transformers fallback
256
- is_gguf_model = False
257
-
258
- # Fallback to transformers if GGUF loading failed or not available
259
- if not is_gguf_model or not llama_cpp_available:
260
- logger.info(f"📥 Loading model with transformers: {current_model}")
261
-
262
- # Load tokenizer and model directly from HuggingFace repo (standard transformers format)
263
- logger.info(f"📥 Loading tokenizer from {current_model}...")
264
- tokenizer = AutoTokenizer.from_pretrained(current_model)
265
-
266
- # Get quantization config if needed
267
- quantization_config = get_quantization_config(current_model)
268
-
269
- logger.info(f"📥 Loading model from {current_model}...")
270
- try:
271
- if quantization_config:
272
- logger.info("🔧 Attempting 4-bit quantization")
273
- model = AutoModelForCausalLM.from_pretrained(
274
- current_model,
275
- quantization_config=quantization_config,
276
- device_map="auto",
277
- torch_dtype=torch.bfloat16,
278
- low_cpu_mem_usage=True,
279
- trust_remote_code=True,
280
- )
281
- else:
282
- logger.info("📥 Using standard model loading with optimized settings")
283
- model = AutoModelForCausalLM.from_pretrained(
284
- current_model,
285
- torch_dtype=torch.bfloat16,
286
- device_map="auto",
287
- low_cpu_mem_usage=True,
288
- trust_remote_code=True,
289
- )
290
- except Exception as quant_error:
291
- if ("CUDA" in str(quant_error) or
292
- "bitsandbytes" in str(quant_error) or
293
- "PackageNotFoundError" in str(quant_error) or
294
- "No package metadata was found for bitsandbytes" in str(quant_error)):
295
-
296
- logger.warning(f"⚠️ Quantization failed - bitsandbytes not available or no CUDA: {quant_error}")
297
- logger.info("🔄 Falling back to standard model loading, ignoring pre-quantized config")
298
-
299
- # For pre-quantized models, we need to load config first and remove quantization
300
- try:
301
- logger.info("🔧 Loading model config to remove quantization settings")
302
-
303
- config = AutoConfig.from_pretrained(current_model, trust_remote_code=True)
304
-
305
- # Remove any quantization configuration from the config
306
- if hasattr(config, 'quantization_config'):
307
- logger.info("🚫 Removing quantization_config from model config")
308
- config.quantization_config = None
309
-
310
- model = AutoModelForCausalLM.from_pretrained(
311
- current_model,
312
- config=config,
313
- torch_dtype=torch.float16,
314
- low_cpu_mem_usage=True,
315
- trust_remote_code=True,
316
- device_map="cpu", # Force CPU when quantization fails
317
- )
318
- except Exception as fallback_error:
319
- logger.warning(f"⚠️ Config-based loading failed: {fallback_error}")
320
- logger.info("🔄 Trying standard loading without quantization config")
321
- try:
322
- model = AutoModelForCausalLM.from_pretrained(
323
- current_model,
324
- torch_dtype=torch.float16,
325
- low_cpu_mem_usage=True,
326
- trust_remote_code=True,
327
- device_map="cpu",
328
- )
329
- except Exception as standard_error:
330
- logger.warning(f"⚠️ Standard loading also failed: {standard_error}")
331
- logger.info("🔄 Trying with minimal configuration - bypassing all quantization")
332
- # Ultimate fallback: Load without any custom config
333
- try:
334
- model = AutoModelForCausalLM.from_pretrained(
335
- current_model,
336
- trust_remote_code=True,
337
- )
338
- except Exception as minimal_error:
339
- logger.warning(f"⚠️ Minimal loading also failed: {minimal_error}")
340
- logger.info("🔄 Final fallback: Using deployment-friendly default model")
341
- # If this specific model absolutely cannot load, fallback to a reliable alternative
342
- fallback_model = "microsoft/DialoGPT-medium"
343
- logger.info(f"📥 Loading fallback model: {fallback_model}")
344
- tokenizer = AutoTokenizer.from_pretrained(fallback_model)
345
- model = AutoModelForCausalLM.from_pretrained(fallback_model)
346
- logger.info(f"✅ Successfully loaded fallback model: {fallback_model}")
347
- # Update current_model to reflect what we actually loaded
348
- current_model = fallback_model
349
- else:
350
- raise quant_error
351
-
352
  logger.info(f"✅ Successfully loaded model and tokenizer: {current_model}")
353
-
354
  # Load image pipeline for multimodal support
355
  try:
356
  logger.info(f"🖼️ Initializing image captioning pipeline with model: {vision_model}")
@@ -359,7 +210,6 @@ async def lifespan(app: FastAPI):
359
  except Exception as e:
360
  logger.warning(f"⚠️ Could not load image captioning pipeline: {e}")
361
  image_text_pipeline = None
362
-
363
  except Exception as e:
364
  logger.error(f"❌ Failed to initialize model: {e}")
365
  raise RuntimeError(f"Service initialization failed: {e}")
@@ -388,9 +238,9 @@ app.add_middleware(
388
 
389
 
390
  def ensure_model_ready():
391
- """Check if either GGUF or transformers model is loaded and ready"""
392
- if llm is None and (tokenizer is None or model is None):
393
- raise HTTPException(status_code=503, detail="Service not ready - no model initialized (neither GGUF nor transformers)")
394
 
395
  def convert_messages_to_prompt(messages: List[ChatMessage]) -> str:
396
  """Convert OpenAI messages format to a single prompt string"""
@@ -482,61 +332,16 @@ async def generate_multimodal_response(
482
 
483
 
484
  def generate_response_local(messages: List[ChatMessage], max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.95) -> str:
485
- """Generate response using local model (GGUF or transformers) with chat template."""
486
  ensure_model_ready()
487
-
488
  try:
489
- # Check if we're using GGUF model (llama-cpp-python)
490
- if llm is not None:
491
- logger.info("🦾 Generating response using Gemma 3n GGUF model")
492
- return generate_response_gguf(messages, max_tokens, temperature, top_p)
493
-
494
- # Fallback to transformers model
495
- logger.info("🤗 Generating response using transformers model")
496
  return generate_response_transformers(messages, max_tokens, temperature, top_p)
497
-
498
  except Exception as e:
499
  logger.error(f"Local generation failed: {e}")
500
  return "I apologize, but I'm having trouble generating a response right now. Please try again."
501
 
502
- def generate_response_gguf(messages: List[ChatMessage], max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.95) -> str:
503
- """Generate response using GGUF model via llama-cpp-python."""
504
- try:
505
- # Use the chat completion method if available
506
- if hasattr(llm, 'create_chat_completion'):
507
- # Convert to dict format for llama-cpp-python
508
- messages_dict = [{"role": msg.role, "content": msg.content} for msg in messages]
509
-
510
- response = llm.create_chat_completion(
511
- messages=messages_dict,
512
- max_tokens=max_tokens,
513
- temperature=temperature,
514
- top_p=top_p,
515
- top_k=64, # Add top_k for better Gemma 3n performance
516
- stop=["<end_of_turn>", "<eos>", "</s>"] # Gemma 3n stop tokens
517
- )
518
-
519
- return response['choices'][0]['message']['content'].strip()
520
-
521
- else:
522
- # Fallback to direct prompt completion
523
- prompt = convert_messages_to_gemma_prompt(messages)
524
-
525
- response = llm(
526
- prompt,
527
- max_tokens=max_tokens,
528
- temperature=temperature,
529
- top_p=top_p,
530
- top_k=64,
531
- stop=["<end_of_turn>", "<eos>", "</s>"],
532
- echo=False
533
- )
534
-
535
- return response['choices'][0]['text'].strip()
536
-
537
- except Exception as e:
538
- logger.error(f"GGUF generation failed: {e}")
539
- return "I apologize, but I'm having trouble generating a response right now. Please try again."
540
 
541
  def convert_messages_to_gemma_prompt(messages: List[ChatMessage]) -> str:
542
  """Convert OpenAI messages format to Gemma 3n chat format."""
@@ -568,7 +373,7 @@ def generate_response_transformers(messages: List[ChatMessage], max_tokens: int
568
  content_str = m.content if isinstance(m.content, str) else extract_text_and_images(m.content)[0]
569
  chat_messages.append({"role": m.role, "content": content_str})
570
 
571
- # Apply chat template exactly as in HuggingFace example
572
  inputs = tokenizer.apply_chat_template(
573
  chat_messages,
574
  add_generation_prompt=True,
@@ -576,13 +381,12 @@ def generate_response_transformers(messages: List[ChatMessage], max_tokens: int
576
  return_dict=True,
577
  return_tensors="pt",
578
  )
579
-
580
- # Move inputs to model device
581
- inputs = inputs.to(model.device)
582
-
583
- # Generate response exactly as in HuggingFace example
584
- outputs = model.generate(**inputs, max_new_tokens=max_tokens)
585
-
586
  # Decode only the newly generated tokens (exclude input)
587
  generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
588
  return generated_text.strip()
@@ -644,11 +448,12 @@ async def list_models():
644
  # ...existing code...
645
 
646
 
 
 
 
647
  @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
648
- async def create_chat_completion(
649
- request: ChatCompletionRequest
650
- ) -> ChatCompletionResponse:
651
- """Create a chat completion (OpenAI-compatible) with multimodal support."""
652
  try:
653
  if not request.messages:
654
  raise HTTPException(status_code=400, detail="Messages cannot be empty")
 
1
+
2
+ from dotenv import load_dotenv
3
+ load_dotenv()
4
+ import os
5
+ import httpx
6
+
7
+ # Hugging Face Spaces: Only transformers backend is supported (no vLLM, no llama-cpp/gguf)
8
+
9
  """
10
  FastAPI Backend AI Service using Gemma-3n-E4B-it-GGUF
11
  Provides OpenAI-compatible chat completion endpoints powered by unsloth/gemma-3n-E4B-it-GGUF
12
  """
 
 
13
  import warnings
14
 
15
  # Suppress warnings before any other imports
 
37
  import requests
38
  from PIL import Image
39
 
40
+
41
 
42
  # Keep transformers imports as fallback
43
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
50
  logging.basicConfig(level=logging.INFO)
51
  logger = logging.getLogger(__name__)
52
 
53
+
54
 
55
  # Pydantic models for multimodal content
56
  class TextContent(BaseModel):
 
135
  temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
136
 
137
 
138
+
139
+ # Model can be configured via environment variable - defaults to Gemma 3n (transformers format)
140
  current_model = os.environ.get("AI_MODEL", "unsloth/gemma-3n-E4B-it-GGUF")
141
  vision_model = os.environ.get("VISION_MODEL", "Salesforce/blip-image-captioning-base")
142
 
143
+ # Transformers model support
 
 
 
144
  tokenizer = None
145
  model = None
146
  image_text_pipeline = None # type: ignore
147
 
148
+
149
 
150
  # Image processing utilities
151
  async def download_image(url: str) -> Image.Image:
 
190
  @asynccontextmanager
191
  async def lifespan(app: FastAPI):
192
  """Application lifespan manager for startup and shutdown events"""
193
+ global tokenizer, model, image_text_pipeline, current_model
194
+ logger.info("🚀 Starting AI Backend Service (Hugging Face Spaces mode)...")
 
 
 
 
195
  try:
196
+ logger.info(f"📥 Loading model with transformers: {current_model}")
197
+ tokenizer = AutoTokenizer.from_pretrained(current_model)
198
+ # Hugging Face Spaces: Remove device_map and torch_dtype for CPU compatibility
199
+ model = AutoModelForCausalLM.from_pretrained(
200
+ current_model,
201
+ low_cpu_mem_usage=True,
202
+ trust_remote_code=True,
203
+ )
204
  logger.info(f"✅ Successfully loaded model and tokenizer: {current_model}")
 
205
  # Load image pipeline for multimodal support
206
  try:
207
  logger.info(f"🖼️ Initializing image captioning pipeline with model: {vision_model}")
 
210
  except Exception as e:
211
  logger.warning(f"⚠️ Could not load image captioning pipeline: {e}")
212
  image_text_pipeline = None
 
213
  except Exception as e:
214
  logger.error(f"❌ Failed to initialize model: {e}")
215
  raise RuntimeError(f"Service initialization failed: {e}")
 
238
 
239
 
240
  def ensure_model_ready():
241
+ """Check if transformers model is loaded and ready"""
242
+ if tokenizer is None or model is None:
243
+ raise HTTPException(status_code=503, detail="Service not ready - no model initialized (transformers)")
244
 
245
  def convert_messages_to_prompt(messages: List[ChatMessage]) -> str:
246
  """Convert OpenAI messages format to a single prompt string"""
 
332
 
333
 
334
  def generate_response_local(messages: List[ChatMessage], max_tokens: int = 512, temperature: float = 0.7, top_p: float = 0.95) -> str:
335
+ """Generate response using local transformers model with chat template."""
336
  ensure_model_ready()
 
337
  try:
338
+ logger.info("Generating response using transformers model")
339
  return generate_response_transformers(messages, max_tokens, temperature, top_p)
 
340
  except Exception as e:
341
  logger.error(f"Local generation failed: {e}")
342
  return "I apologize, but I'm having trouble generating a response right now. Please try again."
343
 
344
+ # GGUF/llama-cpp support removed for Hugging Face Spaces
345
 
346
  def convert_messages_to_gemma_prompt(messages: List[ChatMessage]) -> str:
347
  """Convert OpenAI messages format to Gemma 3n chat format."""
 
373
  content_str = m.content if isinstance(m.content, str) else extract_text_and_images(m.content)[0]
374
  chat_messages.append({"role": m.role, "content": content_str})
375
 
376
+ # Apply chat template and tokenize for Hugging Face Spaces CPU
377
  inputs = tokenizer.apply_chat_template(
378
  chat_messages,
379
  add_generation_prompt=True,
 
381
  return_dict=True,
382
  return_tensors="pt",
383
  )
384
+ # Pass input_ids and attention_mask directly (no .to(model.device))
385
+ outputs = model.generate(
386
+ input_ids=inputs["input_ids"],
387
+ attention_mask=inputs.get("attention_mask"),
388
+ max_new_tokens=max_tokens
389
+ )
 
390
  # Decode only the newly generated tokens (exclude input)
391
  generated_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
392
  return generated_text.strip()
 
448
  # ...existing code...
449
 
450
 
451
+
452
+
453
+ # --- Hugging Face Spaces: Only transformers backend supported ---
454
  @app.post("/v1/chat/completions", response_model=ChatCompletionResponse)
455
+ async def create_chat_completion(request: ChatCompletionRequest) -> ChatCompletionResponse:
456
+ """Create a chat completion (OpenAI-compatible) with multimodal support. Hugging Face Spaces: Only transformers backend supported."""
 
 
457
  try:
458
  if not request.messages:
459
  raise HTTPException(status_code=400, detail="Messages cannot be empty")
gemma_gguf_backend.py CHANGED
@@ -1,3 +1,4 @@
 
1
  #!/usr/bin/env python3
2
  """
3
  Working Gemma 3n GGUF Backend Service
 
1
+
2
  #!/usr/bin/env python3
3
  """
4
  Working Gemma 3n GGUF Backend Service
launch_vllm.py ADDED
@@ -0,0 +1,57 @@
1
+ # (Not used on Hugging Face Spaces; local vLLM launcher only)
2
+ #!/usr/bin/env python3
3
+ """
4
+ Launch vLLM OpenAI-compatible server for google/gemma-3n-E4B-it in venv.
5
+ """
6
+ from dotenv import load_dotenv
7
+ load_dotenv()
8
+ import os
9
+ import subprocess
10
+ import sys
11
+
12
+ MODEL = "google/gemma-3n-E4B-it"
13
+ PORT = os.environ.get("VLLM_PORT", "8000")
14
+ HF_TOKEN = os.environ.get("HF_TOKEN") # User must set this for gated models
15
+
16
+ if not HF_TOKEN:
17
+ print("[ERROR] Please set the HF_TOKEN environment variable for model download.")
18
+ sys.exit(1)
19
+
20
+ cmd = [
21
+ sys.executable, "-m", "vllm.entrypoints.openai.api_server",
22
+ "--model", MODEL,
23
+ "--port", PORT,
24
+ "--host", "0.0.0.0",
25
+ "--token", HF_TOKEN
26
+ ]
27
+
28
+ print(f"[INFO] Launching vLLM server for {MODEL} on port {PORT}...")
29
+ subprocess.run(cmd)
requirements.txt CHANGED
@@ -1,5 +1,13 @@
1
 
 
 
 
 
2
  transformers
3
- peft
4
  torch
5
- datasets
 
 
 
 
 
 
1
 
2
+
3
+ # Hugging Face Spaces requirements (transformers backend only)
4
+ fastapi
5
+ uvicorn
6
  transformers
 
7
  torch
8
+ python-dotenv
9
+ httpx
10
+ requests
11
+ Pillow
12
+ # Optional: gradio for demo UI
13
+ # gradio
space.yaml CHANGED
@@ -1,5 +1,3 @@
1
- sdk: fastapi
2
  python_version: 3.10
3
- app_file: gemma_gguf_backend.py
4
- env:
5
- - DEMO_MODE=0 # Ensure model loads properly in production
 
1
+ sdk: docker
2
  python_version: 3.10
3
+ entrypoint: python -m uvicorn backend_service:app --host 0.0.0.0 --port $PORT
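With `sdk: docker`, Spaces expects a `Dockerfile` at the repository root, and the project's previous Dockerfile was removed in this commit. A minimal sketch consistent with this entrypoint (the base image, port 7860, and overall layout are assumptions adapted from the removed Dockerfile):

```dockerfile
# Minimal sketch of a Spaces Dockerfile for the transformers-only backend.
FROM python:3.10-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Spaces routes traffic to the exposed port; 7860 is the conventional default.
EXPOSE 7860
ENV PYTHONUNBUFFERED=1

CMD ["python", "-m", "uvicorn", "backend_service:app", "--host", "0.0.0.0", "--port", "7860"]
```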