---
title: "OpenAI-Compatible FastAPI Backend"
emoji: "πŸ€–"
colorFrom: "blue"
colorTo: "green"
sdk: "docker"
app_port: 7860
pinned: false
---

# Hugging Face Spaces: FastAPI OpenAI-Compatible Backend

This project is ready to deploy as a Hugging Face Space using FastAPI and transformers (no vLLM, no llama-cpp/GGUF).

## Features

- OpenAI-compatible `/v1/chat/completions` endpoint
- Multimodal support (text + image, where the model supports it)
- Environment variable support via `.env`
- Hugging Face Spaces compatible (CPU or T4/RTX GPU)

## Usage (Local)

```bash
pip install -r requirements.txt
python -m uvicorn backend_service:app --host 0.0.0.0 --port 7860
```

## Usage (Hugging Face Spaces)

- Push this repo to your Hugging Face Space
- The Space will auto-launch the FastAPI backend
- Point OpenAI-compatible clients at the `/v1/chat/completions` endpoint

## Notes

- Only transformers models are supported (no GGUF/llama-cpp, no vLLM)
- Set your model in the `AI_MODEL` environment variable or edit `backend_service.py`
- For secrets, use the Hugging Face Spaces Secrets UI or a `.env` file

## Example curl

```bash
curl -X POST https://<your-space>.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3n-E4B-it", "messages": [{"role": "user", "content": "Hello!"}]}'
```
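
Equivalently, any OpenAI client library can target the Space. A minimal sketch with the `openai` Python package (the base URL placeholder matches the curl example; the API key is unused unless you add auth):

```python
from openai import OpenAI

# Point the client at the Space's OpenAI-compatible route.
client = OpenAI(base_url="https://<your-space>.hf.space/v1", api_key="unused")

resp = client.chat.completions.create(
    model="google/gemma-3n-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```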

---

For more, see Hugging Face Spaces docs: https://huggingface.co/docs/hub/spaces-sdks-docker

# Fallback Logic

If vLLM fails to start or respond, the backend will automatically fall back to the legacy backend.
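
A sketch of what such a guard can look like (illustrative only; the helper name and model ID are placeholders, and the real logic lives in the backend):

```python
import logging

from transformers import pipeline

logger = logging.getLogger(__name__)

def create_engine(model_id: str = "google/gemma-3n-E4B-it"):
    """Try vLLM first; fall back to the legacy transformers backend on any failure."""
    try:
        from vllm import LLM  # ImportError if vLLM is not installed
        return LLM(model=model_id)  # may also raise if the engine fails to start
    except Exception as exc:  # broad on purpose: any startup failure triggers fallback
        logger.warning("vLLM unavailable (%s); using legacy backend", exc)
        return pipeline("text-generation", model=model_id)
```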

# Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth

This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.

## Prerequisites

- Python 3.10+
- macOS with Apple Silicon (M1/M2/M3)
- PyTorch with MPS backend (install via `pip install torch`)
- All dependencies in `requirements.txt` (install with `pip install -r requirements.txt`)
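
A quick sanity check that the MPS backend is actually usable before starting a run:

```python
import torch

# Verify the Metal (MPS) backend before training; otherwise runs fall back to CPU.
if torch.backends.mps.is_available():
    print("MPS available:", torch.ones(1, device="mps"))
else:
    print("MPS not available; training will run on CPU.")
```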

## Training Script Usage

Run the training script with your dataset (JSON/JSONL or Hugging Face format):

```bash
python training/train_gemma_unsloth.py \
  --job-id myjob \
  --output-dir training_runs/myjob \
  --dataset sample_data/train.jsonl \
  --prompt-field prompt --response-field response \
  --epochs 1 --batch-size 1 --gradient-accumulation 8 \
  --use-fp16 \
  --grpo --cpt \
  --export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
```
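
The JSONL records just need the fields named by `--prompt-field`/`--response-field`. A sketch of producing a compatible file (the example record is illustrative):

```python
import json

# Field names must match --prompt-field / --response-field above.
records = [
    {"prompt": "What is GGUF?", "response": "A binary model format used by llama.cpp."},
]
with open("sample_data/train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```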

**Flags:**

- `--grpo`: Enable GRPO (Group Relative Policy Optimization), if supported by the installed Unsloth version
- `--cpt`: Enable CPT (continued pretraining), if supported by the installed Unsloth version
- `--export-gguf`: Export to GGUF Q4_K_XL after training
- `--gguf-out`: Path to save GGUF file

**Notes:**

- On Mac, bitsandbytes/xformers are disabled automatically.
- Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation.
- If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's `convert-hf-to-gguf.py`.

## Troubleshooting

- If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers.
- For memory errors, reduce `--batch-size` or `--cutoff-len`.
- For best results, use datasets formatted to match the official Gemma 3n chat template.
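
For the last point, one way to check your formatting is to render a record through the model's own chat template with Transformers (tokenizer repo taken from the references below):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3n-E4B-it")

messages = [
    {"role": "user", "content": "What is GGUF?"},
    {"role": "assistant", "content": "A binary model format used by llama.cpp."},
]
# Renders the conversation exactly as the model expects to see it in training.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```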

## Example: Manual GGUF Export with llama.cpp

If the script prints a message about manual conversion, run the llama.cpp conversion and quantization steps. Note that `convert-hf-to-gguf.py` does not emit Q4_K variants directly, and Q4_K_XL is an Unsloth-specific dynamic quant; `Q4_K_M` is the closest standard llama.cpp type:

```bash
python convert-hf-to-gguf.py --outtype f16 --outfile training_runs/myjob/adapter-f16.gguf training_runs/myjob/adapter
./llama-quantize training_runs/myjob/adapter-f16.gguf training_runs/myjob/adapter-q4_k_m.gguf Q4_K_M
```

## References

- [Unsloth Documentation](https://unsloth.ai/)
- [Gemma 3n E4B Model Card](https://huggingface.co/unsloth/gemma-3n-E4B-it)
- [llama.cpp GGUF Export Guide](https://github.com/ggerganov/llama.cpp)

---

title: Multimodal AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: docker
app_port: 8000
pinned: false

---

# firstAI - Multimodal AI Backend πŸš€

A powerful AI backend service with **multimodal capabilities** and **advanced deployment support** - supporting both text generation and image analysis using transformers pipelines.

## πŸŽ‰ Features

### πŸ€– Configurable AI Models

- **Default Text Model**: Microsoft DialoGPT-medium (deployment-friendly)
- **Advanced Models**: Support for quantized models (Unsloth, 4-bit, GGUF)
- **Environment Configuration**: Runtime model selection via environment variables
- **Quantization Support**: Automatic 4-bit quantization with fallback mechanisms

### πŸ–ΌοΈ Multimodal Support

- Process text-only messages
- Analyze images from URLs
- Combined image + text conversations
- OpenAI Vision API compatible format

### 🏭 Production Ready

- **Enhanced Deployment**: Multi-level fallback for quantized models
- **Environment Flexibility**: Works in constrained deployment environments
- **Error Resilience**: Comprehensive error handling with graceful degradation
- FastAPI backend with automatic docs
- Health checks and monitoring
- PyTorch with MPS acceleration (Apple Silicon)

### πŸ”§ Model Configuration

Configure models via environment variables:

```bash
# Set custom text model (optional)
export AI_MODEL="microsoft/DialoGPT-medium"

# Set custom vision model (optional)
export VISION_MODEL="Salesforce/blip-image-captioning-base"

# For private models (optional)
export HF_TOKEN="your_huggingface_token"
```

**Supported Model Types:**

- Standard models: `microsoft/DialoGPT-medium`, `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
- Quantized models: `unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit`
- GGUF models: `unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF`
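
A sketch of the load-with-fallback pattern this implies (illustrative helper; assumes `bitsandbytes` may be absent in the deployment environment):

```python
import torch
from transformers import AutoModelForCausalLM

def load_model(model_id: str):
    """Try 4-bit quantized loading first; degrade to full precision on failure."""
    try:
        from transformers import BitsAndBytesConfig  # needs bitsandbytes at load time
        quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
        return AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant)
    except Exception:
        # bitsandbytes missing or hardware unsupported: load unquantized instead.
        return AutoModelForCausalLM.from_pretrained(model_id)
```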

## πŸš€ Quick Start

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Start the Service

```bash
python backend_service.py
```

### 3. Test Multimodal Capabilities

```bash
python test_final.py
```

The service will start on **http://localhost:8001** with both text and vision models loaded.

## πŸ’‘ Usage Examples

### Text-Only Chat

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

### Image Analysis

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "url": "https://example.com/image.jpg"
          }
        ]
      }
    ]
  }'
```

### Multimodal (Image + Text)

```bash
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "url": "https://example.com/image.jpg"
          },
          {
            "type": "text",
            "text": "What do you see in this image?"
          }
        ]
      }
    ]
  }'
```
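
Internally, the backend has to split such OpenAI Vision-style content parts back into images and text. A sketch of that normalization (hypothetical helper; the real parsing lives in `backend_service.py`):

```python
def split_content(content):
    """Separate OpenAI Vision-style content into image URLs and combined text."""
    if isinstance(content, str):  # plain text-only message
        return [], content
    images = [p["url"] for p in content if p.get("type") == "image"]
    text = " ".join(p["text"] for p in content if p.get("type") == "text")
    return images, text

# The multimodal payload above yields one image URL and one question.
imgs, txt = split_content([
    {"type": "image", "url": "https://example.com/image.jpg"},
    {"type": "text", "text": "What do you see in this image?"},
])
```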

## πŸ”§ Technical Details

### Architecture

- **FastAPI** web framework
- **Transformers** pipeline for AI models
- **PyTorch** backend with GPU/MPS support
- **Pydantic** for request/response validation

### Models

- **Text**: microsoft/DialoGPT-medium
- **Vision**: Salesforce/blip-image-captioning-base

### API Endpoints

- `GET /` - Service information
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (text/multimodal)
- `GET /docs` - Interactive API documentation
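
A minimal sketch of this service shape (illustrative only; the real app in `backend_service.py` adds model loading and the chat logic):

```python
from fastapi import FastAPI

app = FastAPI(title="firstAI")

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/v1/models")
def list_models():
    # Mirrors the OpenAI /v1/models response shape.
    return {"object": "list", "data": [{"id": "microsoft/DialoGPT-medium", "object": "model"}]}
```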

## πŸš€ Deployment

### Environment Variables

```bash
# Optional: Custom models
export AI_MODEL="microsoft/DialoGPT-medium"
export VISION_MODEL="Salesforce/blip-image-captioning-base"
export HF_TOKEN="your_token_here"  # For private models
```

### Production Deployment

The service includes enhanced deployment capabilities:

- **Quantized Model Support**: Automatic handling of 4-bit and GGUF models
- **Fallback Mechanisms**: Multi-level fallback for constrained environments
- **Error Resilience**: Graceful degradation when quantization libraries unavailable

### Docker Deployment

```bash
# Build and run with Docker
docker build -t firstai .
docker run -p 8000:8000 firstai
```

### Testing Deployment

```bash
# Test quantization detection and fallbacks
python test_deployment_fallbacks.py

# Test health endpoint
curl http://localhost:8000/health
```

For comprehensive deployment guidance, see `DEPLOYMENT_ENHANCEMENTS.md`.

## πŸ§ͺ Testing

Run the comprehensive test suite:

```bash
python test_final.py
```

Test individual components:

```bash
python test_multimodal.py  # Basic multimodal tests
python test_pipeline.py    # Pipeline compatibility
```

## πŸ“¦ Dependencies

Key packages:

- `fastapi` - Web framework
- `transformers` - AI model pipelines
- `torch` - PyTorch backend
- `Pillow` - Image processing
- `accelerate` - Model acceleration
- `requests` - HTTP client

## 🎯 Integration Complete

This project successfully integrates:
βœ… **Transformers image-text-to-text pipeline**  
βœ… **OpenAI Vision API compatibility**  
βœ… **Multimodal message processing**  
βœ… **Production-ready FastAPI service**

See `MULTIMODAL_INTEGRATION_COMPLETE.md` for detailed integration documentation.

---

title: AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: fastapi
sdk_version: 0.100.0
app_file: backend_service.py
pinned: false

---

# AI Backend Service πŸš€

**Status: βœ… CONVERSION COMPLETE!**

Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.

## Quick Start

### 1. Setup Environment

```bash
# Activate the virtual environment
source gradio_env/bin/activate

# Install dependencies (already done)
pip install -r requirements.txt
```

### 2. Start the Backend Service

```bash
python backend_service.py --port 8000 --reload
```

### 3. Test the API

```bash
# Run comprehensive tests
python test_api.py

# Or try usage examples
python usage_examples.py
```

## API Endpoints

| Endpoint               | Method | Description                         |
| ---------------------- | ------ | ----------------------------------- |
| `/`                    | GET    | Service information                 |
| `/health`              | GET    | Health check                        |
| `/v1/models`           | GET    | List available models               |
| `/v1/chat/completions` | POST   | Chat completion (OpenAI compatible) |
| `/v1/completions`      | POST   | Text completion                     |

## Example Usage

### Chat Completion

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
```

### Streaming Chat

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Tell me a joke"}
    ],
    "stream": true
  }'
```
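
The same stream can be consumed from Python with the `openai` package (a sketch; the API key is unused unless you add auth):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; some chunks may be empty.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```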

## Files

- **`app.py`** - Original Gradio ChatInterface (still functional)
- **`backend_service.py`** - New FastAPI backend service ⭐
- **`test_api.py`** - Comprehensive API testing
- **`usage_examples.py`** - Simple usage examples
- **`requirements.txt`** - Updated dependencies
- **`CONVERSION_COMPLETE.md`** - Detailed conversion documentation

## Features

βœ… **OpenAI-Compatible API** - Drop-in replacement for OpenAI API  
βœ… **Async FastAPI** - High-performance async architecture  
βœ… **Streaming Support** - Real-time response streaming  
βœ… **Error Handling** - Robust error handling with fallbacks  
βœ… **Production Ready** - CORS, logging, health checks  
βœ… **Docker Ready** - Easy containerization  
βœ… **Auto-reload** - Development-friendly auto-reload  
βœ… **Type Safety** - Full type hints with Pydantic validation

## Service URLs

- **Backend Service**: http://localhost:8000
- **API Documentation**: http://localhost:8000/docs
- **OpenAPI Spec**: http://localhost:8000/openapi.json

## Model Information

- **Current Model**: `microsoft/DialoGPT-medium`
- **Type**: Conversational AI model
- **Provider**: HuggingFace Inference API
- **Capabilities**: Text generation, chat completion

## Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Client Request    │───▢│   FastAPI Backend    │───▢│  HuggingFace API    β”‚
β”‚  (OpenAI format)    β”‚    β”‚  (backend_service)   β”‚    β”‚  (DialoGPT-medium)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                       β”‚
                                       β–Ό
                           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                           β”‚   OpenAI Response    β”‚
                           β”‚   (JSON/Streaming)   β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

## Development

The service includes:

- **Auto-reload** for development
- **Comprehensive logging** for debugging
- **Type checking** for code quality
- **Test suite** for reliability
- **Error handling** for robustness

## Production Deployment

Ready for production with:

- **Environment variables** for configuration
- **Health check endpoints** for monitoring
- **CORS support** for web applications
- **Docker compatibility** for containerization
- **Structured logging** for observability

---

**πŸŽ‰ Conversion Status: COMPLETE!**  
Successfully transformed from broken Gradio app to production-ready AI backend service.

For detailed conversion documentation, see [`CONVERSION_COMPLETE.md`](CONVERSION_COMPLETE.md).

# Gemma 3n GGUF FastAPI Backend (Hugging Face Space)

This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI.

**Note:** On Hugging Face Spaces, the backend runs in `DEMO_MODE` (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python.

## Endpoints

- `/health` β€” Health check
- `/v1/chat/completions` β€” OpenAI-style chat completions (returns demo response)
- `/train/start` β€” Start a (demo) training job
- `/train/status/{job_id}` β€” Check training job status
- `/train/logs/{job_id}` β€” Get training logs
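
A sketch of driving the training endpoints with `requests` (the request payload and response fields here are assumptions; check the app source for the real schema):

```python
import time

import requests

BASE = "http://localhost:8000"

# Hypothetical payload; the demo backend may expect different fields.
job = requests.post(f"{BASE}/train/start", json={"dataset": "sample_data/train.jsonl"}).json()
job_id = job["job_id"]  # assumed response field

while True:
    status = requests.get(f"{BASE}/train/status/{job_id}").json()
    if status.get("state") in ("completed", "failed"):  # assumed terminal states
        break
    time.sleep(5)

print(requests.get(f"{BASE}/train/logs/{job_id}").text)
```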

## Usage

1. **Clone this repo** or create a Hugging Face Space (type: FastAPI).
2. All dependencies are in `requirements.txt`.
3. The Space will start in demo mode (no model download required).

## Local Inference (with GGUF)

To run with a real model locally:

1. Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF).
2. Set `AI_MODEL` to the local path or repo.
3. Unset `DEMO_MODE`.
4. Run:
   ```bash
   pip install -r requirements.txt
   uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000
   ```
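
For direct local inference without the HTTP layer, a minimal llama-cpp-python sketch (the model path is a placeholder for whichever GGUF file you downloaded):

```python
from llama_cpp import Llama

llm = Llama(model_path="./gemma-3n-E4B-it-Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```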

## License

Apache 2.0