|
--- |
|
title: "OpenAI-Compatible FastAPI Backend" |
|
emoji: "π€" |
|
colorFrom: "blue" |
|
colorTo: "green" |
|
sdk: "docker" |
|
app_port: 7860 |
|
pinned: false |
|
--- |
|
|
|
# Hugging Face Spaces: FastAPI OpenAI-Compatible Backend |
|
|
|
This project is ready to deploy as a Hugging Face Space using FastAPI and transformers (no vLLM, no llama-cpp/GGUF).
|
|
|
## Features |
|
|
|
- OpenAI-compatible `/v1/chat/completions` endpoint |
|
- Multimodal support (text + image, if the model supports it)
|
- Environment variable support via `.env` |
|
- Hugging Face Spaces compatible (CPU or T4/RTX GPU) |
|
|
|
## Usage (Local) |
|
|
|
```bash |
|
pip install -r requirements.txt |
|
python -m uvicorn backend_service:app --host 0.0.0.0 --port 7860 |
|
``` |
|
|
|
## Usage (Hugging Face Spaces) |
|
|
|
- Push this repo to your Hugging Face Space |
|
- The Space will auto-launch the FastAPI backend
|
- Use `/v1/chat/completions` endpoint for OpenAI-compatible clients |
|
|
|
## Notes |
|
|
|
- Only transformers models are supported (no GGUF/llama-cpp, no vLLM) |
|
- Set your model in the `AI_MODEL` environment variable or edit `backend_service.py` |
|
- For secrets, use the Hugging Face Spaces Secrets UI or a `.env` file |
|
|
|
## Example curl |
|
|
|
```bash |
|
curl -X POST https://<your-space>.hf.space/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "google/gemma-3n-E4B-it", "messages": [{"role": "user", "content": "Hello!"}]}'
|
``` |
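Any OpenAI-compatible client can also be pointed at the Space. Below is a minimal sketch using the official `openai` Python package; the base URL and dummy API key are assumptions, since this backend does not require a real key unless you add authentication:

```python
from openai import OpenAI

# Point the client at the Space's OpenAI-compatible base URL.
# The API key is a placeholder; this backend does not validate keys by default.
client = OpenAI(base_url="https://<your-space>.hf.space/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="google/gemma-3n-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```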
|
|
|
--- |
|
|
|
For more, see Hugging Face Spaces docs: https://huggingface.co/docs/hub/spaces-sdks-docker |
|
|
|
# Fallback Logic |
|
|
|
If vLLM fails to start or respond, the backend will automatically fall back to the legacy backend.
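A minimal sketch of what that fallback pattern can look like; the callables here stand in for the real startup code, which is wired differently in `backend_service.py`:

```python
import logging

logger = logging.getLogger(__name__)

def build_engine(start_vllm, load_legacy):
    """Try vLLM first; fall back to the legacy transformers backend.

    `start_vllm` and `load_legacy` are caller-supplied callables - this is
    only an illustration of the fallback pattern, not the repo's actual code.
    """
    try:
        engine = start_vllm()
        logger.info("vLLM engine started")
        return engine
    except Exception as exc:  # vLLM missing, failed to start, or unresponsive
        logger.warning("vLLM unavailable (%s); falling back to legacy backend", exc)
        return load_legacy()
```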
|
|
|
# Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth |
|
|
|
This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies. |
|
|
|
## Prerequisites |
|
|
|
- Python 3.10+ |
|
- macOS with Apple Silicon (M1/M2/M3) |
|
- PyTorch with MPS backend (install via `pip install torch`) |
|
- All dependencies in `requirements.txt` (install with `pip install -r requirements.txt`) |
|
|
|
## Training Script Usage |
|
|
|
Run the training script with your dataset (JSON/JSONL or Hugging Face format): |
|
|
|
```bash |
|
python training/train_gemma_unsloth.py \
  --job-id myjob \
  --output-dir training_runs/myjob \
  --dataset sample_data/train.jsonl \
  --prompt-field prompt --response-field response \
  --epochs 1 --batch-size 1 --gradient-accumulation 8 \
  --use-fp16 \
  --grpo --cpt \
  --export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
|
``` |
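Each line of the JSONL dataset is a JSON object carrying the fields named by `--prompt-field`/`--response-field`. A small sketch that writes a toy `sample_data/train.jsonl` in that shape (illustrative data only):

```python
import json
from pathlib import Path

# Two toy records using the field names passed above
# (--prompt-field prompt --response-field response).
records = [
    {"prompt": "What is LoRA?", "response": "A parameter-efficient fine-tuning method."},
    {"prompt": "What is GGUF?", "response": "A single-file format for quantized llama.cpp models."},
]

path = Path("sample_data/train.jsonl")
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```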
|
|
|
**Flags:** |
|
|
|
- `--grpo`: Enable GRPO (if supported by Unsloth) |
|
- `--cpt`: Enable CPT (if supported by Unsloth) |
|
- `--export-gguf`: Export to GGUF Q4_K_XL after training |
|
- `--gguf-out`: Path to save GGUF file |
|
|
|
**Notes:** |
|
|
|
- On Mac, bitsandbytes/xformers are disabled automatically. |
|
- Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation. |
|
- If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's `convert-hf-to-gguf.py`. |
|
|
|
## Troubleshooting |
|
|
|
- If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers. |
|
- For memory errors, reduce `--batch-size` or `--cutoff-len`. |
|
- For best results, format datasets to match the official Gemma 3n chat template (see the snippet below).
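For reference, the chat template can be applied with the model's tokenizer; this sketch assumes the `unsloth/gemma-3n-E4B-it` tokenizer ships the Gemma 3n chat template:

```python
from transformers import AutoTokenizer

# Assumes the tokenizer bundles the official Gemma 3n chat template.
tokenizer = AutoTokenizer.from_pretrained("unsloth/gemma-3n-E4B-it")

messages = [{"role": "user", "content": "What is LoRA?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)  # the prompt string wrapped in the model's chat markers
```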
|
|
|
## Example: Manual GGUF Export with llama.cpp |
|
|
|
If the script prints a message about manual conversion, run: |
|
|
|
```bash |
|
python convert-hf-to-gguf.py --outtype q4_k_xl --outfile training_runs/myjob/adapter-gguf-q4_k_xl training_runs/myjob/adapter |
|
``` |
|
|
|
## References |
|
|
|
- [Unsloth Documentation](https://unsloth.ai/) |
|
- [Gemma 3n E4B Model Card](https://huggingface.co/unsloth/gemma-3n-E4B-it) |
|
- [llama.cpp GGUF Export Guide](https://github.com/ggerganov/llama.cpp) |
|
|
|
--- |
|
|
|
title: Multimodal AI Backend Service |
|
emoji: 🚀
|
colorFrom: yellow |
|
colorTo: purple |
|
sdk: docker |
|
app_port: 8000 |
|
pinned: false |
|
|
|
--- |
|
|
|
# firstAI - Multimodal AI Backend 🚀
|
|
|
A powerful AI backend service with **multimodal capabilities** and **advanced deployment support**, handling both text generation and image analysis with transformers pipelines.
|
|
|
## 🚀 Features
|
|
|
### 🤖 Configurable AI Models
|
|
|
- **Default Text Model**: Microsoft DialoGPT-medium (deployment-friendly) |
|
- **Advanced Models**: Support for quantized models (Unsloth, 4-bit, GGUF) |
|
- **Environment Configuration**: Runtime model selection via environment variables |
|
- **Quantization Support**: Automatic 4-bit quantization with fallback mechanisms |
|
|
|
### 🖼️ Multimodal Support
|
|
|
- Process text-only messages |
|
- Analyze images from URLs |
|
- Combined image + text conversations |
|
- OpenAI Vision API compatible format |
|
|
|
### Production Ready
|
|
|
- **Enhanced Deployment**: Multi-level fallback for quantized models |
|
- **Environment Flexibility**: Works in constrained deployment environments |
|
- **Error Resilience**: Comprehensive error handling with graceful degradation |
|
- FastAPI backend with automatic docs |
|
- Health checks and monitoring |
|
- PyTorch with MPS acceleration (Apple Silicon) |
|
|
|
### 🔧 Model Configuration
|
|
|
Configure models via environment variables: |
|
|
|
```bash |
|
# Set custom text model (optional) |
|
export AI_MODEL="microsoft/DialoGPT-medium" |
|
|
|
# Set custom vision model (optional) |
|
export VISION_MODEL="Salesforce/blip-image-captioning-base" |
|
|
|
# For private models (optional) |
|
export HF_TOKEN="your_huggingface_token" |
|
``` |
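Inside the service these variables are typically read once at startup; a sketch of that pattern with the documented defaults (the exact handling in `backend_service.py` may differ):

```python
import os

# Model selection with the documented defaults; HF_TOKEN stays optional.
AI_MODEL = os.getenv("AI_MODEL", "microsoft/DialoGPT-medium")
VISION_MODEL = os.getenv("VISION_MODEL", "Salesforce/blip-image-captioning-base")
HF_TOKEN = os.getenv("HF_TOKEN")  # None unless a token for private models is set

print(f"text={AI_MODEL} vision={VISION_MODEL} token_set={HF_TOKEN is not None}")
```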
|
|
|
**Supported Model Types:** |
|
|
|
- Standard models: `microsoft/DialoGPT-medium`, `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B` |
|
- Quantized models: `unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit` |
|
- GGUF models: `unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF` |
|
|
|
## 🚀 Quick Start
|
|
|
### 1. Install Dependencies |
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
### 2. Start the Service |
|
|
|
```bash |
|
python backend_service.py |
|
``` |
|
|
|
### 3. Test Multimodal Capabilities |
|
|
|
```bash |
|
python test_final.py |
|
``` |
|
|
|
The service will start on **http://localhost:8001** with both text and vision models loaded. |
|
|
|
## 💡 Usage Examples
|
|
|
### Text-Only Chat |
|
|
|
```bash |
|
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
|
``` |
|
|
|
### Image Analysis |
|
|
|
```bash |
|
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "url": "https://example.com/image.jpg"
          }
        ]
      }
    ]
  }'
|
``` |
|
|
|
### Multimodal (Image + Text) |
|
|
|
```bash |
|
curl -X POST http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image",
            "url": "https://example.com/image.jpg"
          },
          {
            "type": "text",
            "text": "What do you see in this image?"
          }
        ]
      }
    ]
  }'
|
``` |
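The same multimodal request from Python with `requests`; the payload mirrors the curl example above, and the response indexing assumes the standard OpenAI-style response shape:

```python
import requests

payload = {
    "model": "Salesforce/blip-image-captioning-base",
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://example.com/image.jpg"},
                {"type": "text", "text": "What do you see in this image?"},
            ],
        }
    ],
}

resp = requests.post("http://localhost:8001/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```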
|
|
|
## 🔧 Technical Details
|
|
|
### Architecture |
|
|
|
- **FastAPI** web framework |
|
- **Transformers** pipeline for AI models |
|
- **PyTorch** backend with GPU/MPS support |
|
- **Pydantic** for request/response validation (see the sketch below)
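For illustration, a trimmed-down sketch of what the Pydantic request validation can look like; the field names follow the OpenAI chat schema, and the actual models in `backend_service.py` may differ:

```python
from typing import List, Optional, Union
from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str                  # "system", "user", or "assistant"
    content: Union[str, list]  # plain text, or a list of multimodal parts

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    max_tokens: Optional[int] = None
    temperature: float = 0.7
    stream: bool = False
```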
|
|
|
### Models |
|
|
|
- **Text**: microsoft/DialoGPT-medium |
|
- **Vision**: Salesforce/blip-image-captioning-base |
|
|
|
### API Endpoints |
|
|
|
- `GET /` - Service information |
|
- `GET /health` - Health check |
|
- `GET /v1/models` - List available models |
|
- `POST /v1/chat/completions` - Chat completions (text/multimodal) |
|
- `GET /docs` - Interactive API documentation |
|
|
|
## 🚀 Deployment
|
|
|
### Environment Variables |
|
|
|
```bash |
|
# Optional: Custom models |
|
export AI_MODEL="microsoft/DialoGPT-medium" |
|
export VISION_MODEL="Salesforce/blip-image-captioning-base" |
|
export HF_TOKEN="your_token_here" # For private models |
|
``` |
|
|
|
### Production Deployment |
|
|
|
The service includes enhanced deployment capabilities: |
|
|
|
- **Quantized Model Support**: Automatic handling of 4-bit and GGUF models |
|
- **Fallback Mechanisms**: Multi-level fallback for constrained environments |
|
- **Error Resilience**: Graceful degradation when quantization libraries are unavailable
|
|
|
### Docker Deployment |
|
|
|
```bash |
|
# Build and run with Docker |
|
docker build -t firstai . |
|
docker run -p 8000:8000 firstai |
|
``` |
|
|
|
### Testing Deployment |
|
|
|
```bash |
|
# Test quantization detection and fallbacks |
|
python test_deployment_fallbacks.py |
|
|
|
# Test health endpoint |
|
curl http://localhost:8000/health |
|
``` |
|
|
|
For comprehensive deployment guidance, see `DEPLOYMENT_ENHANCEMENTS.md`. |
|
|
|
## 🧪 Testing
|
|
|
Run the comprehensive test suite: |
|
|
|
```bash |
|
python test_final.py |
|
``` |
|
|
|
Test individual components: |
|
|
|
```bash |
|
python test_multimodal.py # Basic multimodal tests |
|
python test_pipeline.py # Pipeline compatibility |
|
``` |
|
|
|
## 📦 Dependencies
|
|
|
Key packages: |
|
|
|
- `fastapi` - Web framework |
|
- `transformers` - AI model pipelines |
|
- `torch` - PyTorch backend |
|
- `Pillow` - Image processing |
|
- `accelerate` - Model acceleration |
|
- `requests` - HTTP client |
|
|
|
## 🎯 Integration Complete
|
|
|
This project successfully integrates: |
|
✅ **Transformers image-text-to-text pipeline**

✅ **OpenAI Vision API compatibility**

✅ **Multimodal message processing**

✅ **Production-ready FastAPI service**


See `MULTIMODAL_INTEGRATION_COMPLETE.md` for detailed integration documentation.
|
|
|
---

title: AI Backend Service

emoji: 🚀

colorFrom: yellow

colorTo: purple

sdk: fastapi

sdk_version: 0.100.0

app_file: backend_service.py

pinned: false
|
|
|
--- |
|
|
|
# AI Backend Service 🚀
|
|
|
**Status: ✅ CONVERSION COMPLETE!**


Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.
|
|
|
## Quick Start |
|
|
|
### 1. Setup Environment |
|
|
|
```bash |
|
# Activate the virtual environment |
|
source gradio_env/bin/activate |
|
|
|
# Install dependencies (already done) |
|
pip install -r requirements.txt |
|
``` |
|
|
|
### 2. Start the Backend Service |
|
|
|
```bash |
|
python backend_service.py --port 8000 --reload |
|
``` |
|
|
|
### 3. Test the API |
|
|
|
```bash |
|
# Run comprehensive tests |
|
python test_api.py |
|
|
|
# Or try usage examples |
|
python usage_examples.py |
|
``` |
|
|
|
## API Endpoints |
|
|
|
| Endpoint               | Method | Description                         |
| ---------------------- | ------ | ----------------------------------- |
| `/`                    | GET    | Service information                 |
| `/health`              | GET    | Health check                        |
| `/v1/models`           | GET    | List available models               |
| `/v1/chat/completions` | POST   | Chat completion (OpenAI compatible) |
| `/v1/completions`      | POST   | Text completion                     |
|
|
|
## Example Usage |
|
|
|
### Chat Completion |
|
|
|
```bash |
|
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
|
``` |
|
|
|
### Streaming Chat |
|
|
|
```bash |
|
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/DialoGPT-medium",
    "messages": [
      {"role": "user", "content": "Tell me a joke"}
    ],
    "stream": true
  }'
|
``` |
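Consuming the stream from Python, here via the `openai` client pointed at the local service (the base URL and dummy key are assumptions, as with any OpenAI-compatible backend):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)

# Each chunk carries a delta; print tokens as they arrive.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```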
|
|
|
## Files |
|
|
|
- **`app.py`** - Original Gradio ChatInterface (still functional) |
|
- **`backend_service.py`** - New FastAPI backend service ⭐
|
- **`test_api.py`** - Comprehensive API testing |
|
- **`usage_examples.py`** - Simple usage examples |
|
- **`requirements.txt`** - Updated dependencies |
|
- **`CONVERSION_COMPLETE.md`** - Detailed conversion documentation |
|
|
|
## Features |
|
|
|
✅ **OpenAI-Compatible API** - Drop-in replacement for OpenAI API

✅ **Async FastAPI** - High-performance async architecture

✅ **Streaming Support** - Real-time response streaming

✅ **Error Handling** - Robust error handling with fallbacks

✅ **Production Ready** - CORS, logging, health checks

✅ **Docker Ready** - Easy containerization

✅ **Auto-reload** - Development-friendly auto-reload

✅ **Type Safety** - Full type hints with Pydantic validation


## Service URLs
|
|
|
- **Backend Service**: http://localhost:8000 |
|
- **API Documentation**: http://localhost:8000/docs |
|
- **OpenAPI Spec**: http://localhost:8000/openapi.json |
|
|
|
## Model Information |
|
|
|
- **Current Model**: `microsoft/DialoGPT-medium` |
|
- **Type**: Conversational AI model |
|
- **Provider**: HuggingFace Inference API |
|
- **Capabilities**: Text generation, chat completion |
|
|
|
## Architecture |
|
|
|
``` |
|
┌──────────────────────┐      ┌──────────────────────┐      ┌──────────────────────┐
│   Client Request     │─────▶│   FastAPI Backend    │─────▶│   HuggingFace API    │
│   (OpenAI format)    │      │  (backend_service)   │      │  (DialoGPT-medium)   │
└──────────────────────┘      └──────────────────────┘      └──────────────────────┘
                                         │
                                         ▼
                              ┌──────────────────────┐
                              │   OpenAI Response    │
                              │   (JSON/Streaming)   │
                              └──────────────────────┘
|
``` |
|
|
|
## Development |
|
|
|
The service includes: |
|
|
|
- **Auto-reload** for development |
|
- **Comprehensive logging** for debugging |
|
- **Type checking** for code quality |
|
- **Test suite** for reliability |
|
- **Error handling** for robustness |
|
|
|
## Production Deployment |
|
|
|
Ready for production with: |
|
|
|
- **Environment variables** for configuration |
|
- **Health check endpoints** for monitoring |
|
- **CORS support** for web applications |
|
- **Docker compatibility** for containerization |
|
- **Structured logging** for observability |
|
|
|
--- |
|
|
|
**🎉 Conversion Status: COMPLETE!**

Successfully transformed from a broken Gradio app to a production-ready AI backend service.
|
|
|
For detailed conversion documentation, see [`CONVERSION_COMPLETE.md`](CONVERSION_COMPLETE.md). |
|
|
|
# Gemma 3n GGUF FastAPI Backend (Hugging Face Space) |
|
|
|
This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI. |
|
|
|
**Note:** On Hugging Face Spaces, the backend runs in `DEMO_MODE` (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python. |
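A rough sketch of that DEMO_MODE branch, assuming llama-cpp-python is used for real inference; the variable names, defaults, and canned reply are illustrative rather than the exact code in this Space:

```python
import os

DEMO_MODE = os.getenv("DEMO_MODE", "1") == "1"
MODEL_PATH = os.getenv("AI_MODEL", "gemma-3n-E4B-it-Q4_K_XL.gguf")  # local GGUF path when not in demo mode

if DEMO_MODE:
    llm = None  # no model loaded on Spaces; endpoints return canned demo responses
else:
    from llama_cpp import Llama  # requires llama-cpp-python and a local GGUF file
    llm = Llama(model_path=MODEL_PATH, n_ctx=4096)

def generate(prompt: str) -> str:
    if llm is None:
        return "[demo mode] no model loaded"
    out = llm(prompt, max_tokens=256)
    return out["choices"][0]["text"]
```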
|
|
|
## Endpoints |
|
|
|
- `/health` – Health check

- `/v1/chat/completions` – OpenAI-style chat completions (returns demo response)

- `/train/start` – Start a (demo) training job

- `/train/status/{job_id}` – Check training job status

- `/train/logs/{job_id}` – Get training logs
|
|
|
## Usage |
|
|
|
1. **Clone this repo** or create a Hugging Face Space (type: FastAPI). |
|
2. All dependencies are in `requirements.txt`. |
|
3. The Space will start in demo mode (no model download required). |
|
|
|
## Local Inference (with GGUF) |
|
|
|
To run with a real model locally: |
|
|
|
1. Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF). |
|
2. Set `AI_MODEL` to the local path or repo. |
|
3. Unset `DEMO_MODE`. |
|
4. Run: |
|
```bash |
|
pip install -r requirements.txt |
|
uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000 |
|
``` |
|
|
|
## License |
|
|
|
Apache 2.0 |
|
|