---
title: "OpenAI-Compatible FastAPI Backend"
emoji: "πŸ€–"
colorFrom: "blue"
colorTo: "green"
sdk: "docker"
app_port: 7860
pinned: false
---
# Hugging Face Spaces: FastAPI OpenAI-Compatible Backend
This project is now ready to deploy as a Hugging Face Space using FastAPI and transformers (no vLLM, no llama-cpp/gguf).
## Features
- OpenAI-compatible `/v1/chat/completions` endpoint
- Multimodal support (text + image, if the model supports it)
- Environment variable support via `.env`
- Hugging Face Spaces compatible (CPU or T4/RTX GPU)
## Usage (Local)
```bash
pip install -r requirements.txt
python -m uvicorn backend_service:app --host 0.0.0.0 --port 7860
```
## Usage (Hugging Face Spaces)
- Push this repo to your Hugging Face Space
- Space will auto-launch with FastAPI backend
- Use `/v1/chat/completions` endpoint for OpenAI-compatible clients
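OpenAI-compatible clients can talk to the Space by overriding the base URL; a minimal sketch with the official `openai` Python package (replace `<your-space>` with your Space name; the `api_key` value is a placeholder, assuming the backend does not validate keys):

```python
# Sketch: call the Space's /v1/chat/completions via the openai client.
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-space>.hf.space/v1",
    api_key="not-needed",  # assumption: the backend does not check API keys
)

response = client.chat.completions.create(
    model="google/gemma-3n-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```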
## Notes
- Only transformers models are supported (no GGUF/llama-cpp, no vLLM)
- Set your model in the `AI_MODEL` environment variable or edit `backend_service.py`
- For secrets, use the Hugging Face Spaces Secrets UI or a `.env` file
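How the `.env` values reach the service is up to `backend_service.py`; a typical pattern, assuming `python-dotenv` (which this project may or may not use), looks like:

```python
# Assumed pattern: load .env, then read model/token settings from the environment.
import os

from dotenv import load_dotenv

load_dotenv()  # reads AI_MODEL, HF_TOKEN, ... from a local .env file if present

AI_MODEL = os.getenv("AI_MODEL", "google/gemma-3n-E4B-it")  # default is illustrative
HF_TOKEN = os.getenv("HF_TOKEN")  # optional, only needed for private models
```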
## Example curl
```bash
curl -X POST https://<your-space>.hf.space/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-3n-E4B-it", "messages": [{"role": "user", "content": "Hello!"}]}'
```
---
For more, see Hugging Face Spaces docs: https://huggingface.co/docs/hub/spaces-sdks-docker
# Fallback Logic
If vLLM fails to start or respond, the backend will automatically fall back to the legacy backend.
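A simplified sketch of that behaviour (illustrative only, not the actual backend code; `build_engine` is a hypothetical helper):

```python
# Sketch: try vLLM first, fall back to a transformers pipeline if it fails.
import logging


def build_engine(model_id: str):
    """Return a vLLM engine if it starts cleanly, otherwise a transformers pipeline."""
    try:
        from vllm import LLM  # optional dependency; may be missing or fail to initialise

        return LLM(model=model_id)
    except Exception as exc:
        logging.warning("vLLM unavailable (%s); falling back to legacy backend", exc)
        from transformers import pipeline

        return pipeline("text-generation", model=model_id)
```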
# Fine-tuning Gemma 3n E4B on MacBook M1 (Apple Silicon) with Unsloth
This project supports local fine-tuning of the Gemma 3n E4B model using Unsloth, PEFT/LoRA, and export to GGUF Q4_K_XL for efficient inference. The workflow is optimized for Apple Silicon (M1/M2/M3) and avoids CUDA/bitsandbytes dependencies.
## Prerequisites
- Python 3.10+
- macOS with Apple Silicon (M1/M2/M3)
- PyTorch with MPS backend (install via `pip install torch`)
- All dependencies in `requirements.txt` (install with `pip install -r requirements.txt`)
## Training Script Usage
Run the training script with your dataset (JSON/JSONL or Hugging Face format):
```bash
python training/train_gemma_unsloth.py \
--job-id myjob \
--output-dir training_runs/myjob \
--dataset sample_data/train.jsonl \
--prompt-field prompt --response-field response \
--epochs 1 --batch-size 1 --gradient-accumulation 8 \
--use-fp16 \
--grpo --cpt \
--export-gguf --gguf-out training_runs/myjob/adapter-gguf-q4_k_xl
```
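The `--prompt-field`/`--response-field` flags above imply one JSON object per line with `prompt` and `response` keys; a minimal sketch of producing such a file (field names assumed from the command above):

```python
# Sketch: write a tiny JSONL dataset in the prompt/response shape the flags expect.
# Assumes the sample_data/ directory already exists.
import json

samples = [
    {"prompt": "What is Gemma 3n?", "response": "Gemma 3n is a family of open models from Google."},
    {"prompt": "Say hello.", "response": "Hello!"},
]

with open("sample_data/train.jsonl", "w") as f:
    for row in samples:
        f.write(json.dumps(row) + "\n")
```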
**Flags:**
- `--grpo`: Enable GRPO (if supported by Unsloth)
- `--cpt`: Enable CPT (if supported by Unsloth)
- `--export-gguf`: Export to GGUF Q4_K_XL after training
- `--gguf-out`: Path to save GGUF file
**Notes:**
- On Mac, bitsandbytes/xformers are disabled automatically.
- Training is slower than on CUDA GPUs; use small batch sizes and gradient accumulation.
- If Unsloth's GGUF export is unavailable, follow the printed instructions to use llama.cpp's `convert-hf-to-gguf.py`.
## Troubleshooting
- If you see errors about missing CUDA or bitsandbytes, ensure you are running on Apple Silicon and have the latest Unsloth/Transformers.
- For memory errors, reduce `--batch-size` or `--cutoff-len`.
- For best results, use datasets formatted to match the official Gemma 3n chat template.
## Example: Manual GGUF Export with llama.cpp
If the script prints a message about manual conversion, run:
```bash
python convert-hf-to-gguf.py --outtype q4_k_xl --outfile training_runs/myjob/adapter-gguf-q4_k_xl training_runs/myjob/adapter
```
## References
- [Unsloth Documentation](https://unsloth.ai/)
- [Gemma 3n E4B Model Card](https://huggingface.co/unsloth/gemma-3n-E4B-it)
- [llama.cpp GGUF Export Guide](https://github.com/ggerganov/llama.cpp)
---
title: Multimodal AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: docker
app_port: 8000
pinned: false
---
# firstAI - Multimodal AI Backend πŸš€
A powerful AI backend service with **multimodal capabilities** and **advanced deployment support**, handling both text generation and image analysis with transformers pipelines.
## πŸŽ‰ Features
### πŸ€– Configurable AI Models
- **Default Text Model**: Microsoft DialoGPT-medium (deployment-friendly)
- **Advanced Models**: Support for quantized models (Unsloth, 4-bit, GGUF)
- **Environment Configuration**: Runtime model selection via environment variables
- **Quantization Support**: Automatic 4-bit quantization with fallback mechanisms
### πŸ–ΌοΈ Multimodal Support
- Process text-only messages
- Analyze images from URLs
- Combined image + text conversations
- OpenAI Vision API compatible format
### πŸš€ Production Ready
- **Enhanced Deployment**: Multi-level fallback for quantized models
- **Environment Flexibility**: Works in constrained deployment environments
- **Error Resilience**: Comprehensive error handling with graceful degradation
- FastAPI backend with automatic docs
- Health checks and monitoring
- PyTorch with MPS acceleration (Apple Silicon)
### πŸ”§ Model Configuration
Configure models via environment variables:
```bash
# Set custom text model (optional)
export AI_MODEL="microsoft/DialoGPT-medium"
# Set custom vision model (optional)
export VISION_MODEL="Salesforce/blip-image-captioning-base"
# For private models (optional)
export HF_TOKEN="your_huggingface_token"
```
**Supported Model Types:**
- Standard models: `microsoft/DialoGPT-medium`, `deepseek-ai/DeepSeek-R1-0528-Qwen3-8B`
- Quantized models: `unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit`
- GGUF models: `unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF`
## πŸš€ Quick Start
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Start the Service
```bash
python backend_service.py
```
### 3. Test Multimodal Capabilities
```bash
python test_final.py
```
The service will start on **http://localhost:8001** with both text and vision models loaded.
## πŸ’‘ Usage Examples
### Text-Only Chat
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [{"role": "user", "content": "Hello!"}]
}'
```
### Image Analysis
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Salesforce/blip-image-captioning-base",
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://example.com/image.jpg"
}
]
}
]
}'
```
### Multimodal (Image + Text)
```bash
curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Salesforce/blip-image-captioning-base",
"messages": [
{
"role": "user",
"content": [
{
"type": "image",
"url": "https://example.com/image.jpg"
},
{
"type": "text",
"text": "What do you see in this image?"
}
]
}
]
}'
```
## πŸ”§ Technical Details
### Architecture
- **FastAPI** web framework
- **Transformers** pipeline for AI models
- **PyTorch** backend with GPU/MPS support
- **Pydantic** for request/response validation
### Models
- **Text**: microsoft/DialoGPT-medium
- **Vision**: Salesforce/blip-image-captioning-base
### API Endpoints
- `GET /` - Service information
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (text/multimodal)
- `GET /docs` - Interactive API documentation
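A quick way to exercise these endpoints from Python, assuming the service is running locally on port 8001 as above and returns the standard OpenAI response shape:

```python
# Sketch: hit the health, models, and chat endpoints with requests.
import requests

BASE = "http://localhost:8001"

print(requests.get(f"{BASE}/health").json())     # health check
print(requests.get(f"{BASE}/v1/models").json())  # list available models

resp = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "microsoft/DialoGPT-medium",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])  # assumes OpenAI response shape
```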
## πŸš€ Deployment
### Environment Variables
```bash
# Optional: Custom models
export AI_MODEL="microsoft/DialoGPT-medium"
export VISION_MODEL="Salesforce/blip-image-captioning-base"
export HF_TOKEN="your_token_here" # For private models
```
### Production Deployment
The service includes enhanced deployment capabilities:
- **Quantized Model Support**: Automatic handling of 4-bit and GGUF models
- **Fallback Mechanisms**: Multi-level fallback for constrained environments
- **Error Resilience**: Graceful degradation when quantization libraries are unavailable
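A minimal sketch of that kind of fallback, assuming transformers with bitsandbytes; the actual logic lives in `backend_service.py` and may differ:

```python
# Illustrative only: try a 4-bit load, degrade to a standard load if quantization
# support (bitsandbytes/GPU) is missing in the deployment environment.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig


def load_model(model_id: str):
    try:
        quant_cfg = BitsAndBytesConfig(load_in_4bit=True)
        return AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quant_cfg)
    except Exception:
        # Graceful degradation: load the model without quantization
        return AutoModelForCausalLM.from_pretrained(model_id)
```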
### Docker Deployment
```bash
# Build and run with Docker
docker build -t firstai .
docker run -p 8000:8000 firstai
```
### Testing Deployment
```bash
# Test quantization detection and fallbacks
python test_deployment_fallbacks.py
# Test health endpoint
curl http://localhost:8000/health
```
For comprehensive deployment guidance, see `DEPLOYMENT_ENHANCEMENTS.md`.
## πŸ§ͺ Testing
Run the comprehensive test suite:
```bash
python test_final.py
```
Test individual components:
```bash
python test_multimodal.py # Basic multimodal tests
python test_pipeline.py # Pipeline compatibility
```
## πŸ“¦ Dependencies
Key packages:
- `fastapi` - Web framework
- `transformers` - AI model pipelines
- `torch` - PyTorch backend
- `Pillow` - Image processing
- `accelerate` - Model acceleration
- `requests` - HTTP client
## 🎯 Integration Complete
This project successfully integrates:
- βœ… **Transformers image-text-to-text pipeline**
- βœ… **OpenAI Vision API compatibility**
- βœ… **Multimodal message processing**
- βœ… **Production-ready FastAPI service**
See `MULTIMODAL_INTEGRATION_COMPLETE.md` for detailed integration documentation.
---
title: AI Backend Service
emoji: πŸš€
colorFrom: yellow
colorTo: purple
sdk: fastapi
sdk_version: 0.100.0
app_file: backend_service.py
pinned: false
---
# AI Backend Service πŸš€
**Status: βœ… CONVERSION COMPLETE!**
Successfully converted from a non-functioning Gradio HuggingFace app to a production-ready FastAPI backend service with OpenAI-compatible API endpoints.
## Quick Start
### 1. Setup Environment
```bash
# Activate the virtual environment
source gradio_env/bin/activate
# Install dependencies (already done)
pip install -r requirements.txt
```
### 2. Start the Backend Service
```bash
python backend_service.py --port 8000 --reload
```
### 3. Test the API
```bash
# Run comprehensive tests
python test_api.py
# Or try usage examples
python usage_examples.py
```
## API Endpoints
| Endpoint | Method | Description |
| ---------------------- | ------ | ----------------------------------- |
| `/` | GET | Service information |
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completion (OpenAI compatible) |
| `/v1/completions` | POST | Text completion |
## Example Usage
### Chat Completion
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [
{"role": "user", "content": "Hello! How are you?"}
],
"max_tokens": 150,
"temperature": 0.7
}'
```
### Streaming Chat
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "microsoft/DialoGPT-medium",
"messages": [
{"role": "user", "content": "Tell me a joke"}
],
"stream": true
}'
```
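To consume the stream from Python, something like the following should work, assuming the usual OpenAI server-sent-events framing (`data: {...}` chunks terminated by `data: [DONE]`):

```python
# Sketch: read the streamed chat completion chunk by chunk.
import json

import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "microsoft/DialoGPT-medium",
        "messages": [{"role": "user", "content": "Tell me a joke"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0].get("delta", {})
    print(delta.get("content", ""), end="", flush=True)
```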
## Files
- **`app.py`** - Original Gradio ChatInterface (still functional)
- **`backend_service.py`** - New FastAPI backend service ⭐
- **`test_api.py`** - Comprehensive API testing
- **`usage_examples.py`** - Simple usage examples
- **`requirements.txt`** - Updated dependencies
- **`CONVERSION_COMPLETE.md`** - Detailed conversion documentation
## Features
- βœ… **OpenAI-Compatible API** - Drop-in replacement for OpenAI API
- βœ… **Async FastAPI** - High-performance async architecture
- βœ… **Streaming Support** - Real-time response streaming
- βœ… **Error Handling** - Robust error handling with fallbacks
- βœ… **Production Ready** - CORS, logging, health checks
- βœ… **Docker Ready** - Easy containerization
- βœ… **Auto-reload** - Development-friendly auto-reload
- βœ… **Type Safety** - Full type hints with Pydantic validation
## Service URLs
- **Backend Service**: http://localhost:8000
- **API Documentation**: http://localhost:8000/docs
- **OpenAPI Spec**: http://localhost:8000/openapi.json
## Model Information
- **Current Model**: `microsoft/DialoGPT-medium`
- **Type**: Conversational AI model
- **Provider**: HuggingFace Inference API
- **Capabilities**: Text generation, chat completion
## Architecture
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Client Request   │───▢│  FastAPI Backend  │───▢│  HuggingFace API  β”‚
β”‚  (OpenAI format)  β”‚    β”‚ (backend_service) β”‚    β”‚ (DialoGPT-medium) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                   β”‚
                                   β–Ό
                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                         β”‚  OpenAI Response  β”‚
                         β”‚ (JSON/Streaming)  β”‚
                         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
## Development
The service includes:
- **Auto-reload** for development
- **Comprehensive logging** for debugging
- **Type checking** for code quality
- **Test suite** for reliability
- **Error handling** for robustness
## Production Deployment
Ready for production with:
- **Environment variables** for configuration
- **Health check endpoints** for monitoring
- **CORS support** for web applications
- **Docker compatibility** for containerization
- **Structured logging** for observability
---
**πŸŽ‰ Conversion Status: COMPLETE!**
Successfully transformed from broken Gradio app to production-ready AI backend service.
For detailed conversion documentation, see [`CONVERSION_COMPLETE.md`](CONVERSION_COMPLETE.md).
# Gemma 3n GGUF FastAPI Backend (Hugging Face Space)
This Space provides an OpenAI-compatible chat API for Gemma 3n GGUF models, powered by FastAPI.
**Note:** On Hugging Face Spaces, the backend runs in `DEMO_MODE` (no model loaded) for demonstration and endpoint testing. For real inference, run locally with a GGUF model and llama-cpp-python.
## Endpoints
- `/health` β€” Health check
- `/v1/chat/completions` β€” OpenAI-style chat completions (returns demo response)
- `/train/start` β€” Start a (demo) training job
- `/train/status/{job_id}` β€” Check training job status
- `/train/logs/{job_id}` β€” Get training logs
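The request schema for `/train/start` is not documented here, so the payload and field names below are purely hypothetical; the sketch only illustrates the start-then-poll flow (check the interactive `/docs` page for the real schema):

```python
# Hypothetical payload and field names; illustrates start -> poll status -> fetch logs.
import time

import requests

BASE = "https://<your-space>.hf.space"

job = requests.post(f"{BASE}/train/start", json={"dataset": "sample_data/train.jsonl"}).json()
job_id = job["job_id"]  # field name assumed

while True:
    status = requests.get(f"{BASE}/train/status/{job_id}").json()
    print(status)
    if status.get("status") in {"completed", "failed"}:
        break
    time.sleep(10)

print(requests.get(f"{BASE}/train/logs/{job_id}").text)
```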
## Usage
1. **Clone this repo** or create a Hugging Face Space (type: FastAPI).
2. All dependencies are in `requirements.txt`.
3. The Space will start in demo mode (no model download required).
## Local Inference (with GGUF)
To run with a real model locally:
1. Download a Gemma 3n GGUF model (e.g. from https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF).
2. Set `AI_MODEL` to the local path or repo.
3. Unset `DEMO_MODE`.
4. Run:
```bash
# Point AI_MODEL at a local GGUF file (example path, adjust to yours) and disable demo mode
export AI_MODEL="/path/to/gemma-3n-E4B-it.gguf"
unset DEMO_MODE
pip install -r requirements.txt
uvicorn gemma_gguf_backend:app --host 0.0.0.0 --port 8000
```
## License
Apache 2.0