Avinyaa committed on
Commit a7aae29 · 1 Parent(s): 703fff1
Files changed (6)
  1. Dockerfile +18 -4
  2. README.md +191 -149
  3. app.py +324 -83
  4. client_example.py +183 -45
  5. requirements.txt +14 -4
  6. test.py +135 -18
Dockerfile CHANGED
@@ -4,7 +4,12 @@ FROM python:3.11
4
  RUN useradd -m -u 1000 user
5
 
6
  # Install system dependencies as root
7
- RUN apt-get update && apt-get install -y git git-lfs espeak-ng && rm -rf /var/lib/apt/lists/*
8
 
9
  # Initialize git lfs
10
  RUN git lfs install
@@ -15,7 +20,10 @@ USER user
15
  # Set home to the user's home directory
16
  ENV HOME=/home/user \
17
  PATH=/home/user/.local/bin:$PATH \
18
- NUMBA_DISABLE_JIT=1
 
 
 
19
 
20
  # Set the working directory to the user's home directory
21
  WORKDIR $HOME/app
@@ -27,11 +35,17 @@ RUN pip install --no-cache-dir --upgrade pip
27
  COPY --chown=user requirements.txt .
28
  RUN pip install --no-cache-dir -r requirements.txt
29
30
  # Copy the current directory contents into the container at $HOME/app setting the owner to the user
31
  COPY --chown=user . $HOME/app
32
 
33
  # Expose the port
34
  EXPOSE 7860
35
 
36
- # Default command - use startup.py for debugging if needed
37
- CMD ["python", "startup.py"]
 
4
  RUN useradd -m -u 1000 user
5
 
6
  # Install system dependencies as root
7
+ RUN apt-get update && apt-get install -y \
8
+ git \
9
+ git-lfs \
10
+ espeak-ng \
11
+ ffmpeg \
12
+ && rm -rf /var/lib/apt/lists/*
13
 
14
  # Initialize git lfs
15
  RUN git lfs install
 
20
  # Set home to the user's home directory
21
  ENV HOME=/home/user \
22
  PATH=/home/user/.local/bin:$PATH \
23
+ COQUI_TOS_AGREED=1 \
24
+ NUMBA_DISABLE_JIT=1 \
25
+ FORCE_CPU=true \
26
+ CUDA_VISIBLE_DEVICES=""
27
 
28
  # Set the working directory to the user's home directory
29
  WORKDIR $HOME/app
 
35
  COPY --chown=user requirements.txt .
36
  RUN pip install --no-cache-dir -r requirements.txt
37
 
38
+ # Download unidic for mecab (required for some TTS features)
39
+ RUN python -m unidic download
40
+
41
+ # Clone the C3PO XTTS model
42
+ RUN git clone https://huggingface.co/Borcherding/XTTS-v2_C3PO XTTS-v2_C3PO
43
+
44
  # Copy the current directory contents into the container at $HOME/app setting the owner to the user
45
  COPY --chown=user . $HOME/app
46
 
47
  # Expose the port
48
  EXPOSE 7860
49
 
50
+ # Start the API directly
51
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,244 +1,286 @@
1
  ---
2
- title: Kokoro TTS API
3
- emoji: 🎤
4
  colorFrom: indigo
5
  colorTo: yellow
6
  sdk: docker
7
  pinned: false
8
  ---
9
 
10
- # Kokoro TTS API
11
 
12
- A FastAPI-based Text-to-Speech API using Kokoro, an open-weight TTS model with 82 million parameters.
13
 
14
  ## Features
15
 
16
- - Convert text to speech using Kokoro TTS
17
- - Multiple voice options (af_heart, af_sky, af_bella, etc.)
18
- - Automatic language detection
19
- - RESTful API with automatic documentation
20
- - Docker support
21
- - Lightweight and fast processing
22
- - Apache-licensed weights
23
- - Optimized for Hugging Face Spaces deployment
24
 
25
- ## About Kokoro
26
 
27
- [Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.
28
 
29
- ## Setup
30
-
31
- ### Hugging Face Spaces Deployment
32
-
33
- This API is optimized for Hugging Face Spaces deployment. The Docker configuration automatically handles:
34
- - Cache directory setup with proper permissions
35
- - Environment variable configuration
36
- - Model downloading and caching
37
-
38
- Simply deploy to Hugging Face Spaces using the Docker SDK.
39
 
40
- #### Troubleshooting on HF Spaces
41
 
42
- If you encounter permission errors, you can use the diagnostic startup script:
43
 
44
- 1. Change the Dockerfile CMD to: `CMD ["python", "startup.py"]`
45
- 2. This will run diagnostics and show detailed information about the environment
46
-
47
- ### Local Development
48
-
49
- 1. Install system dependencies:
50
  ```bash
51
- # On Ubuntu/Debian
52
- sudo apt-get install espeak-ng
53
-
54
- # On macOS
55
- brew install espeak
56
  ```
57
 
58
- 2. Install Python dependencies:
59
- ```bash
60
- pip install -r requirements.txt
61
- ```
62
 
63
- 3. Run the API:
64
  ```bash
65
- uvicorn app:app --host 0.0.0.0 --port 7860
66
- ```
67
-
68
- The API will be available at `http://localhost:7860`
69
-
70
- ### Using Docker
71
-
72
- 1. Build the Docker image:
73
- ```bash
74
- docker build -t kokoro-tts-api .
75
- ```
76
-
77
- 2. Run the container:
78
- ```bash
79
- docker run -p 7860:7860 kokoro-tts-api
80
  ```
81
 
82
  ## API Endpoints
83
 
84
- ### Health Check
85
- - **GET** `/health` - Check API status and device information
86
-
87
- ### Available Voices
88
- - **GET** `/voices` - Get list of available voices
 
89
 
90
- ### Text-to-Speech (Form Data)
91
- - **POST** `/tts` - Convert text to speech using form data
92
  - **Parameters:**
93
- - `text` (form): Text to convert to speech
94
- - `voice` (form): Voice to use (default: "af_heart")
95
- - `lang_code` (form): Language code (default: "a" for auto-detect)
 
 
96
 
97
- ### Text-to-Speech (JSON)
98
  - **POST** `/tts-json` - Convert text to speech using JSON request body
99
- - **Body:** JSON object with `text`, `voice`, and `lang_code` fields
 
100
 
101
- ### API Documentation
 
 
102
  - **GET** `/docs` - Interactive API documentation (Swagger UI)
103
- - **GET** `/redoc` - Alternative API documentation
104
-
105
- ## Available Voices
106
-
107
- - `af_heart` - Female voice (Heart)
108
- - `af_sky` - Female voice (Sky)
109
- - `af_bella` - Female voice (Bella)
110
- - `af_sarah` - Female voice (Sarah)
111
- - `af_nicole` - Female voice (Nicole)
112
- - `am_adam` - Male voice (Adam)
113
- - `am_michael` - Male voice (Michael)
114
- - `am_edward` - Male voice (Edward)
115
- - `am_lewis` - Male voice (Lewis)
116
 
117
  ## Usage Examples
118
 
119
- ### Using Python requests (Form Data)
120
 
121
  ```python
122
  import requests
123
 
124
- # Prepare the request
125
- url = "http://localhost:7860/tts"
126
  data = {
127
- "text": "Hello, this is Kokoro TTS in action!",
128
- "voice": "af_heart",
129
- "lang_code": "a"
130
  }
131
 
132
- # Make the request
133
  response = requests.post(url, data=data)
134
 
135
- # Save the generated audio
136
  if response.status_code == 200:
137
- with open("kokoro_output.wav", "wb") as f:
138
  f.write(response.content)
139
- print("Speech generated successfully!")
140
  ```
141
 
142
- ### Using Python requests (JSON)
143
 
144
  ```python
145
  import requests
146
 
147
- # Prepare the JSON request
148
- url = "http://localhost:7860/tts-json"
149
  data = {
150
- "text": "Kokoro delivers high-quality speech synthesis!",
151
- "voice": "af_bella",
152
- "lang_code": "a"
153
  }
154
 
155
- headers = {"Content-Type": "application/json"}
156
-
157
- # Make the request
158
- response = requests.post(url, json=data, headers=headers)
159
 
160
- # Save the generated audio
161
  if response.status_code == 200:
162
- with open("kokoro_json_output.wav", "wb") as f:
163
  f.write(response.content)
164
- print("Speech generated successfully!")
165
  ```
166
 
167
- ### Using curl (Form Data)
168
 
169
- ```bash
170
- curl -X POST "http://localhost:7860/tts" \
171
- -F "text=Hello from Kokoro TTS!" \
172
- -F "voice=af_heart" \
173
- -F "lang_code=a" \
174
- --output kokoro_speech.wav
 
175
  ```
176
 
177
- ### Using curl (JSON)
178
179
  ```bash
180
- curl -X POST "http://localhost:7860/tts-json" \
181
- -H "Content-Type: application/json" \
182
- -d '{"text":"Hello from Kokoro TTS!","voice":"af_heart","lang_code":"a"}' \
183
- --output kokoro_speech.wav
 
184
  ```
185
 
186
- ### Get Available Voices
187
188
  ```bash
189
- curl http://localhost:7860/voices
190
  ```
191
 
192
- ### Using the provided client example
193
 
194
  ```bash
195
- python client_example.py
 
 
196
  ```
197
 
198
- ## Requirements
 
 
199
 
200
- - Python 3.11+
201
- - espeak-ng system package
202
- - CUDA-compatible GPU (optional, for faster processing)
 
 
203
 
204
  ## Model Information
205
 
206
- This API uses Kokoro TTS, which:
207
- - Has 82 million parameters
208
- - Supports multiple voices and languages
209
- - Provides fast, high-quality speech synthesis
210
- - Uses Apache-licensed weights
211
- - Requires minimal system resources compared to larger models
212
 
213
  ## Testing
214
 
215
- Run the standalone test:
216
  ```bash
 
217
  python test.py
 
 
 
218
  ```
219
 
220
- Run the installation test:
221
- ```bash
222
- python test_kokoro_install.py
223
  ```
224
 
225
- For debugging on Hugging Face Spaces:
226
- ```bash
227
- python startup.py
 
 
228
  ```
229
 
230
- This will generate audio files demonstrating Kokoro's capabilities.
231
 
232
- ## Environment Variables
233
 
234
- The following environment variables are automatically configured:
235
 
236
- - `HF_HOME=/tmp/hf_cache` - Hugging Face cache directory
237
- - `TRANSFORMERS_CACHE=/tmp/hf_cache` - Transformers cache
238
- - `HF_HUB_CACHE=/tmp/hf_cache` - HF Hub cache
239
- - `TORCH_HOME=/tmp/torch_cache` - PyTorch cache
240
- - `NUMBA_CACHE_DIR=/tmp/numba_cache` - Numba cache
241
- - `NUMBA_DISABLE_JIT=1` - Disable Numba JIT compilation
242
 
243
- These are set automatically by the application for optimal performance on Hugging Face Spaces.
 
 
244
 
 
1
  ---
2
+ title: XTTS C3PO Voice Cloning API
3
+ emoji: 🤖
4
  colorFrom: indigo
5
  colorTo: yellow
6
  sdk: docker
7
  pinned: false
8
  ---
9
 
10
+ # XTTS C3PO Voice Cloning API
11
 
12
+ A FastAPI-based Text-to-Speech API using XTTS-v2 with the iconic C3PO voice from Star Wars.
13
 
14
  ## Features
15
 
16
+ - **C3PO Voice**: Pre-loaded with the iconic C3PO voice from Star Wars
17
+ - **Custom Voice Cloning**: Upload your own reference audio for voice cloning
18
+ - **Multilingual Support**: 16+ languages with C3PO voice
19
+ - **No Upload Required**: Use C3PO voice without any file uploads
20
+ - **RESTful API**: Clean API with automatic documentation
21
+ - **Docker Support**: Optimized for Hugging Face Spaces deployment
22
+ - **PyTorch 2.6 Compatible**: Includes compatibility fixes
 
23
 
24
+ ## About the C3PO Model
25
 
26
+ This API uses the XTTS-v2 C3PO model from [Borcherding/XTTS-v2_C3PO](https://huggingface.co/Borcherding/XTTS-v2_C3PO), which provides the iconic voice of C-3PO from Star Wars. The model supports:
27
 
28
+ - High-quality C3PO voice synthesis
29
+ - Multilingual C3PO speech (16+ languages)
30
+ - Custom voice cloning capabilities
31
+ - Real-time speech generation
32
 
33
+ ## Quick Start
34
 
35
+ ### Using C3PO Voice (No Upload Required)
36
37
  ```bash
38
+ curl -X POST "http://localhost:7860/tts-c3po" \
39
+ -F "text=Hello there! I am C-3PO, human-cyborg relations." \
40
+ -F "language=en" \
41
+ --output c3po_speech.wav
 
42
  ```
43
 
44
+ ### Using Custom Voice Cloning
 
 
 
45
 
 
46
  ```bash
47
+ curl -X POST "http://localhost:7860/tts" \
48
+ -F "text=This will be spoken in your custom voice!" \
49
+ -F "language=en" \
50
+ -F "speaker_file=@your_reference_voice.wav" \
51
+ --output custom_speech.wav
52
  ```
53
 
54
  ## API Endpoints
55
 
56
+ ### C3PO Voice Only
57
+ - **POST** `/tts-c3po` - Generate speech using C3PO voice (no file upload needed)
58
+ - **Parameters:**
59
+ - `text` (form): Text to convert to speech (max 500 characters)
60
+ - `language` (form): Language code (default: "en")
61
+ - `no_lang_auto_detect` (form): Disable automatic language detection
62
 
63
+ ### Voice Cloning with Fallback
64
+ - **POST** `/tts` - Convert text to speech with optional custom voice
65
  - **Parameters:**
66
+ - `text` (form): Text to convert to speech (max 500 characters)
67
+ - `language` (form): Language code (default: "en")
68
+ - `voice_cleanup` (form): Apply audio cleanup to reference voice
69
+ - `no_lang_auto_detect` (form): Disable automatic language detection
70
+ - `speaker_file` (file, optional): Reference speaker audio file (uses C3PO if not provided)
71
 
72
+ ### JSON API
73
  - **POST** `/tts-json` - Convert text to speech using JSON request body
74
+ - **Body:** JSON object with `text`, `language`, `voice_cleanup`, `no_lang_auto_detect`
75
+ - **File:** `speaker_file` (optional) - Reference speaker audio file
76
 
77
+ ### Information Endpoints
78
+ - **GET** `/health` - Check API status, device info, and supported languages
79
+ - **GET** `/languages` - Get list of supported languages
80
- **GET** `/docs` - Interactive API documentation (Swagger UI)
81
 
82
  ## Usage Examples
83
 
84
+ ### Python - C3PO Voice
85
 
86
  ```python
87
  import requests
88
 
89
+ # Generate C3PO speech
90
+ url = "http://localhost:7860/tts-c3po"
91
  data = {
92
+ "text": "Hello there! I am C-3PO, human-cyborg relations.",
93
+ "language": "en"
 
94
  }
95
 
 
96
  response = requests.post(url, data=data)
97
 
 
98
  if response.status_code == 200:
99
+ with open("c3po_speech.wav", "wb") as f:
100
  f.write(response.content)
101
+ print("C3PO speech generated!")
102
  ```
103
 
104
+ ### Python - Custom Voice with C3PO Fallback
105
 
106
  ```python
107
  import requests
108
 
109
+ url = "http://localhost:7860/tts"
 
110
  data = {
111
+ "text": "This will use C3PO voice if no speaker file is provided.",
112
+ "language": "en"
 
113
  }
114
 
115
+ # No speaker_file provided - will use C3PO voice
116
+ response = requests.post(url, data=data)
 
 
117
 
 
118
  if response.status_code == 200:
119
+ with open("speech_output.wav", "wb") as f:
120
  f.write(response.content)
 
121
  ```
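To clone a custom voice through the same `/tts` endpoint, the reference clip is sent as a multipart file upload, as `client_example.py` in this commit does. A minimal sketch (the file name `reference.wav` is only a placeholder for your own recording):

```python
import requests

# Placeholder path to a short (3-10 s) recording of the voice to clone
files = {"speaker_file": open("reference.wav", "rb")}
data = {"text": "This should come out in the uploaded voice.", "language": "en"}

response = requests.post("http://localhost:7860/tts", data=data, files=files)
files["speaker_file"].close()

if response.status_code == 200:
    with open("custom_voice.wav", "wb") as f:
        f.write(response.content)
```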
122
 
123
+ ### Multilingual C3PO
124
 
125
+ ```python
126
+ # C3PO speaking Spanish
127
+ data = {
128
+ "text": "Hola, soy C-3PO. Domino más de seis millones de formas de comunicación.",
129
+ "language": "es"
130
+ }
131
+ response = requests.post("http://localhost:7860/tts-c3po", data=data)
132
  ```
133
 
134
+ ## Supported Languages
135
+
136
+ The C3PO model supports all XTTS-v2 languages:
137
+
138
+ - **en** - English
139
+ - **es** - Spanish
140
+ - **fr** - French
141
+ - **de** - German
142
+ - **it** - Italian
143
+ - **pt** - Portuguese (Brazilian)
144
+ - **pl** - Polish
145
+ - **tr** - Turkish
146
+ - **ru** - Russian
147
+ - **nl** - Dutch
148
+ - **cs** - Czech
149
+ - **ar** - Arabic
150
+ - **zh-cn** - Mandarin Chinese
151
+ - **ja** - Japanese
152
+ - **ko** - Korean
153
+ - **hu** - Hungarian
154
+ - **hi** - Hindi
155
+
156
+ ## Setup
157
+
158
+ ### Hugging Face Spaces Deployment
159
 
160
+ This API is optimized for Hugging Face Spaces with:
161
+ - Automatic C3PO model downloading
162
+ - Proper user permissions (user ID 1000)
163
+ - PyTorch 2.6 compatibility fixes
164
+ - COQUI license agreement handling
165
+
166
+ ### Local Development
167
+
168
+ 1. **Install system dependencies:**
169
  ```bash
170
+ # Ubuntu/Debian
171
+ sudo apt-get install espeak-ng ffmpeg git git-lfs
172
+
173
+ # macOS
174
+ brew install espeak ffmpeg git git-lfs
175
  ```
176
 
177
+ 2. **Install Python dependencies:**
178
+ ```bash
179
+ pip install -r requirements.txt
180
+ python -m unidic download
181
+ ```
182
+
183
+ 3. **Clone C3PO model (optional - auto-downloaded on first run):**
184
+ ```bash
185
+ git clone https://huggingface.co/Borcherding/XTTS-v2_C3PO XTTS-v2_C3PO
186
+ ```
187
 
188
+ 4. **Run the API:**
189
  ```bash
190
+ uvicorn app:app --host 0.0.0.0 --port 7860
191
  ```
192
 
193
+ ### Using Docker
194
 
195
  ```bash
196
+ # Build and run
197
+ docker build -t xtts-c3po-api .
198
+ docker run -p 7860:7860 xtts-c3po-api
199
  ```
200
 
201
+ ## Reference Audio Guidelines
202
+
203
+ For custom voice cloning (a short preparation sketch follows this list):
204
 
205
+ 1. **Duration**: 3-10 seconds of clear speech
206
+ 2. **Quality**: High-quality audio, minimal background noise
207
+ 3. **Format**: WAV format recommended (MP3, M4A also supported)
208
+ 4. **Content**: Natural speech, avoid music or effects
209
+ 5. **Speaker**: Single speaker, clear pronunciation
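A quick way to trim and convert a recording before uploading is `pydub`, which is already listed in `requirements.txt` (the file names below are placeholders, and ffmpeg must be available for non-WAV inputs):

```python
from pydub import AudioSegment

# Load any format ffmpeg understands and keep roughly the first 10 seconds
clip = AudioSegment.from_file("my_recording.m4a")[:10_000]

# Export as a mono WAV reference clip
clip.set_channels(1).export("reference.wav", format="wav")
```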
210
 
211
  ## Model Information
212
 
213
+ - **Base Model**: XTTS-v2
214
+ - **Voice**: C3PO from Star Wars
215
+ - **Source**: [Borcherding/XTTS-v2_C3PO](https://huggingface.co/Borcherding/XTTS-v2_C3PO)
216
+ - **Languages**: 16+ supported
217
+ - **License**: CPML (Coqui Public Model License)
 
218
 
219
  ## Testing
220
 
221
+ Run the test suite:
222
  ```bash
223
+ # Test C3PO model functionality
224
  python test.py
225
+
226
+ # Test API endpoints
227
+ python client_example.py
228
  ```
229
 
230
+ ## Environment Variables
231
+
232
+ Automatically configured (applied in `app.py` as sketched below):
233
+ - `COQUI_TOS_AGREED=1` - Agrees to CPML license
234
+ - `NUMBA_DISABLE_JIT=1` - Disables Numba JIT compilation
235
+
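For reference, `app.py` in this commit applies these at import time, together with an optional `FORCE_CPU` switch that hides the GPU:

```python
import os

# Set before the TTS imports in app.py
os.environ["COQUI_TOS_AGREED"] = "1"
os.environ["NUMBA_DISABLE_JIT"] = "1"

# With FORCE_CPU=true the GPU is hidden and inference falls back to CPU
if os.environ.get("FORCE_CPU", "false").lower() == "true":
    os.environ["CUDA_VISIBLE_DEVICES"] = ""
```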
236
+ ## API Response Examples
237
+
238
+ ### Health Check Response
239
+ ```json
240
+ {
241
+ "status": "healthy",
242
+ "device": "cuda",
243
+ "model": "XTTS-v2 C3PO",
244
+ "default_voice": "C3PO",
245
+ "supported_languages": ["en", "es", "fr", ...]
246
+ }
247
  ```
248
 
249
+ ### Languages Response
250
+ ```json
251
+ {
252
+ "languages": ["en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru", "nl", "cs", "ar", "zh-cn", "ja", "ko", "hu", "hi"]
253
+ }
254
  ```
255
 
256
+ ## Troubleshooting
257
 
258
+ ### PyTorch Loading Issues
259
+ The API includes fixes for PyTorch 2.6's `weights_only=True` default. If you encounter loading issues, ensure the compatibility fix is applied.
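The fix used in `app.py` and `test.py` registers the XTTS config class as a safe global before any checkpoint is loaded:

```python
import torch.serialization
from TTS.tts.configs.xtts_config import XttsConfig

# Allow torch.load with weights_only=True to unpickle the XTTS config
torch.serialization.add_safe_globals([XttsConfig])
```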
260
+
261
+ ### Model Download Issues
262
+ If the C3PO model fails to download:
263
+ 1. Check internet connection
264
+ 2. Verify git and git-lfs are installed
265
+ 3. Manually clone: `git clone https://huggingface.co/Borcherding/XTTS-v2_C3PO XTTS-v2_C3PO`
266
+
267
+ ### Audio Quality Issues
268
+ - Use high-quality reference audio for custom voices
269
+ - Enable `voice_cleanup` for noisy reference audio
270
+ - Ensure reference audio is 3-10 seconds long
271
+
272
+ ### Memory Issues
273
+ - Use CPU mode for lower memory usage: set `CUDA_VISIBLE_DEVICES=""`
274
+ - Reduce text length for batch processing
275
+ - Consider using GPU with sufficient VRAM (4GB+ recommended)
276
+
277
+ ## License
278
 
279
+ This project uses XTTS-v2 which is licensed under the Coqui Public Model License (CPML). The C3PO model is provided by the community. See https://coqui.ai/cpml for license details.
280
 
281
+ ## Credits
282
 
283
+ - **XTTS-v2**: Coqui AI
284
+ - **C3PO Model**: [Borcherding](https://huggingface.co/Borcherding)
285
+ - **Original Character**: C-3PO from Star Wars (Lucasfilm/Disney)
286
 
app.py CHANGED
@@ -1,173 +1,414 @@
1
  # Import configuration first to setup environment
2
  import app_config
3
 
4
- from fastapi import FastAPI, HTTPException, Form
5
- from fastapi.responses import FileResponse
6
- from pydantic import BaseModel
7
- from kokoro import KPipeline
8
- import soundfile as sf
9
- import torch
10
  import os
11
- import tempfile
 
 
12
  import uuid
 
 
 
 
13
  import logging
14
  from typing import Optional
15
16
  # Configure logging
17
  logging.basicConfig(level=logging.INFO)
18
  logger = logging.getLogger(__name__)
19
 
20
- app = FastAPI(title="Kokoro TTS API", description="Text-to-Speech API using Kokoro", version="1.0.0")
21
 
22
  class TTSRequest(BaseModel):
23
  text: str
24
- voice: str = "af_heart"
25
- lang_code: str = "a"
 
26
 
27
- class KokoroTTSService:
28
  def __init__(self):
29
  self.device = "cuda" if torch.cuda.is_available() else "cpu"
30
  logger.info(f"Using device: {self.device}")
31
 
32
- if app_config.is_hf_spaces():
33
- logger.info("Running on Hugging Face Spaces")
34
35
  try:
36
- # Initialize Kokoro pipeline following the working example pattern
37
- logger.info("Initializing Kokoro TTS pipeline...")
38
- self.pipeline = KPipeline(lang_code='a')
39
- logger.info("Kokoro TTS pipeline loaded successfully")
40
- except Exception as e:
41
- logger.error(f"Failed to load Kokoro TTS pipeline: {e}")
42
- raise e
 
 
 
43
 
44
- def generate_speech(self, text: str, voice: str = "af_heart", lang_code: str = "a") -> str:
 
45
  """Generate speech and return the path to the output file"""
46
  try:
47
- # Create a unique filename for the output
48
- output_filename = f"kokoro_output_{uuid.uuid4().hex}.wav"
49
- output_path = os.path.join(app_config.get_temp_dir(), output_filename)
50
 
51
- # Update pipeline language if different
52
- if self.pipeline.lang_code != lang_code:
53
- logger.info(f"Switching language from {self.pipeline.lang_code} to {lang_code}")
54
- self.pipeline = KPipeline(lang_code=lang_code)
 
 
 
 
55
 
56
- # Generate speech using Kokoro (following the working example pattern)
57
- generator = self.pipeline(text, voice=voice)
58
 
59
- # Get the first (and typically only) audio output
60
- for i, (gs, ps, audio) in enumerate(generator):
61
- logger.info(f"Generated audio segment {i}: gs={gs}, ps={ps}")
62
- # Save the audio to file
63
- sf.write(output_path, audio, 24000)
64
- break # Take the first generated audio
65
 
66
  return output_path
 
67
  except Exception as e:
68
  logger.error(f"Error generating speech: {e}")
 
 
69
raise HTTPException(status_code=500, detail=f"Failed to generate speech: {str(e)}")
70
 
71
- def get_available_voices(self):
72
- """Return list of available voices"""
73
- # Extended list based on the working example
74
- return [
75
- "af_heart", "af_bella", "af_nicole", "af_aoede", "af_kore",
76
- "af_sarah", "af_nova", "af_sky", "af_alloy", "af_jessica", "af_river",
77
- "am_michael", "am_fenrir", "am_puck", "am_echo", "am_eric",
78
- "am_liam", "am_onyx", "am_santa", "am_adam",
79
- "bf_emma", "bf_isabella", "bf_alice", "bf_lily",
80
- "bm_george", "bm_fable", "bm_lewis", "bm_daniel"
81
- ]
82
-
83
- # Initialize Kokoro TTS service
84
- tts_service = KokoroTTSService()
85
 
86
  @app.get("/")
87
  async def root():
88
- return {"message": "Kokoro TTS API is running", "status": "healthy"}
89
 
90
  @app.get("/health")
91
  async def health_check():
92
- return {"status": "healthy", "device": tts_service.device}
93
 
94
- @app.get("/voices")
95
- async def get_voices():
96
- """Get list of available voices"""
97
- return {"voices": tts_service.get_available_voices()}
98
 
99
  @app.post("/tts")
100
  async def text_to_speech(
101
  text: str = Form(...),
102
- voice: str = Form("af_heart"),
103
- lang_code: str = Form("a")
 
 
104
  ):
105
  """
106
- Convert text to speech using Kokoro TTS
107
 
108
- - **text**: The text to convert to speech
109
- - **voice**: Voice to use (default: "af_heart")
110
- - **lang_code**: Language code (default: "a" for auto-detect)
 
 
111
  """
112
 
113
  if not text.strip():
114
  raise HTTPException(status_code=400, detail="Text cannot be empty")
115
 
116
- # Validate voice
117
- available_voices = tts_service.get_available_voices()
118
- if voice not in available_voices:
119
- raise HTTPException(
120
- status_code=400,
121
- detail=f"Voice '{voice}' not available. Available voices: {available_voices}"
122
- )
123
 
124
  try:
125
- # Generate speech
126
- output_path = tts_service.generate_speech(text, voice, lang_code)
127
 
128
  # Return the generated audio file
 
129
  return FileResponse(
130
  output_path,
131
  media_type="audio/wav",
132
- filename=f"kokoro_tts_{voice}_{uuid.uuid4().hex}.wav",
133
  headers={"Content-Disposition": "attachment"}
134
  )
135
 
136
except Exception as e:
137
  logger.error(f"Error in TTS endpoint: {e}")
 
 
138
  raise HTTPException(status_code=500, detail=str(e))
139
 
140
  @app.post("/tts-json")
141
- async def text_to_speech_json(request: TTSRequest):
 
 
 
142
  """
143
  Convert text to speech using JSON request body
144
 
145
- - **request**: TTSRequest containing text, voice, and lang_code
 
146
  """
147
 
148
  if not request.text.strip():
149
  raise HTTPException(status_code=400, detail="Text cannot be empty")
150
 
151
- # Validate voice
152
- available_voices = tts_service.get_available_voices()
153
- if request.voice not in available_voices:
154
- raise HTTPException(
155
- status_code=400,
156
- detail=f"Voice '{request.voice}' not available. Available voices: {available_voices}"
157
- )
158
 
159
try:
160
  # Generate speech
161
- output_path = tts_service.generate_speech(request.text, request.voice, request.lang_code)
162
 
163
  # Return the generated audio file
 
164
  return FileResponse(
165
  output_path,
166
  media_type="audio/wav",
167
- filename=f"kokoro_tts_{request.voice}_{uuid.uuid4().hex}.wav",
168
  headers={"Content-Disposition": "attachment"}
169
  )
170
 
171
  except Exception as e:
 
 
 
 
 
 
 
172
logger.error(f"Error in TTS JSON endpoint: {e}")
173
  raise HTTPException(status_code=500, detail=str(e))
 
1
  # Import configuration first to setup environment
2
  import app_config
3
4
  import os
5
+ import sys
6
+ import io
7
+ import subprocess
8
  import uuid
9
+ import time
10
+ import torch
11
+ import torchaudio
12
+ import tempfile
13
  import logging
14
  from typing import Optional
15
 
16
+ # Fix PyTorch weights_only issue for XTTS
17
+ import torch.serialization
18
+ from TTS.tts.configs.xtts_config import XttsConfig
19
+ torch.serialization.add_safe_globals([XttsConfig])
20
+
21
+ # Set environment variables
22
+ os.environ["COQUI_TOS_AGREED"] = "1"
23
+ os.environ["NUMBA_DISABLE_JIT"] = "1"
24
+
25
+ # Force CPU usage if specified
26
+ if os.environ.get("FORCE_CPU", "false").lower() == "true":
27
+ os.environ["CUDA_VISIBLE_DEVICES"] = ""
28
+
29
+ from fastapi import FastAPI, HTTPException, UploadFile, File, Form
30
+ from fastapi.responses import FileResponse
31
+ from pydantic import BaseModel
32
+ import langid
33
+ from scipy.io.wavfile import write
34
+ from pydub import AudioSegment
35
+
36
+ from TTS.api import TTS
37
+ from TTS.tts.configs.xtts_config import XttsConfig
38
+ from TTS.tts.models.xtts import Xtts
39
+ from TTS.utils.generic_utils import get_user_data_dir
40
+
41
  # Configure logging
42
  logging.basicConfig(level=logging.INFO)
43
  logger = logging.getLogger(__name__)
44
 
45
+ app = FastAPI(title="XTTS C3PO API", description="Text-to-Speech API using XTTS-v2 C3PO model", version="1.0.0")
46
 
47
  class TTSRequest(BaseModel):
48
  text: str
49
+ language: str = "en"
50
+ voice_cleanup: bool = False
51
+ no_lang_auto_detect: bool = False
52
 
53
+ class XTTSService:
54
  def __init__(self):
55
  self.device = "cuda" if torch.cuda.is_available() else "cpu"
56
  logger.info(f"Using device: {self.device}")
57
 
58
+ # Use the C3PO model path
59
+ self.model_path = "XTTS-v2_C3PO/"
60
+ self.config_path = "XTTS-v2_C3PO/config.json"
61
+
62
+ # Check if model files exist, if not download them
63
+ if not os.path.exists(self.config_path):
64
+ logger.info("C3PO model not found locally, downloading...")
65
+ self._download_c3po_model()
66
+
67
+ # Load configuration
68
+ config = XttsConfig()
69
+ config.load_json(self.config_path)
70
+
71
+ # Initialize and load model
72
+ self.model = Xtts.init_from_config(config)
73
+ self.model.load_checkpoint(
74
+ config,
75
+ checkpoint_path=os.path.join(self.model_path, "model.pth"),
76
+ vocab_path=os.path.join(self.model_path, "vocab.json"),
77
+ eval=True,
78
+ )
79
+
80
+ if self.device == "cuda":
81
+ self.model.cuda()
82
+
83
+ self.supported_languages = config.languages
84
+ logger.info(f"XTTS C3PO model loaded successfully. Supported languages: {self.supported_languages}")
85
+
86
+ # Set default reference audio (C3PO voice)
87
+ self.default_reference = os.path.join(self.model_path, "reference.wav")
88
+ if not os.path.exists(self.default_reference):
89
+ # Look for any reference audio in the model directory
90
+ for file in os.listdir(self.model_path):
91
+ if file.endswith(('.wav', '.mp3', '.m4a')):
92
+ self.default_reference = os.path.join(self.model_path, file)
93
+ break
94
+ else:
95
+ self.default_reference = None
96
 
97
+ if self.default_reference:
98
+ logger.info(f"Default C3PO reference audio: {self.default_reference}")
99
+ else:
100
+ logger.warning("No default reference audio found in C3PO model directory")
101
+
102
+ def _download_c3po_model(self):
103
+ """Download the C3PO model from Hugging Face"""
104
  try:
105
+ logger.info("Downloading C3PO model from Hugging Face...")
106
+ subprocess.run([
107
+ "git", "clone",
108
+ "https://huggingface.co/Borcherding/XTTS-v2_C3PO",
109
+ "XTTS-v2_C3PO"
110
+ ], check=True)
111
+ logger.info("C3PO model downloaded successfully")
112
+ except subprocess.CalledProcessError as e:
113
+ logger.error(f"Failed to download C3PO model: {e}")
114
+ raise HTTPException(status_code=500, detail="Failed to download C3PO model")
115
 
116
+ def generate_speech(self, text: str, speaker_wav_path: str = None, language: str = "en",
117
+ voice_cleanup: bool = False, no_lang_auto_detect: bool = False) -> str:
118
  """Generate speech and return the path to the output file"""
119
  try:
120
+ # Use default C3PO voice if no speaker file provided
121
+ if speaker_wav_path is None:
122
+ if self.default_reference is None:
123
+ raise HTTPException(status_code=400, detail="No reference audio available. Please upload a speaker file.")
124
+ speaker_wav_path = self.default_reference
125
+ logger.info("Using default C3PO voice")
126
+
127
+ # Validate language
128
+ if language not in self.supported_languages:
129
+ raise HTTPException(status_code=400, detail=f"Language '{language}' not supported. Supported: {self.supported_languages}")
130
+
131
+ # Language detection for longer texts
132
+ if len(text) > 15 and not no_lang_auto_detect:
133
+ language_predicted = langid.classify(text)[0].strip()
134
+ if language_predicted == "zh":
135
+ language_predicted = "zh-cn"
136
+
137
+ if language_predicted != language:
138
+ logger.warning(f"Detected language: {language_predicted}, chosen: {language}")
139
+
140
+ # Text length validation
141
+ if len(text) < 2:
142
+ raise HTTPException(status_code=400, detail="Text too short, please provide longer text")
143
+
144
+ if len(text) > 500: # Increased limit for API
145
+ raise HTTPException(status_code=400, detail="Text too long, maximum 500 characters")
146
+
147
+ # Voice cleanup if requested
148
+ processed_speaker_wav = speaker_wav_path
149
+ if voice_cleanup:
150
+ processed_speaker_wav = self._cleanup_audio(speaker_wav_path)
151
+
152
+ # Generate conditioning latents
153
+ try:
154
+ gpt_cond_latent, speaker_embedding = self.model.get_conditioning_latents(
155
+ audio_path=processed_speaker_wav,
156
+ gpt_cond_len=30,
157
+ gpt_cond_chunk_len=4,
158
+ max_ref_length=60
159
+ )
160
+ except Exception as e:
161
+ logger.error(f"Speaker encoding error: {e}")
162
+ raise HTTPException(status_code=400, detail="Error processing reference audio. Please check the audio file.")
163
+
164
+ # Generate speech
165
+ logger.info("Generating speech...")
166
+ start_time = time.time()
167
 
168
+ out = self.model.inference(
169
+ text,
170
+ language,
171
+ gpt_cond_latent,
172
+ speaker_embedding,
173
+ repetition_penalty=5.0,
174
+ temperature=0.75,
175
+ )
176
 
177
+ inference_time = time.time() - start_time
178
+ logger.info(f"Speech generation completed in {inference_time:.2f} seconds")
179
 
180
+ # Save output
181
+ output_filename = f"xtts_c3po_output_{uuid.uuid4().hex}.wav"
182
+ output_path = os.path.join(tempfile.gettempdir(), output_filename)
183
+
184
+ torchaudio.save(output_path, torch.tensor(out["wav"]).unsqueeze(0), 24000)
 
185
 
186
  return output_path
187
+
188
  except Exception as e:
189
  logger.error(f"Error generating speech: {e}")
190
+ if isinstance(e, HTTPException):
191
+ raise e
192
  raise HTTPException(status_code=500, detail=f"Failed to generate speech: {str(e)}")
193
+
194
+ def _cleanup_audio(self, audio_path: str) -> str:
195
+ """Apply audio cleanup filters"""
196
+ try:
197
+ output_path = audio_path + "_cleaned.wav"
198
+
199
+ # Basic audio cleanup using ffmpeg-python or similar
200
+ # For now, just return the original path
201
+ # You can implement more sophisticated cleanup here
202
+
203
+ return audio_path
204
+ except Exception as e:
205
+ logger.warning(f"Audio cleanup failed: {e}, using original audio")
206
+ return audio_path
207
 
208
+ # Initialize XTTS service
209
+ logger.info("Initializing XTTS C3PO service...")
210
+ tts_service = XTTSService()
211
 
212
  @app.get("/")
213
  async def root():
214
+ return {"message": "XTTS C3PO API is running", "status": "healthy", "model": "C3PO"}
215
 
216
  @app.get("/health")
217
  async def health_check():
218
+ return {
219
+ "status": "healthy",
220
+ "device": tts_service.device,
221
+ "model": "XTTS-v2 C3PO",
222
+ "supported_languages": tts_service.supported_languages,
223
+ "default_voice": "C3PO" if tts_service.default_reference else "None"
224
+ }
225
 
226
+ @app.get("/languages")
227
+ async def get_languages():
228
+ """Get list of supported languages"""
229
+ return {"languages": tts_service.supported_languages}
230
 
231
  @app.post("/tts")
232
  async def text_to_speech(
233
  text: str = Form(...),
234
+ language: str = Form("en"),
235
+ voice_cleanup: bool = Form(False),
236
+ no_lang_auto_detect: bool = Form(False),
237
+ speaker_file: UploadFile = File(None)
238
  ):
239
  """
240
+ Convert text to speech using XTTS C3PO voice cloning
241
 
242
+ - **text**: The text to convert to speech (max 500 characters)
243
+ - **language**: Language code (default: "en")
244
+ - **voice_cleanup**: Apply audio cleanup to reference voice
245
+ - **no_lang_auto_detect**: Disable automatic language detection
246
+ - **speaker_file**: Reference speaker audio file (optional, uses C3PO voice if not provided)
247
  """
248
 
249
  if not text.strip():
250
  raise HTTPException(status_code=400, detail="Text cannot be empty")
251
 
252
+ speaker_temp_path = None
253
 
254
  try:
255
+ # Handle speaker file if provided
256
+ if speaker_file is not None:
257
+ # Validate file type
258
+ if not speaker_file.content_type.startswith('audio/'):
259
+ raise HTTPException(status_code=400, detail="Speaker file must be an audio file")
260
+
261
+ # Save uploaded speaker file temporarily
262
+ speaker_temp_path = os.path.join(tempfile.gettempdir(), f"speaker_{uuid.uuid4().hex}.wav")
263
+
264
+ with open(speaker_temp_path, "wb") as buffer:
265
+ content = await speaker_file.read()
266
+ buffer.write(content)
267
+
268
+ # Generate speech (will use C3PO voice if no speaker file provided)
269
+ output_path = tts_service.generate_speech(
270
+ text,
271
+ speaker_temp_path,
272
+ language,
273
+ voice_cleanup,
274
+ no_lang_auto_detect
275
+ )
276
+
277
+ # Clean up temporary speaker file
278
+ if speaker_temp_path and os.path.exists(speaker_temp_path):
279
+ try:
280
+ os.remove(speaker_temp_path)
281
+ except:
282
+ pass
283
 
284
  # Return the generated audio file
285
+ voice_type = "custom" if speaker_file else "c3po"
286
  return FileResponse(
287
  output_path,
288
  media_type="audio/wav",
289
+ filename=f"xtts_{voice_type}_output_{uuid.uuid4().hex}.wav",
290
  headers={"Content-Disposition": "attachment"}
291
  )
292
 
293
  except Exception as e:
294
+ # Clean up files in case of error
295
+ if speaker_temp_path and os.path.exists(speaker_temp_path):
296
+ try:
297
+ os.remove(speaker_temp_path)
298
+ except:
299
+ pass
300
+
301
  logger.error(f"Error in TTS endpoint: {e}")
302
+ if isinstance(e, HTTPException):
303
+ raise e
304
  raise HTTPException(status_code=500, detail=str(e))
305
 
306
  @app.post("/tts-json")
307
+ async def text_to_speech_json(
308
+ request: TTSRequest,
309
+ speaker_file: UploadFile = File(None)
310
+ ):
311
  """
312
  Convert text to speech using JSON request body
313
 
314
+ - **request**: TTSRequest containing text, language, and options
315
+ - **speaker_file**: Reference speaker audio file (optional, uses C3PO voice if not provided)
316
  """
317
 
318
  if not request.text.strip():
319
  raise HTTPException(status_code=400, detail="Text cannot be empty")
320
 
321
+ speaker_temp_path = None
322
 
323
  try:
324
+ # Handle speaker file if provided
325
+ if speaker_file is not None:
326
+ # Validate file type
327
+ if not speaker_file.content_type.startswith('audio/'):
328
+ raise HTTPException(status_code=400, detail="Speaker file must be an audio file")
329
+
330
+ # Save uploaded speaker file temporarily
331
+ speaker_temp_path = os.path.join(tempfile.gettempdir(), f"speaker_{uuid.uuid4().hex}.wav")
332
+
333
+ with open(speaker_temp_path, "wb") as buffer:
334
+ content = await speaker_file.read()
335
+ buffer.write(content)
336
+
337
  # Generate speech
338
+ output_path = tts_service.generate_speech(
339
+ request.text,
340
+ speaker_temp_path,
341
+ request.language,
342
+ request.voice_cleanup,
343
+ request.no_lang_auto_detect
344
+ )
345
+
346
+ # Clean up temporary speaker file
347
+ if speaker_temp_path and os.path.exists(speaker_temp_path):
348
+ try:
349
+ os.remove(speaker_temp_path)
350
+ except:
351
+ pass
352
 
353
  # Return the generated audio file
354
+ voice_type = "custom" if speaker_file else "c3po"
355
  return FileResponse(
356
  output_path,
357
  media_type="audio/wav",
358
+ filename=f"xtts_{voice_type}_{request.language}_{uuid.uuid4().hex}.wav",
359
  headers={"Content-Disposition": "attachment"}
360
  )
361
 
362
  except Exception as e:
363
+ # Clean up files in case of error
364
+ if speaker_temp_path and os.path.exists(speaker_temp_path):
365
+ try:
366
+ os.remove(speaker_temp_path)
367
+ except:
368
+ pass
369
+
370
  logger.error(f"Error in TTS JSON endpoint: {e}")
371
+ if isinstance(e, HTTPException):
372
+ raise e
373
+ raise HTTPException(status_code=500, detail=str(e))
374
+
375
+ @app.post("/tts-c3po")
376
+ async def text_to_speech_c3po_only(
377
+ text: str = Form(...),
378
+ language: str = Form("en"),
379
+ no_lang_auto_detect: bool = Form(False)
380
+ ):
381
+ """
382
+ Convert text to speech using C3PO voice only (no file upload needed)
383
+
384
+ - **text**: The text to convert to speech (max 500 characters)
385
+ - **language**: Language code (default: "en")
386
+ - **no_lang_auto_detect**: Disable automatic language detection
387
+ """
388
+
389
+ if not text.strip():
390
+ raise HTTPException(status_code=400, detail="Text cannot be empty")
391
+
392
+ try:
393
+ # Generate speech using C3PO voice
394
+ output_path = tts_service.generate_speech(
395
+ text,
396
+ None, # Use default C3PO voice
397
+ language,
398
+ False, # No voice cleanup needed for default voice
399
+ no_lang_auto_detect
400
+ )
401
+
402
+ # Return the generated audio file
403
+ return FileResponse(
404
+ output_path,
405
+ media_type="audio/wav",
406
+ filename=f"c3po_voice_{uuid.uuid4().hex}.wav",
407
+ headers={"Content-Disposition": "attachment"}
408
+ )
409
+
410
+ except Exception as e:
411
+ logger.error(f"Error in C3PO TTS endpoint: {e}")
412
+ if isinstance(e, HTTPException):
413
+ raise e
414
  raise HTTPException(status_code=500, detail=str(e))
client_example.py CHANGED
@@ -1,34 +1,34 @@
1
  import requests
2
- import json
3
 
4
- def test_kokoro_tts_api():
5
- """Example of how to use the Kokoro TTS API"""
6
 
7
- # API endpoint
8
- url = "http://localhost:7860/tts"
9
 
10
- # Text to convert to speech (using the example from the user's request)
11
- text = """
12
- [Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
13
- """
14
 
15
  # Prepare the request data
16
  data = {
17
  "text": text,
18
- "voice": "af_heart", # Available voices: af_heart, af_sky, af_bella, etc.
19
- "lang_code": "a" # 'a' for auto-detect
20
  }
21
 
22
  try:
23
- print("Sending request to Kokoro TTS API...")
 
 
24
  response = requests.post(url, data=data)
25
 
26
  if response.status_code == 200:
27
  # Save the generated audio
28
- output_filename = "kokoro_generated_speech.wav"
29
  with open(output_filename, "wb") as f:
30
  f.write(response.content)
31
- print(f"Success! Generated speech saved as {output_filename}")
32
  else:
33
  print(f"Error: {response.status_code}")
34
  print(response.text)
@@ -38,36 +38,92 @@ def test_kokoro_tts_api():
38
  except Exception as e:
39
  print(f"Error: {e}")
40
 
41
- def test_kokoro_tts_json_api():
42
- """Example of using the JSON endpoint"""
43
 
44
  # API endpoint
45
- url = "http://localhost:7860/tts-json"
46
 
47
  # Text to convert to speech
48
- text = "Hello, this is a test of the Kokoro TTS system using the JSON API endpoint."
49
 
50
- # Prepare the JSON request
51
  data = {
52
  "text": text,
53
- "voice": "af_bella",
54
- "lang_code": "a"
55
  }
56
 
57
- headers = {
58
- "Content-Type": "application/json"
59
  }
60
 
61
  try:
62
- print("Sending JSON request to Kokoro TTS API...")
63
- response = requests.post(url, json=data, headers=headers)
 
 
64
 
65
  if response.status_code == 200:
66
  # Save the generated audio
67
- output_filename = "kokoro_json_speech.wav"
68
  with open(output_filename, "wb") as f:
69
  f.write(response.content)
70
- print(f"Success! Generated speech saved as {output_filename}")
71
  else:
72
  print(f"Error: {response.status_code}")
73
  print(response.text)
@@ -77,16 +133,60 @@ def test_kokoro_tts_json_api():
77
  except Exception as e:
78
  print(f"Error: {e}")
79
 
80
- def get_available_voices():
81
- """Get list of available voices"""
82
  try:
83
- response = requests.get("http://localhost:7860/voices")
84
  if response.status_code == 200:
85
- voices = response.json()
86
- print("Available voices:", voices["voices"])
87
- return voices["voices"]
88
  else:
89
- print("Failed to get voices:", response.status_code)
90
  return []
91
  except requests.exceptions.ConnectionError:
92
  print("API is not running. Start it with: uvicorn app:app --host 0.0.0.0 --port 7860")
@@ -97,7 +197,13 @@ def check_api_health():
97
  try:
98
  response = requests.get("http://localhost:7860/health")
99
  if response.status_code == 200:
100
- print("API is healthy:", response.json())
101
  return True
102
  else:
103
  print("API health check failed:", response.status_code)
@@ -106,26 +212,58 @@ def check_api_health():
106
  print("API is not running. Start it with: uvicorn app:app --host 0.0.0.0 --port 7860")
107
  return False
108
109
  if __name__ == "__main__":
110
- print("Kokoro TTS API Client Example")
111
- print("=" * 35)
112
 
113
  # First check if API is running
114
  if check_api_health():
115
  print()
116
 
117
- # Get available voices
118
- voices = get_available_voices()
119
  print()
120
 
121
- # Test the TTS functionality with form data
122
- print("Testing form-data endpoint...")
123
- test_kokoro_tts_api()
124
  print()
125
 
126
- # Test the TTS functionality with JSON
127
- print("Testing JSON endpoint...")
128
- test_kokoro_tts_json_api()
129
  else:
130
  print("\nPlease start the API server first:")
131
  print("uvicorn app:app --host 0.0.0.0 --port 7860")
 
1
  import requests
2
+ import os
3
 
4
+ def test_c3po_voice():
5
+ """Test the C3PO voice without uploading any files"""
6
 
7
+ # API endpoint for C3PO voice only
8
+ url = "http://localhost:7860/tts-c3po"
9
 
10
+ # Text to convert to speech
11
+ text = "Hello there! I am C-3PO, human-cyborg relations. How may I assist you today?"
 
 
12
 
13
  # Prepare the request data
14
  data = {
15
  "text": text,
16
+ "language": "en",
17
+ "no_lang_auto_detect": False
18
  }
19
 
20
  try:
21
+ print("Testing C3PO voice...")
22
+ print(f"Text: {text}")
23
+
24
  response = requests.post(url, data=data)
25
 
26
  if response.status_code == 200:
27
  # Save the generated audio
28
+ output_filename = "c3po_voice_sample.wav"
29
  with open(output_filename, "wb") as f:
30
  f.write(response.content)
31
+ print(f"Success! C3PO voice sample saved as {output_filename}")
32
  else:
33
  print(f"Error: {response.status_code}")
34
  print(response.text)
 
38
  except Exception as e:
39
  print(f"Error: {e}")
40
 
41
+ def test_xtts_with_custom_voice():
42
+ """Example of using XTTS with custom voice upload"""
43
 
44
  # API endpoint
45
+ url = "http://localhost:7860/tts"
46
 
47
  # Text to convert to speech
48
+ text = "This is a test of XTTS voice cloning with a custom reference voice."
49
+
50
+ # Path to your speaker reference audio file
51
+ speaker_file_path = "reference.wav" # Update this path to your reference audio
52
+
53
+ # Check if speaker file exists
54
+ if not os.path.exists(speaker_file_path):
55
+ print(f"Custom voice test skipped: Speaker file not found at {speaker_file_path}")
56
+ print("To test custom voice cloning:")
57
+ print("1. Record 3-10 seconds of clear speech")
58
+ print("2. Save as 'reference.wav' in this directory")
59
+ print("3. Run this test again")
60
+ return
61
 
62
+ # Prepare the request data
63
  data = {
64
  "text": text,
65
+ "language": "en",
66
+ "voice_cleanup": False,
67
+ "no_lang_auto_detect": False
68
+ }
69
+
70
+ files = {
71
+ "speaker_file": open(speaker_file_path, "rb")
72
  }
73
 
74
+ try:
75
+ print("Testing XTTS with custom voice...")
76
+ print(f"Text: {text}")
77
+ print(f"Speaker file: {speaker_file_path}")
78
+
79
+ response = requests.post(url, data=data, files=files)
80
+
81
+ if response.status_code == 200:
82
+ # Save the generated audio
83
+ output_filename = "custom_voice_clone.wav"
84
+ with open(output_filename, "wb") as f:
85
+ f.write(response.content)
86
+ print(f"Success! Custom voice clone saved as {output_filename}")
87
+ else:
88
+ print(f"Error: {response.status_code}")
89
+ print(response.text)
90
+
91
+ except requests.exceptions.ConnectionError:
92
+ print("Error: Could not connect to the API. Make sure the server is running on http://localhost:7860")
93
+ except Exception as e:
94
+ print(f"Error: {e}")
95
+ finally:
96
+ files["speaker_file"].close()
97
+
98
+ def test_xtts_fallback_to_c3po():
99
+ """Test XTTS endpoint without speaker file (should use C3PO voice)"""
100
+
101
+ # API endpoint
102
+ url = "http://localhost:7860/tts"
103
+
104
+ # Text to convert to speech
105
+ text = "When no custom voice is provided, I will speak in the C3PO voice by default."
106
+
107
+ # Prepare the request data (no speaker file)
108
+ data = {
109
+ "text": text,
110
+ "language": "en",
111
+ "voice_cleanup": False,
112
+ "no_lang_auto_detect": False
113
  }
114
 
115
  try:
116
+ print("Testing XTTS fallback to C3PO voice...")
117
+ print(f"Text: {text}")
118
+
119
+ response = requests.post(url, data=data)
120
 
121
  if response.status_code == 200:
122
  # Save the generated audio
123
+ output_filename = "xtts_c3po_fallback.wav"
124
  with open(output_filename, "wb") as f:
125
  f.write(response.content)
126
+ print(f"Success! XTTS with C3PO fallback saved as {output_filename}")
127
  else:
128
  print(f"Error: {response.status_code}")
129
  print(response.text)
 
133
  except Exception as e:
134
  print(f"Error: {e}")
135
 
136
+ def test_multilingual_c3po():
137
+ """Test C3PO voice in different languages"""
138
+
139
+ # API endpoint for C3PO voice only
140
+ url = "http://localhost:7860/tts-c3po"
141
+
142
+ # Test different languages
143
+ test_cases = [
144
+ ("en", "Hello, I am C-3PO. I am fluent in over six million forms of communication."),
145
+ ("es", "Hola, soy C-3PO. Domino más de seis millones de formas de comunicación."),
146
+ ("fr", "Bonjour, je suis C-3PO. Je maîtrise plus de six millions de formes de communication."),
147
+ ("de", "Hallo, ich bin C-3PO. Ich beherrsche über sechs Millionen Kommunikationsformen."),
148
+ ]
149
+
150
+ for language, text in test_cases:
151
+ data = {
152
+ "text": text,
153
+ "language": language,
154
+ "no_lang_auto_detect": True # Force the specified language
155
+ }
156
+
157
+ try:
158
+ print(f"Testing C3PO voice in {language.upper()}...")
159
+ print(f"Text: {text}")
160
+
161
+ response = requests.post(url, data=data)
162
+
163
+ if response.status_code == 200:
164
+ # Save the generated audio
165
+ output_filename = f"c3po_voice_{language}.wav"
166
+ with open(output_filename, "wb") as f:
167
+ f.write(response.content)
168
+ print(f"Success! C3PO {language} voice saved as {output_filename}")
169
+ else:
170
+ print(f"Error: {response.status_code}")
171
+ print(response.text)
172
+
173
+ except requests.exceptions.ConnectionError:
174
+ print("Error: Could not connect to the API. Make sure the server is running on http://localhost:7860")
175
+ except Exception as e:
176
+ print(f"Error: {e}")
177
+
178
+ print() # Add spacing between tests
179
+
180
+ def get_supported_languages():
181
+ """Get list of supported languages"""
182
  try:
183
+ response = requests.get("http://localhost:7860/languages")
184
  if response.status_code == 200:
185
+ languages = response.json()
186
+ print("Supported languages:", languages["languages"])
187
+ return languages["languages"]
188
  else:
189
+ print("Failed to get languages:", response.status_code)
190
  return []
191
  except requests.exceptions.ConnectionError:
192
  print("API is not running. Start it with: uvicorn app:app --host 0.0.0.0 --port 7860")
 
197
  try:
198
  response = requests.get("http://localhost:7860/health")
199
  if response.status_code == 200:
200
+ health_info = response.json()
201
+ print("API Health Check:")
202
+ print(f" Status: {health_info['status']}")
203
+ print(f" Device: {health_info['device']}")
204
+ print(f" Model: {health_info['model']}")
205
+ print(f" Default Voice: {health_info['default_voice']}")
206
+ print(f" Languages: {len(health_info['supported_languages'])} supported")
207
  return True
208
  else:
209
  print("API health check failed:", response.status_code)
 
212
  print("API is not running. Start it with: uvicorn app:app --host 0.0.0.0 --port 7860")
213
  return False
214
 
215
+ def create_sample_reference():
216
+ """Instructions for creating a reference audio file"""
217
+ print("\n" + "="*50)
218
+ print("REFERENCE AUDIO SETUP")
219
+ print("="*50)
220
+ print("To use XTTS voice cloning, you need a reference audio file:")
221
+ print("1. Record 3-10 seconds of clear speech")
222
+ print("2. Save as WAV format (recommended)")
223
+ print("3. Ensure good audio quality (no background noise)")
224
+ print("4. Place the file in the same directory as this script")
225
+ print("5. Update the 'speaker_file_path' variable in the functions above")
226
+ print("\nExample recording text:")
227
+ print("'Hello, this is my voice. I'm recording this sample for voice cloning.'")
228
+ print("="*50)
229
+
230
  if __name__ == "__main__":
231
+ print("XTTS C3PO API Client Example")
232
+ print("=" * 40)
233
 
234
  # First check if API is running
235
  if check_api_health():
236
  print()
237
 
238
+ # Get supported languages
239
+ languages = get_supported_languages()
240
+ print()
241
+
242
+ # Test C3PO voice (no file upload needed)
243
+ print("1. Testing C3PO voice (no upload required)...")
244
+ test_c3po_voice()
245
  print()
246
 
247
+ # Test XTTS fallback to C3PO
248
+ print("2. Testing XTTS endpoint without speaker file (C3PO fallback)...")
249
+ test_xtts_fallback_to_c3po()
250
  print()
251
 
252
+ # Test custom voice if reference file exists
253
+ print("3. Testing custom voice cloning...")
254
+ test_xtts_with_custom_voice()
255
+ print()
256
+
257
+ # Test multilingual C3PO
258
+ print("4. Testing multilingual C3PO voice...")
259
+ test_multilingual_c3po()
260
+
261
+ print("All tests completed!")
262
+ print("\nGenerated files:")
263
+ for file in os.listdir("."):
264
+ if file.endswith(".wav") and ("c3po" in file or "custom" in file or "xtts" in file):
265
+ print(f" - {file}")
266
+
267
  else:
268
  print("\nPlease start the API server first:")
269
  print("uvicorn app:app --host 0.0.0.0 --port 7860")
requirements.txt CHANGED
@@ -1,7 +1,17 @@
1
- kokoro>=0.9.2
2
- soundfile
 
 
 
 
 
 
 
 
3
  fastapi
4
  uvicorn[standard]
5
- python-multipart
6
  torch
7
- torchaudio
 
 
 
 
1
+ TTS @ git+https://github.com/coqui-ai/TTS@v0.21.1
2
+ pydantic==1.10.13
3
+ python-multipart==0.0.6
4
+ typing-extensions>=4.8.0
5
+ cutlet
6
+ mecab-python3==1.0.6
7
+ unidic-lite==1.0.8
8
+ unidic==1.1.0
9
+ langid
10
+ pydub
11
  fastapi
12
  uvicorn[standard]
 
13
  torch
14
+ torchaudio
15
+ soundfile
16
+ scipy
17
+ numpy
test.py CHANGED
@@ -1,28 +1,145 @@
1
import os
2
 
3
- # Set basic environment variables
 
4
  os.environ['NUMBA_DISABLE_JIT'] = '1'
5
 
6
- from kokoro import KPipeline
7
- import soundfile as sf
8
- import torch
9
 
10
- # Initialize Kokoro pipeline
11
- pipeline = KPipeline(lang_code='a')
12
 
13
  # Text to convert to speech
14
- text = '''
15
- [Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
16
- '''
17
 
18
- # Generate speech using Kokoro
19
- generator = pipeline(text, voice='af_heart')
20
 
21
- # Process and save the generated audio
22
- for i, (gs, ps, audio) in enumerate(generator):
23
- print(f"Segment {i}: gs={gs}, ps={ps}")
24
- # Save each segment as a separate file
25
- sf.write(f'{i}.wav', audio, 24000)
26
- print(f"Saved segment {i} as {i}.wav")
27
 
28
- print("Speech generation completed!")
1
  import os
2
+ import torch
3
+ import torchaudio
4
+ import subprocess
5
+
6
+ # Fix PyTorch weights_only issue for XTTS
7
+ import torch.serialization
8
+ from TTS.tts.configs.xtts_config import XttsConfig
9
+ torch.serialization.add_safe_globals([XttsConfig])
10
 
11
+ # Set environment variables
12
+ os.environ['COQUI_TOS_AGREED'] = '1'
13
  os.environ['NUMBA_DISABLE_JIT'] = '1'
14
 
15
+ from TTS.api import TTS
16
+ from TTS.tts.configs.xtts_config import XttsConfig
17
+ from TTS.tts.models.xtts import Xtts
18
+ from TTS.utils.generic_utils import get_user_data_dir
19
+
20
+ print("Testing XTTS C3PO voice cloning...")
21
+
22
+ # C3PO model path
23
+ model_path = "XTTS-v2_C3PO/"
24
+ config_path = "XTTS-v2_C3PO/config.json"
25
 
26
+ # Check if model files exist, if not download them
27
+ if not os.path.exists(config_path):
28
+ print("C3PO model not found locally, downloading...")
29
+ try:
30
+ subprocess.run([
31
+ "git", "clone",
32
+ "https://huggingface.co/Borcherding/XTTS-v2_C3PO",
33
+ "XTTS-v2_C3PO"
34
+ ], check=True)
35
+ print("C3PO model downloaded successfully")
36
+ except subprocess.CalledProcessError as e:
37
+ print(f"Failed to download C3PO model: {e}")
38
+ exit(1)
39
+
40
+ # Load configuration
41
+ config = XttsConfig()
42
+ config.load_json(config_path)
43
+
44
+ # Initialize and load model
45
+ model = Xtts.init_from_config(config)
46
+ model.load_checkpoint(
47
+ config,
48
+ checkpoint_path=os.path.join(model_path, "model.pth"),
49
+ vocab_path=os.path.join(model_path, "vocab.json"),
50
+ eval=True,
51
+ )
52
+
53
+ device = "cuda" if torch.cuda.is_available() else "cpu"
54
+ if device == "cuda":
55
+ model.cuda()
56
+
57
+ print(f"C3PO model loaded on {device}")
58
 
59
  # Text to convert to speech
60
+ text = "Hello there! I am C-3PO, human-cyborg relations. How may I assist you today?"
61
+
62
+ # Look for reference audio in the C3PO model directory
63
+ reference_audio_path = None
64
+ for file in os.listdir(model_path):
65
+ if file.endswith(('.wav', '.mp3', '.m4a')):
66
+ reference_audio_path = os.path.join(model_path, file)
67
+ print(f"Found C3PO reference audio: {file}")
68
+ break
69
 
70
+ # If no reference audio found, create a simple test reference
71
+ if reference_audio_path is None:
72
+ print("No reference audio found in C3PO model, creating test reference...")
73
+ reference_audio_path = "test_reference.wav"
74
+
75
+ # Generate a simple sine wave as placeholder
76
+ import numpy as np
77
+ sample_rate = 24000
78
+ duration = 3 # seconds
79
+ frequency = 440 # Hz
80
+ t = np.linspace(0, duration, int(sample_rate * duration))
81
+ audio_data = 0.3 * np.sin(2 * np.pi * frequency * t)
82
+
83
+ # Save as WAV
84
+ torchaudio.save(reference_audio_path, torch.tensor(audio_data).unsqueeze(0), sample_rate)
85
+ print(f"Test reference audio created: {reference_audio_path}")
86
 
87
+ try:
88
+ # Generate conditioning latents
89
+ print("Processing reference audio...")
90
+ gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
91
+ audio_path=reference_audio_path,
92
+ gpt_cond_len=30,
93
+ gpt_cond_chunk_len=4,
94
+ max_ref_length=60
95
+ )
96
+
97
+ # Generate speech
98
+ print("Generating C3PO speech...")
99
+ out = model.inference(
100
+ text,
101
+ "en", # language
102
+ gpt_cond_latent,
103
+ speaker_embedding,
104
+ repetition_penalty=5.0,
105
+ temperature=0.75,
106
+ )
107
+
108
+ # Save output
109
+ output_path = "c3po_test_output.wav"
110
+ torchaudio.save(output_path, torch.tensor(out["wav"]).unsqueeze(0), 24000)
111
+ print(f"C3PO speech generated successfully! Saved as: {output_path}")
112
+
113
+ # Test multilingual capabilities
114
+ print("\nTesting multilingual C3PO...")
115
+ multilingual_tests = [
116
+ ("es", "Hola, soy C-3PO. Domino más de seis millones de formas de comunicación."),
117
+ ("fr", "Bonjour, je suis C-3PO. Je maîtrise plus de six millions de formes de communication."),
118
+ ("de", "Hallo, ich bin C-3PO. Ich beherrsche über sechs Millionen Kommunikationsformen."),
119
+ ]
120
+
121
+ for lang, test_text in multilingual_tests:
122
+ print(f"Generating {lang.upper()} speech...")
123
+ out = model.inference(
124
+ test_text,
125
+ lang,
126
+ gpt_cond_latent,
127
+ speaker_embedding,
128
+ repetition_penalty=5.0,
129
+ temperature=0.75,
130
+ )
131
+
132
+ output_path = f"c3po_test_{lang}.wav"
133
+ torchaudio.save(output_path, torch.tensor(out["wav"]).unsqueeze(0), 24000)
134
+ print(f"C3PO {lang.upper()} speech saved as: {output_path}")
135
+
136
+ except Exception as e:
137
+ print(f"Error during speech generation: {e}")
138
+ import traceback
139
+ traceback.print_exc()
140
 
141
+ print("XTTS C3PO test completed!")
142
+ print("\nGenerated files:")
143
+ for file in os.listdir("."):
144
+ if file.startswith("c3po_test") and file.endswith(".wav"):
145
+ print(f" - {file}")