Spaces: DroolingPanda/teachingAssistant

Michael Hu committed on May 3

Commit 60bd17d

1 Parent(s): 27972f7

refactor tts

Files changed (6)

README.md +77 -0
utils/tts.py +45 -122
utils/tts_base.py +1 -7
utils/tts_factory.py +1 -67
utils/tts_kokoro.py +106 -0
utils/tts_kokoro_space.py +100 -0

README.md CHANGED

@@ -10,3 +10,80 @@ pinned: false
 ---
 Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Speech Recognition Module Refactoring
+## Overview
+The speech recognition module (`utils/stt.py`) has been refactored to support multiple ASR (Automatic Speech Recognition) models. The implementation now follows a factory pattern that allows easy switching between different speech recognition models while maintaining a consistent interface.
+## Supported Models
+### 1. Whisper (Default)
+- Based on OpenAI's Whisper Large-v3 model
+- High accuracy for general speech recognition
+- No additional installation required
+### 2. Parakeet
+- NVIDIA's Parakeet-TDT-0.6B model
+- Optimized for real-time transcription
+- Requires additional installation (see below)
+## Installation
+### For Parakeet Support
+To use the Parakeet model, you need to install the NeMo Toolkit:
+```bash
+pip install -U 'nemo_toolkit[asr]'
+```
+Alternatively, you can use the provided requirements file:
+```bash
+pip install -r requirements-parakeet.txt
+```
+## Usage
+### In the Web Application
+The web application now includes a dropdown menu to select the ASR model. Simply choose your preferred model before uploading an audio file.
+### Programmatic Usage
+```python
+from utils.stt import transcribe_audio
+# Using the default Whisper model
+text = transcribe_audio("path/to/audio.wav")
+# Using the Parakeet model
+text = transcribe_audio("path/to/audio.wav", model_name="parakeet")
+```
+### Direct Model Access
+For more advanced usage, you can directly access the model classes:
+```python
+from utils.stt import ASRFactory
+# Get a specific model instance
+whisper_model = ASRFactory.get_model("whisper")
+parakeet_model = ASRFactory.get_model("parakeet")
+# Use the model directly
+text = whisper_model.transcribe("path/to/audio.wav")
+```
+## Architecture
+The refactored code follows these design patterns:
+1. **Abstract Base Class**: `ASRModel` defines the interface for all speech recognition models
+2. **Factory Pattern**: `ASRFactory` creates the appropriate model instance based on the requested model name
+3. **Strategy Pattern**: Different model implementations can be swapped at runtime
+This architecture makes it easy to add support for additional ASR models in the future.
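
The three patterns listed in the README's Architecture section can be sketched as follows. This is a minimal illustration, not the contents of `utils/stt.py`: the names `ASRModel`, `ASRFactory.get_model`, `transcribe`, and `transcribe_audio` come from the README, while `FakeWhisper`, `FakeParakeet`, and the registry dictionary are hypothetical stand-ins that load no real model.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class ASRModel(ABC):
    """Interface that every speech recognition backend implements."""

    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the transcript for the audio file at audio_path."""


class FakeWhisper(ASRModel):  # hypothetical stand-in, loads no real model
    def transcribe(self, audio_path: str) -> str:
        return f"[whisper] {audio_path}"


class FakeParakeet(ASRModel):  # hypothetical stand-in, loads no real model
    def transcribe(self, audio_path: str) -> str:
        return f"[parakeet] {audio_path}"


class ASRFactory:
    """Maps a model name to a constructor and builds instances on request."""

    # Assumed registry layout; adding a backend means adding one entry here.
    _registry: Dict[str, Callable[[], ASRModel]] = {
        "whisper": FakeWhisper,
        "parakeet": FakeParakeet,
    }

    @classmethod
    def get_model(cls, model_name: str = "whisper") -> ASRModel:
        try:
            return cls._registry[model_name.lower()]()
        except KeyError:
            raise ValueError(f"Unknown ASR model: {model_name!r}") from None


def transcribe_audio(audio_path: str, model_name: str = "whisper") -> str:
    """Strategy selection at call time: the caller only names the model."""
    return ASRFactory.get_model(model_name).transcribe(audio_path)
```

Because callers depend only on `ASRModel.transcribe`, swapping the backend is a one-argument change, which is the property the README attributes to the strategy pattern.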

utils/tts.py CHANGED

@@ -3,130 +3,53 @@ import logging
 # Configure logging
 logger = logging.getLogger(__name__)
-# Import from the new factory pattern implementation
-from utils.tts_factory import get_tts_engine, generate_speech, TTSFactory
-from utils.tts_engines import get_available_engines
-# For backward compatibility
-from utils.tts_engines import KOKORO_AVAILABLE, KOKORO_SPACE_AVAILABLE, DIA_AVAILABLE
-# Backward compatibility class
-class TTSEngine:
-    """Legacy TTSEngine class for backward compatibility
-    This class is maintained for backward compatibility with existing code.
-    New code should use the factory pattern implementation directly.
-    """
-    def __init__(self, lang_code='z'):
-        """Initialize TTS Engine using the factory pattern
-        Args:
-            lang_code (str): Language code ('a' for US English, 'b' for British English,
-                           'j' for Japanese, 'z' for Mandarin Chinese)
-        """
-        logger.info("Initializing legacy TTSEngine wrapper")
-        logger.info(f"Available engines - Kokoro: {KOKORO_AVAILABLE}, Dia: {DIA_AVAILABLE}")
-        # Create the appropriate engine using the factory
-        self._engine = TTSFactory.create_engine(lang_code=lang_code)
-        # Set engine_type for backward compatibility
-        engine_class = self._engine.__class__.__name__
-        if 'Kokoro' in engine_class and 'Space' in engine_class:
-            self.engine_type = "kokoro_space"
-        elif 'Kokoro' in engine_class:
-            self.engine_type = "kokoro"
-        elif 'Dia' in engine_class:
-            self.engine_type = "dia"
-        else:
-            self.engine_type = "dummy"
-        # Set pipeline and client attributes for backward compatibility
-        self.pipeline = getattr(self._engine, 'pipeline', None)
-        self.client = getattr(self._engine, 'client', None)
-        logger.info(f"Legacy TTSEngine wrapper initialized with engine type: {self.engine_type}")
-    def generate_speech(self, text: str, voice: str = 'af_heart', speed: float = 1.0) -> str:
-        """Generate speech from text using available TTS engine
-        Args:
-            text (str): Input text to synthesize
-            voice (str): Voice ID to use (e.g., 'af_heart', 'af_bella', etc.)
-            speed (float): Speech speed multiplier (0.5 to 2.0)
-        Returns:
-            str: Path to the generated audio file
-        """
-        logger.info(f"Legacy TTSEngine wrapper calling generate_speech for text length: {len(text)}")
-        return self._engine.generate_speech(text, voice, speed)
-    def generate_speech_stream(self, text: str, voice: str = 'af_heart', speed: float = 1.0):
-        """Generate speech from text and yield each segment
-        Args:
-            text (str): Input text to synthesize
-            voice (str): Voice ID to use
-            speed (float): Speech speed multiplier
-        Yields:
-            tuple: (sample_rate, audio_data) pairs for each segment
-        """
-        logger.info(f"Legacy TTSEngine wrapper calling generate_speech_stream for text length: {len(text)}")
-        yield from self._engine.generate_speech_stream(text, voice, speed)
-    # For backward compatibility
-    def _generate_dummy_audio(self, output_path):
-        """Generate a dummy audio file with a simple sine wave (backward compatibility)
-        Args:
-            output_path (str): Path to save the dummy audio file
-        Returns:
-            str: Path to the generated dummy audio file
-        """
-        from utils.tts_base import DummyTTSEngine
-        dummy_engine = DummyTTSEngine()
-        return dummy_engine.generate_speech("", "", 1.0)
-    # For backward compatibility
-    def _generate_dummy_audio_stream(self):
-        """Generate dummy audio chunks (backward compatibility)
-        Yields:
-            tuple: (sample_rate, audio_data) pairs for each dummy segment
-        """
-        from utils.tts_base import DummyTTSEngine
-        dummy_engine = DummyTTSEngine()
-        yield from dummy_engine.generate_speech_stream("", "", 1.0)
-# Import the new implementations from tts_base
-# These functions are already defined in tts_base.py and imported at the top of this file
-# They are kept here as comments for reference
-# def get_tts_engine(lang_code='a'):
-#     """Get or create TTS engine instance
-#
-#     Args:
-#         lang_code (str): Language code for the pipeline
-#
-#     Returns:
-#         TTSEngineBase: Initialized TTS engine instance
-#     """
-#     # Implementation moved to tts_base.py
-#     pass
-# def generate_speech(text: str, voice: str = 'af_heart', speed: float = 1.0) -> str:
-#     """Public interface for TTS generation
-#
-#     Args:
-#         text (str): Input text to synthesize
-#         voice (str): Voice ID to use
-#         speed (float): Speech speed multiplier
-#
-#     Returns:
-#         str: Path to generated audio file
-#     "\"""
-#     # Implementation moved to tts_base.py
-#     pass

+# Import the factory pattern implementation
+from utils.tts_factory import TTSFactory
+# Import base classes
+from utils.tts_base import TTSEngineBase, DummyTTSEngine
+# Import engine-specific modules
+from utils.tts_engines import (
+    get_available_engines,
+    create_engine,
+    KokoroTTSEngine,
+    KokoroSpaceTTSEngine,
+    DiaTTSEngine
+)
+# Import legacy functions for backward compatibility
+from utils.tts_kokoro import generate_speech as kokoro_generate_speech
+from utils.tts_kokoro_space import generate_speech as kokoro_space_generate_speech
+from utils.tts_dia import generate_speech as dia_generate_speech
+# Convenience function to get the best available TTS engine
+def get_best_engine(lang_code: str = 'z') -> TTSEngineBase:
+    """Get the best available TTS engine
+    Args:
+        lang_code (str): Language code for the engine
+    Returns:
+        TTSEngineBase: An instance of the best available TTS engine
+    """
+    return TTSFactory.create_engine(None, lang_code)
+# Legacy function for backward compatibility
+def generate_speech(text: str, language: str = "z", voice: str = "af_heart", speed: float = 1.0) -> str:
+    """Generate speech using the best available TTS engine
+    This is a legacy function maintained for backward compatibility.
+    New code should use the factory pattern implementation directly.
+    Args:
+        text (str): Input text to synthesize
+        language (str): Language code
+        voice (str): Voice ID to use
+        speed (float): Speech speed multiplier
+    Returns:
+        str: Path to the generated audio file
+    """
+    engine = get_best_engine(language)
+    return engine.generate_speech(text, voice, speed)

utils/tts_base.py CHANGED

@@ -143,10 +143,4 @@ class DummyTTSEngine(TTSEngineBase):
             t = np.linspace(0, duration, int(sample_rate * duration), False)
             freq = 440 + (i * 220)  # Different frequency for each chunk
             tone = np.sin(2 * np.pi * freq * t) * 0.3
-            yield sample_rate, tone
-# Factory functionality moved to tts_factory.py to avoid circular imports
-# Note: Backward compatibility functions moved to tts_factory.py

+            yield sample_rate, tone
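
The context lines of this hunk show how the dummy engine builds its placeholder audio: one sine tone per chunk, starting at 440 Hz and rising by 220 Hz each time, scaled to an amplitude of 0.3. A standard-library sketch of the same arithmetic is given below. The chunk count, duration, and sample rate defaults are assumptions, since the hunk does not show them, and the original uses NumPy arrays rather than Python lists.

```python
import math
from typing import Iterator, List, Tuple


def dummy_tone_stream(
    chunks: int = 3,           # assumed; not visible in the hunk
    duration: float = 1.0,     # assumed; seconds per chunk
    sample_rate: int = 24000,  # assumed; matches DEFAULT_SAMPLE_RATE elsewhere
) -> Iterator[Tuple[int, List[float]]]:
    """Yield (sample_rate, samples) pairs, one sine tone per chunk."""
    n = int(sample_rate * duration)
    for i in range(chunks):
        freq = 440 + (i * 220)  # different frequency for each chunk
        # Sample k sits at time k * duration / n, which is what
        # np.linspace(0, duration, n, False) produces in the original.
        tone = [
            0.3 * math.sin(2 * math.pi * freq * (k * duration / n))
            for k in range(n)
        ]
        yield sample_rate, tone
```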

utils/tts_factory.py CHANGED

@@ -49,70 +49,4 @@ class TTSFactory:
         # Fall back to dummy engine
         logger.warning("No TTS engines available, falling back to dummy engine")
-        return DummyTTSEngine(lang_code)
-# Backward compatibility function
-def get_tts_engine(lang_code: str = 'a') -> TTSEngineBase:
-    """Get or create TTS engine instance (backward compatibility function)
-    Args:
-        lang_code (str): Language code for the pipeline
-    Returns:
-        TTSEngineBase: Initialized TTS engine instance
-    """
-    logger.info(f"Requesting TTS engine with language code: {lang_code}")
-    try:
-        import streamlit as st
-        logger.info("Streamlit detected, using cached TTS engine")
-        @st.cache_resource
-        def _get_engine():
-            logger.info("Creating cached TTS engine instance")
-            engine = TTSFactory.create_engine(lang_code=lang_code)
-            logger.info(f"Cached TTS engine created with type: {engine.__class__.__name__}")
-            return engine
-        engine = _get_engine()
-        logger.info(f"Retrieved TTS engine from cache with type: {engine.__class__.__name__}")
-        return engine
-    except ImportError:
-        logger.info("Streamlit not available, creating direct TTS engine instance")
-        engine = TTSFactory.create_engine(lang_code=lang_code)
-        logger.info(f"Direct TTS engine created with type: {engine.__class__.__name__}")
-        return engine
-# Backward compatibility function
-def generate_speech(text: str, voice: str = 'af_heart', speed: float = 1.0) -> str:
-    """Public interface for TTS generation (backward compatibility function)
-    Args:
-        text (str): Input text to synthesize
-        voice (str): Voice ID to use
-        speed (float): Speech speed multiplier
-    Returns:
-        str: Path to generated audio file
-    """
-    logger.info(f"Public generate_speech called with text length: {len(text)}, voice: {voice}, speed: {speed}")
-    try:
-        # Get the TTS engine
-        logger.info("Getting TTS engine instance")
-        engine = get_tts_engine()
-        logger.info(f"Using TTS engine type: {engine.__class__.__name__}")
-        # Generate speech
-        logger.info("Calling engine.generate_speech")
-        output_path = engine.generate_speech(text, voice, speed)
-        logger.info(f"Speech generation complete, output path: {output_path}")
-        return output_path
-    except Exception as e:
-        logger.error(f"Error in public generate_speech function: {str(e)}", exc_info=True)
-        logger.error(f"Error type: {type(e).__name__}")
-        if hasattr(e, '__traceback__'):
-            tb = e.__traceback__
-            while tb.tb_next:
-                tb = tb.tb_next
-            logger.error(f"Error occurred in file: {tb.tb_frame.f_code.co_filename}, line {tb.tb_lineno}")
-        raise

+        return DummyTTSEngine(lang_code)
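
Only the tail of `TTSFactory.create_engine` is visible in this hunk, so the selection logic that precedes the dummy fallback is not shown. One way such a "best available engine, else dummy" factory could work is sketched below. Everything except the final warning message and the `DummyTTSEngine(lang_code)` return is an assumption: the `registry` and `priority` parameters, `UnavailableEngine`, and `WorkingEngine` are hypothetical names introduced for illustration.

```python
import logging
from typing import Callable, Dict, Optional, Sequence

logger = logging.getLogger(__name__)


class TTSEngineBase:
    """Minimal stand-in for the base class in utils/tts_base.py."""

    def __init__(self, lang_code: str = "z"):
        self.lang_code = lang_code


class DummyTTSEngine(TTSEngineBase):
    """Stand-in for the dummy engine used as the last resort."""


class UnavailableEngine(TTSEngineBase):  # hypothetical: dependency missing
    def __init__(self, lang_code: str = "z"):
        raise ImportError("backend not installed")


class WorkingEngine(TTSEngineBase):  # hypothetical: initializes fine
    pass


def create_engine(
    engine_type: Optional[str],
    lang_code: str,
    registry: Dict[str, Callable[[str], TTSEngineBase]],
    priority: Sequence[str],
) -> TTSEngineBase:
    """Return the requested engine, or the first one in priority that loads."""
    candidates = [engine_type] if engine_type else list(priority)
    for name in candidates:
        constructor = registry.get(name)
        if constructor is None:
            continue
        try:
            return constructor(lang_code)
        except Exception as exc:  # a broken backend should not stop the search
            logger.warning("Engine %s unavailable: %s", name, exc)
    logger.warning("No TTS engines available, falling back to dummy engine")
    return DummyTTSEngine(lang_code)
```

Passing `None` as the engine type, as `get_best_engine` does in `utils/tts.py`, would then mean "walk the priority list".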

utils/tts_kokoro.py ADDED

@@ -0,0 +1,106 @@

+import os
+import time
+import logging
+import numpy as np
+import soundfile as sf
+from typing import Optional, Tuple, Generator
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+# Constants
+DEFAULT_SAMPLE_RATE = 24000
+# Global model instance (lazy loaded)
+_pipeline = None
+def _get_pipeline(lang_code: str = 'z'):
+    """Lazy-load the Kokoro pipeline to avoid loading it until needed"""
+    global _pipeline
+    if _pipeline is None:
+        logger.info("Loading Kokoro pipeline...")
+        try:
+            # Import Kokoro
+            from kokoro import KPipeline
+            # Initialize the pipeline
+            logger.info(f"Initializing Kokoro pipeline with language code: {lang_code}")
+            _pipeline = KPipeline(lang_code=lang_code)
+            # Log pipeline details
+            logger.info("Kokoro pipeline loaded successfully")
+            logger.info(f"Pipeline type: {type(_pipeline).__name__}")
+        except ImportError as import_err:
+            logger.error(f"Import error loading Kokoro pipeline: {import_err}")
+            logger.error("This may indicate missing dependencies")
+            raise
+        except Exception as e:
+            logger.error(f"Error loading Kokoro pipeline: {e}", exc_info=True)
+            logger.error(f"Error type: {type(e).__name__}")
+            raise
+    return _pipeline
+def generate_speech(text: str, language: str = "z", voice: str = "af_heart", speed: float = 1.0) -> str:
+    """Public interface for TTS generation using Kokoro model
+    This is a legacy function maintained for backward compatibility.
+    New code should use the factory pattern implementation directly.
+    Args:
+        text (str): Input text to synthesize
+        language (str): Language code ('a' for US English, 'b' for British English,
+                      'j' for Japanese, 'z' for Mandarin Chinese)
+        voice (str): Voice ID to use (e.g., 'af_heart', 'af_bella', etc.)
+        speed (float): Speech speed multiplier (0.5 to 2.0)
+    Returns:
+        str: Path to the generated audio file
+    """
+    logger.info(f"Legacy Kokoro generate_speech called with text length: {len(text)}")
+    # Use the new implementation via factory pattern
+    from utils.tts_engines import KokoroTTSEngine
+    try:
+        # Create a Kokoro engine and generate speech
+        kokoro_engine = KokoroTTSEngine(language)
+        return kokoro_engine.generate_speech(text, voice, speed)
+    except Exception as e:
+        logger.error(f"Error in legacy Kokoro generate_speech: {str(e)}", exc_info=True)
+        # Fall back to dummy TTS
+        from utils.tts_base import DummyTTSEngine
+        dummy_engine = DummyTTSEngine()
+        return dummy_engine.generate_speech(text)
+def generate_speech_stream(text: str, language: str = "z", voice: str = "af_heart", speed: float = 1.0) -> Generator[Tuple[int, np.ndarray], None, None]:
+    """Generate speech stream using Kokoro TTS engine
+    Args:
+        text (str): Input text to synthesize
+        language (str): Language code
+        voice (str): Voice ID to use
+        speed (float): Speech speed multiplier
+    Yields:
+        tuple: (sample_rate, audio_data) pairs for each segment
+    """
+    logger.info(f"Generating speech stream with Kokoro for text length: {len(text)}")
+    try:
+        # Get the Kokoro pipeline
+        pipeline = _get_pipeline(language)
+        # Generate speech stream
+        generator = pipeline(text, voice=voice, speed=speed)
+        for _, _, audio in generator:
+            yield DEFAULT_SAMPLE_RATE, audio
+    except Exception as e:
+        logger.error(f"Error in Kokoro generate_speech_stream: {str(e)}", exc_info=True)
+        # Fall back to dummy TTS
+        from utils.tts_base import DummyTTSEngine
+        dummy_engine = DummyTTSEngine()
+        yield from dummy_engine.generate_speech_stream(text)
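
A note on `_get_pipeline` in this file: `_pipeline` is a single module-level global, so the `lang_code` argument only takes effect on the first call. A later call with a different language code returns the pipeline that was created first. If per-language pipelines are wanted, one possible variation is to cache by language code, as sketched below. `FakePipeline` is a hypothetical stand-in for `kokoro.KPipeline`, and the `loader` parameter is an assumption added so the sketch runs without Kokoro installed.

```python
from typing import Callable, Dict


class FakePipeline:  # hypothetical stand-in for kokoro.KPipeline
    def __init__(self, lang_code: str):
        self.lang_code = lang_code


# One lazily created pipeline per language code instead of a single global.
_pipelines: Dict[str, FakePipeline] = {}


def get_pipeline(
    lang_code: str = "z",
    loader: Callable[[str], FakePipeline] = FakePipeline,
) -> FakePipeline:
    """Create the pipeline for lang_code on first use, then reuse it."""
    if lang_code not in _pipelines:
        _pipelines[lang_code] = loader(lang_code)
    return _pipelines[lang_code]
```

The trade-off is memory: each language keeps its own loaded pipeline alive, which may or may not be acceptable on a small Space.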

utils/tts_kokoro_space.py ADDED

@@ -0,0 +1,100 @@

+import os
+import time
+import logging
+import numpy as np
+import soundfile as sf
+from typing import Optional, Tuple, Generator
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+# Constants
+DEFAULT_SAMPLE_RATE = 24000
+# Global client instance (lazy loaded)
+_client = None
+def _get_client():
+    """Lazy-load the Kokoro Space client to avoid loading it until needed"""
+    global _client
+    if _client is None:
+        logger.info("Loading Kokoro Space client...")
+        try:
+            # Import gradio client
+            from gradio_client import Client
+            # Initialize the client
+            logger.info("Initializing Kokoro Space client")
+            _client = Client("Remsky/Kokoro-TTS-Zero")
+            # Log client details
+            logger.info("Kokoro Space client loaded successfully")
+            logger.info(f"Client type: {type(_client).__name__}")
+        except ImportError as import_err:
+            logger.error(f"Import error loading Kokoro Space client: {import_err}")
+            logger.error("This may indicate missing dependencies")
+            raise
+        except Exception as e:
+            logger.error(f"Error loading Kokoro Space client: {e}", exc_info=True)
+            logger.error(f"Error type: {type(e).__name__}")
+            raise
+    return _client
+def generate_speech(text: str, language: str = "z", voice: str = "af_nova", speed: float = 1.0) -> str:
+    """Public interface for TTS generation using Kokoro Space
+    This is a legacy function maintained for backward compatibility.
+    New code should use the factory pattern implementation directly.
+    Args:
+        text (str): Input text to synthesize
+        language (str): Language code (not used in Kokoro Space, kept for API compatibility)
+        voice (str): Voice ID to use (e.g., 'af_nova', 'af_bella', etc.)
+        speed (float): Speech speed multiplier (0.5 to 2.0)
+    Returns:
+        str: Path to the generated audio file
+    """
+    logger.info(f"Legacy Kokoro Space generate_speech called with text length: {len(text)}")
+    # Use the new implementation via factory pattern
+    from utils.tts_engines import KokoroSpaceTTSEngine
+    try:
+        # Create a Kokoro Space engine and generate speech
+        kokoro_space_engine = KokoroSpaceTTSEngine(language)
+        return kokoro_space_engine.generate_speech(text, voice, speed)
+    except Exception as e:
+        logger.error(f"Error in legacy Kokoro Space generate_speech: {str(e)}", exc_info=True)
+        # Fall back to dummy TTS
+        from utils.tts_base import DummyTTSEngine
+        dummy_engine = DummyTTSEngine()
+        return dummy_engine.generate_speech(text)
+def _create_output_dir() -> str:
+    """Create output directory for audio files
+    Returns:
+        str: Path to the output directory
+    """
+    output_dir = "temp/outputs"
+    os.makedirs(output_dir, exist_ok=True)
+    return output_dir
+def _generate_output_path(prefix: str = "output") -> str:
+    """Generate a unique output path for audio files
+    Args:
+        prefix (str): Prefix for the output filename
+    Returns:
+        str: Path to the output file
+    """
+    output_dir = _create_output_dir()
+    timestamp = int(time.time())
+    return f"{output_dir}/{prefix}_{timestamp}.wav"
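
`_generate_output_path` derives the filename from `int(time.time())`, which has one-second resolution, so two calls with the same prefix inside the same second produce the same path. A possible variation that keeps the timestamp but adds a random suffix is sketched below. The `output_dir` parameter and the eight-character suffix length are assumptions made for illustration.

```python
import os
import time
import uuid


def generate_output_path(prefix: str = "output", output_dir: str = "temp/outputs") -> str:
    """Return a .wav path that stays unique across calls in the same second."""
    os.makedirs(output_dir, exist_ok=True)
    timestamp = int(time.time())
    suffix = uuid.uuid4().hex[:8]  # random component guards against collisions
    return os.path.join(output_dir, f"{prefix}_{timestamp}_{suffix}.wav")
```

`tempfile.NamedTemporaryFile(suffix=".wav", delete=False)` from the standard library would be another way to get a unique path, at the cost of less readable filenames.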