XY Tokenizer

XY Tokenizer is a speech codec that models the semantic and acoustic aspects of speech jointly: it converts audio into discrete tokens and decodes them back into high-quality audio. It represents speech at only 1 kbps, using 8-layer residual vector quantization (RVQ8) at a 12.5 Hz frame rate.
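
The 1 kbps figure follows from the frame rate and the number of quantizer layers, reading RVQ8 as 8 codebooks per frame. A quick check, assuming each codebook holds 1024 entries (10 bits per code); the codebook size is not stated in this README and is an assumption chosen because it makes the numbers agree:

```python
import math

def bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second for an RVQ codec: frames/s * codes/frame * bits/code."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code

# 12.5 Hz and 8 codebooks come from the text; 1024 entries is an assumption.
print(bitrate_bps(12.5, 8, 1024))  # 1000.0
```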

Features

  • Dual-channel modeling: Simultaneously captures semantic meaning and acoustic details
  • Efficient representation: 1 kbps bitrate with RVQ8 quantization at 12.5 Hz
  • High-quality audio tokenization: Convert speech to discrete tokens and back with minimal quality loss
  • Long audio support: Process audio files longer than 30 seconds using chunking with overlap
  • Batch processing: Efficiently process multiple audio files in batches
  • 24 kHz output: Generates audio at a 24 kHz sample rate
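
The long-audio feature can be pictured as slicing the waveform into windows of at most 30 seconds that share a short overlap with their neighbors. The sketch below only computes the window boundaries. The function name `chunk_bounds`, the 1-second overlap, and the 16 kHz input rate are illustrative assumptions, not values taken from this repository.

```python
def chunk_bounds(num_samples, sample_rate, chunk_s=30.0, overlap_s=1.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(chunk_s * sample_rate)
    overlap = int(overlap_s * sample_rate)
    if chunk <= 0 or not 0 <= overlap < chunk:
        raise ValueError("need chunk > 0 and 0 <= overlap < chunk")
    hop = chunk - overlap  # distance between consecutive chunk starts
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # last chunk reached the end of the audio
            break
        start += hop
    return bounds

# 70 s of audio at an assumed 16 kHz sample rate
for start, end in chunk_bounds(70 * 16000, 16000):
    print(start / 16000, end / 16000)
```

How the overlapping regions are merged after decoding (cross-fading, or keeping only one side) is a separate design choice that this sketch leaves open.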

Installation

# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer

# Install dependencies
pip install -r requirements.txt

Usage

Basic Inference

To tokenize audio files and reconstruct them:

python inference.py \
  --config_path ./config/xy_tokenizer_config.yaml \
  --checkpoint_path ./weights/xy_tokenizer.ckpt \
  --input_dir ./input_wavs/ \
  --output_dir ./output_wavs/

Parameters

  • --config_path: Path to the model configuration file
  • --checkpoint_path: Path to the pre-trained model checkpoint
  • --input_dir: Directory containing input WAV files
  • --output_dir: Directory to save reconstructed audio files
  • --device: Device to run inference on (default: "cuda")
  • --debug, --debug_ip, --debug_port: Debugging options (disabled by default)
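
When driving the script from Python, for instance to loop over several input directories, the command line above can be assembled programmatically and launched with `subprocess`. This is a minimal sketch that uses only the flags listed here; the helper names `build_inference_cmd` and `run_inference` are hypothetical and not part of the repository.

```python
import shlex
import subprocess
import sys

def build_inference_cmd(input_dir, output_dir,
                        config_path="./config/xy_tokenizer_config.yaml",
                        checkpoint_path="./weights/xy_tokenizer.ckpt",
                        device="cuda"):
    """Assemble the argv list for inference.py from the documented flags."""
    return [
        sys.executable, "inference.py",
        "--config_path", config_path,
        "--checkpoint_path", checkpoint_path,
        "--input_dir", input_dir,
        "--output_dir", output_dir,
        "--device", device,
    ]

def run_inference(input_dir, output_dir, **kwargs):
    """Run inference.py and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_inference_cmd(input_dir, output_dir, **kwargs), check=True)

# Show the command that would be executed, without running it.
print(shlex.join(build_inference_cmd("./input_wavs/", "./output_wavs/")))
```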

Project Structure

  • xy_tokenizer/: Core model implementation
    • model.py: Main XY_Tokenizer model class
    • nn/: Neural network components
  • config/: Configuration files
  • utils/: Utility functions
  • weights/: Pre-trained model weights
  • input_wavs/: Directory for input audio files
  • output_wavs/: Directory for output audio files

Model Architecture

XY Tokenizer uses a dual-channel architecture that simultaneously models:

  1. Semantic Channel: Captures high-level semantic information and linguistic content
  2. Acoustic Channel: Preserves detailed acoustic features including speaker characteristics and prosody

The model processes audio through several stages:

  1. Feature extraction (mel-spectrogram)
  2. Parallel semantic and acoustic encoding
  3. Residual Vector Quantization (RVQ8) at a 12.5 Hz frame rate (1 kbps)
  4. Decoding and waveform generation
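
Stage 3 is the step that turns continuous features into discrete tokens. The toy sketch below illustrates generic residual vector quantization: each layer picks the nearest codebook entry for whatever the previous layers left unexplained, so one frame yields one index per layer (8 for RVQ8). It uses tiny hand-written 2-D codebooks with two layers and is not XY Tokenizer's implementation; all names and values are illustrative.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(codebooks, vec):
    """Quantize vec layer by layer; each layer encodes the remaining residual."""
    residual = list(vec)
    codes = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Reconstruct the vector by summing the selected entry from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for cb, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Two toy layers: a coarse codebook followed by a fine one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
codes = rvq_encode(codebooks, [1.2, 1.0])
print(codes, rvq_decode(codebooks, codes))  # [1, 1] [1.25, 1.0]
```

Adding layers shrinks the residual further, which is why more codebooks give a closer reconstruction at the cost of a higher bitrate.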

License

[Specify your license here]