XY Tokenizer
XY Tokenizer is a speech codec that simultaneously models the semantic and acoustic aspects of speech, converting audio into discrete tokens and decoding them back to high-quality audio. It achieves an efficient speech representation at only 1kbps, using RVQ8 quantization at a 12.5Hz frame rate.
Features
- Dual-channel modeling: Simultaneously captures semantic meaning and acoustic details
- Efficient representation: 1kbps bitrate with RVQ8 quantization at 12.5Hz
- High-quality audio tokenization: Converts speech to discrete tokens and back with minimal quality loss
- Long audio support: Processes audio files longer than 30 seconds using chunking with overlap (see the sketch after this list)
- Batch processing: Efficiently processes multiple audio files in batches
- 24kHz output: Generates high-quality audio at 24kHz
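The chunking strategy is not described beyond the bullet above, so the sketch below only illustrates the generic chunk-with-overlap pattern: the 30-second chunk length, 1-second overlap, and the `codec` callable are placeholders, not values or names taken from this codebase.

```python
import numpy as np

def reconstruct_long_audio(wav, sr, codec, chunk_s=30.0, overlap_s=1.0):
    """Generic chunk-with-overlap loop; sizes and `codec` are illustrative placeholders."""
    chunk = int(chunk_s * sr)
    overlap = int(overlap_s * sr)
    hop = chunk - overlap
    pieces = []
    for start in range(0, len(wav), hop):
        segment = wav[start:start + chunk]
        rebuilt = codec(segment)  # placeholder: tokenize + decode one chunk
        # Drop the overlapping head of every chunk after the first so the pieces tile the signal.
        pieces.append(rebuilt if start == 0 else rebuilt[overlap:])
        if start + chunk >= len(wav):
            break
    return np.concatenate(pieces)
```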
Installation
```bash
# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer

# Install dependencies
pip install -r requirements.txt
```
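Before running inference, it can help to confirm that the environment sees a GPU, since inference defaults to `--device cuda`. The check below assumes `requirements.txt` installs PyTorch, which the `.ckpt` checkpoint suggests but the dependency list shown here does not confirm.

```python
# Quick environment check (run inside the activated xy_tokenizer environment).
# Assumes requirements.txt installs PyTorch; adjust if the dependency list differs.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # inference defaults to --device cuda
```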
Usage
Basic Inference
To tokenize audio files and reconstruct them:
```bash
python inference.py \
    --config_path ./config/xy_tokenizer_config.yaml \
    --checkpoint_path ./weights/xy_tokenizer.ckpt \
    --input_dir ./input_wavs/ \
    --output_dir ./output_wavs/
```
Parameters
- `--config_path`: Path to the model configuration file
- `--checkpoint_path`: Path to the pre-trained model checkpoint
- `--input_dir`: Directory containing input WAV files
- `--output_dir`: Directory to save reconstructed audio files
- `--device`: Device to run inference on (default: "cuda")
- `--debug`, `--debug_ip`, `--debug_port`: Debugging options (disabled by default)
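For programmatic use, something along the lines of the sketch below may work, but it is only a guess at the API: the module path follows the project layout in the next section, while `load_from_config`, `encode`, and `decode` are hypothetical method names not documented in this README, so treat `inference.py` as the authoritative example.

```python
# Hypothetical usage sketch -- inference.py above is the documented entry point.
# load_from_config, encode, and decode are assumed names, NOT confirmed by this
# README; check xy_tokenizer/model.py for the real API.
import torchaudio
from xy_tokenizer.model import XY_Tokenizer  # module path follows the project layout below

codec = XY_Tokenizer.load_from_config(                   # assumed constructor
    config_path="./config/xy_tokenizer_config.yaml",
    checkpoint_path="./weights/xy_tokenizer.ckpt",
).to("cuda").eval()

wav, sr = torchaudio.load("./input_wavs/example.wav")    # example.wav is a placeholder name
codes = codec.encode(wav.to("cuda"))                     # assumed: discrete RVQ8 tokens at 12.5Hz
recon = codec.decode(codes)                              # assumed: 24kHz waveform reconstruction
torchaudio.save("./output_wavs/example.wav", recon.cpu(), 24000)
```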
Project Structure
- `xy_tokenizer/`: Core model implementation
  - `model.py`: Main XY_Tokenizer model class
  - `nn/`: Neural network components
- `config/`: Configuration files
- `utils/`: Utility functions
- `weights/`: Pre-trained model weights
- `input_wavs/`: Directory for input audio files
- `output_wavs/`: Directory for output audio files
Model Architecture
XY Tokenizer uses a dual-channel architecture that simultaneously models:
- Semantic Channel: Captures high-level semantic information and linguistic content
- Acoustic Channel: Preserves detailed acoustic features including speaker characteristics and prosody
The model processes audio through several stages:
- Feature extraction (mel-spectrogram)
- Parallel semantic and acoustic encoding
- Residual Vector Quantization (RVQ8) at a 12.5Hz frame rate, yielding 1kbps (see the bitrate breakdown below)
- Decoding and waveform generation
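The advertised 1kbps bitrate follows directly from the quantizer settings: 8 residual codebooks emitting tokens at 12.5 frames per second leaves 10 bits per codebook, i.e. 1024 entries each. The arithmetic below spells this out; note that the 1024-entry codebook size is inferred from those numbers rather than stated in this README.

```python
# Bitrate check for RVQ8 at 12.5Hz (codebook size inferred, not documented).
frame_rate_hz = 12.5        # frames per second
num_quantizers = 8          # RVQ8: 8 residual codebooks
target_bitrate_bps = 1000   # 1kbps

bits_per_frame = target_bitrate_bps / frame_rate_hz    # 80 bits per frame
bits_per_codebook = bits_per_frame / num_quantizers    # 10 bits per codebook
codebook_size = 2 ** int(bits_per_codebook)            # 1024 entries (inferred)

# One second of speech becomes 12.5 * 8 = 100 discrete tokens.
tokens_per_second = frame_rate_hz * num_quantizers

print(bits_per_frame, bits_per_codebook, codebook_size, tokens_per_second)  # 80.0 10.0 1024 100.0
```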
License
[Specify your license here]