XY Tokenizer
XY Tokenizer is a speech codec that simultaneously models the semantic and acoustic aspects of speech, converting audio into discrete tokens and decoding them back to high-quality audio. It achieves an efficient speech representation at only 1kbps, using RVQ8 quantization at a 12.5Hz frame rate.
Features
- Dual-channel modeling: Simultaneously captures semantic meaning and acoustic details
- Efficient representation: 1kbps bitrate with RVQ8 quantization at 12.5Hz
- High-quality audio tokenization: Converts speech to discrete tokens and back with minimal quality loss
- Long audio support: Processes audio files longer than 30 seconds using chunking with overlap (see the sketch after this list)
- Batch processing: Efficiently processes multiple audio files in batches
- 24kHz output: Generates high-quality audio at 24kHz
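The chunking strategy is not described beyond the bullet above, so the sketch below only illustrates the generic chunk-with-overlap pattern: the 30-second chunk length, 1-second overlap, and the `codec` callable are placeholders, not values or names taken from this codebase.

```python
import numpy as np

def reconstruct_long_audio(wav, sr, codec, chunk_s=30.0, overlap_s=1.0):
    """Generic chunk-with-overlap loop; sizes and `codec` are illustrative placeholders."""
    chunk = int(chunk_s * sr)
    overlap = int(overlap_s * sr)
    hop = chunk - overlap
    pieces = []
    for start in range(0, len(wav), hop):
        segment = wav[start:start + chunk]
        rebuilt = codec(segment)  # placeholder: tokenize + decode one chunk
        # Drop the overlapping head of every chunk after the first so the pieces tile the signal.
        pieces.append(rebuilt if start == 0 else rebuilt[overlap:])
        if start + chunk >= len(wav):
            break
    return np.concatenate(pieces)
```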
Installation
```bash
# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer

# Install dependencies
pip install -r requirements.txt
```
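Before running inference, it can help to confirm that the environment sees a GPU, since inference defaults to `--device cuda`. The check below assumes `requirements.txt` installs PyTorch, which the `.ckpt` checkpoint suggests but the dependency list shown here does not confirm.

```python
# Quick environment check (run inside the activated xy_tokenizer environment).
# Assumes requirements.txt installs PyTorch; adjust if the dependency list differs.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())  # inference defaults to --device cuda
```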
Usage
Basic Inference
To tokenize audio files and reconstruct them:
```bash
python inference.py \
    --config_path ./config/xy_tokenizer_config.yaml \
    --checkpoint_path ./weights/xy_tokenizer.ckpt \
    --input_dir ./input_wavs/ \
    --output_dir ./output_wavs/
```
Parameters
- `--config_path`: Path to the model configuration file
- `--checkpoint_path`: Path to the pre-trained model checkpoint
- `--input_dir`: Directory containing input WAV files
- `--output_dir`: Directory to save reconstructed audio files
- `--device`: Device to run inference on (default: "cuda")
- `--debug`, `--debug_ip`, `--debug_port`: Debugging options (disabled by default)
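For programmatic use, something along the lines of the sketch below may work, but it is only a guess at the API: the module path follows the project layout in the next section, while `load_from_config`, `encode`, and `decode` are hypothetical method names not documented in this README, so treat `inference.py` as the authoritative example.

```python
# Hypothetical usage sketch -- inference.py above is the documented entry point.
# load_from_config, encode, and decode are assumed names, NOT confirmed by this
# README; check xy_tokenizer/model.py for the real API.
import torchaudio
from xy_tokenizer.model import XY_Tokenizer  # module path follows the project layout below

codec = XY_Tokenizer.load_from_config(                   # assumed constructor
    config_path="./config/xy_tokenizer_config.yaml",
    checkpoint_path="./weights/xy_tokenizer.ckpt",
).to("cuda").eval()

wav, sr = torchaudio.load("./input_wavs/example.wav")    # example.wav is a placeholder name
codes = codec.encode(wav.to("cuda"))                     # assumed: discrete RVQ8 tokens at 12.5Hz
recon = codec.decode(codes)                              # assumed: 24kHz waveform reconstruction
torchaudio.save("./output_wavs/example.wav", recon.cpu(), 24000)
```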
Project Structure
- `xy_tokenizer/`: Core model implementation
  - `model.py`: Main XY_Tokenizer model class
  - `nn/`: Neural network components
- `config/`: Configuration files
- `utils/`: Utility functions
- `weights/`: Pre-trained model weights
- `input_wavs/`: Directory for input audio files
- `output_wavs/`: Directory for output audio files
Model Architecture
XY Tokenizer uses a dual-channel architecture that simultaneously models:
- Semantic Channel: Captures high-level semantic information and linguistic content
- Acoustic Channel: Preserves detailed acoustic features including speaker characteristics and prosody
The model processes audio through several stages:
- Feature extraction (mel-spectrogram)
- Parallel semantic and acoustic encoding
- Residual Vector Quantization (RVQ8) at a 12.5Hz frame rate, yielding 1kbps (see the bitrate breakdown below)
- Decoding and waveform generation
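The advertised 1kbps bitrate follows directly from the quantizer settings: 8 residual codebooks emitting tokens at 12.5 frames per second leaves 10 bits per codebook, i.e. 1024 entries each. The arithmetic below spells this out; note that the 1024-entry codebook size is inferred from those numbers rather than stated in this README.

```python
# Bitrate check for RVQ8 at 12.5Hz (codebook size inferred, not documented).
frame_rate_hz = 12.5        # frames per second
num_quantizers = 8          # RVQ8: 8 residual codebooks
target_bitrate_bps = 1000   # 1kbps

bits_per_frame = target_bitrate_bps / frame_rate_hz    # 80 bits per frame
bits_per_codebook = bits_per_frame / num_quantizers    # 10 bits per codebook
codebook_size = 2 ** int(bits_per_codebook)            # 1024 entries (inferred)

# One second of speech becomes 12.5 * 8 = 100 discrete tokens.
tokens_per_second = frame_rate_hz * num_quantizers

print(bits_per_frame, bits_per_codebook, codebook_size, tokens_per_second)  # 80.0 10.0 1024 100.0
```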
License
[Specify your license here]