# XY Tokenizer

XY Tokenizer is a speech codec that models the semantic and acoustic aspects of speech simultaneously, converting audio into discrete tokens and decoding them back to high-quality audio. It represents speech at only 1kbps, using RVQ8 quantization at a 12.5Hz frame rate.
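
As a rough check of the 1kbps figure, the bitrate of a residual vector quantizer is the number of codebooks times the frame rate times the bits per code. The codebook size is not stated in this README; the sketch below assumes 1024 entries (10 bits) per codebook, which is the value that makes RVQ8 at 12.5Hz come out to exactly 1kbps. The function name is illustrative only.

```python
import math


def rvq_bitrate_bps(num_quantizers: int, frame_rate_hz: float, codebook_size: int) -> float:
    """Bits per second = codes per frame * frames per second * bits per code."""
    bits_per_code = math.log2(codebook_size)
    return num_quantizers * frame_rate_hz * bits_per_code


# RVQ8 at 12.5Hz. codebook_size=1024 is an assumption, not a value from this README.
bitrate = rvq_bitrate_bps(num_quantizers=8, frame_rate_hz=12.5, codebook_size=1024)
print(f"{bitrate:.0f} bps")  # 1000 bps
```

Under the same assumption, one second of speech is represented by 12.5 frames of 8 codes each, or 100 codes per second.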

## Features

- **Dual-channel modeling**: Simultaneously captures semantic meaning and acoustic details
- **Efficient representation**: 1kbps bitrate with RVQ8 quantization at 12.5Hz
- **High-quality audio tokenization**: Convert speech to discrete tokens and back with minimal quality loss
- **Long audio support**: Process audio files longer than 30 seconds using chunking with overlap
- **Batch processing**: Efficiently process multiple audio files in batches
- **24kHz output**: Generate high-quality 24kHz audio output
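
The long audio support listed above relies on chunking with overlap. This README does not say how chunks are sized or merged, so the following is only a minimal sketch of the boundary computation. The helper `chunk_bounds`, the 1 second overlap, and the 24kHz sample rate used in the example are assumptions for illustration, not values taken from the codebase.

```python
def chunk_bounds(num_samples: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks covering the input."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    hop = chunk_size - overlap  # each chunk starts `hop` samples after the previous one
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk_size, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += hop
    return bounds


sample_rate = 24_000  # assumed for the example
bounds = chunk_bounds(
    num_samples=70 * sample_rate,  # a 70 second clip
    chunk_size=30 * sample_rate,   # 30 second chunks
    overlap=1 * sample_rate,       # 1 second overlap (hypothetical)
)
for start, end in bounds:
    print(f"{start / sample_rate:.0f}s - {end / sample_rate:.0f}s")
```

With these numbers the 70 second clip is split into three chunks (0s to 30s, 29s to 59s, 58s to 70s). How the overlapping regions are blended after decoding is a separate design choice that this sketch does not cover.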

## Installation

```bash
# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer

# Install dependencies
pip install -r requirements.txt
```

## Usage

### Basic Inference

To tokenize audio files and reconstruct them:

```bash
python inference.py \
    --config_path ./config/xy_tokenizer_config.yaml \
    --checkpoint_path ./weights/xy_tokenizer.ckpt \
    --input_dir ./input_wavs/ \
    --output_dir ./output_wavs/
```

### Parameters

- `--config_path`: Path to the model configuration file
- `--checkpoint_path`: Path to the pre-trained model checkpoint
- `--input_dir`: Directory containing input WAV files
- `--output_dir`: Directory to save reconstructed audio files
- `--device`: Device to run inference on (default: "cuda")
- `--debug`, `--debug_ip`, `--debug_port`: Debugging options (disabled by default)
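
When inference needs to be scripted, for example to run over several input directories, the documented flags can be assembled from Python. The helpers `build_inference_command` and `run_inference` below are hypothetical and not part of this repository; they only use the flags listed above and assume the script is launched from the repository root.

```python
import shlex
import subprocess
import sys


def build_inference_command(config_path, checkpoint_path, input_dir, output_dir, device="cuda"):
    """Assemble the inference.py command line from the documented flags."""
    return [
        sys.executable, "inference.py",
        "--config_path", config_path,
        "--checkpoint_path", checkpoint_path,
        "--input_dir", input_dir,
        "--output_dir", output_dir,
        "--device", device,
    ]


def run_inference(**kwargs):
    """Run inference.py and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_inference_command(**kwargs), check=True)


cmd = build_inference_command(
    config_path="./config/xy_tokenizer_config.yaml",
    checkpoint_path="./weights/xy_tokenizer.ckpt",
    input_dir="./input_wavs/",
    output_dir="./output_wavs/",
)
print(shlex.join(cmd))  # shows the command without running it
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces. Call `run_inference(...)` with the same keyword arguments to actually launch the script.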

## Project Structure

- `xy_tokenizer/`: Core model implementation
  - `model.py`: Main XY_Tokenizer model class
  - `nn/`: Neural network components
- `config/`: Configuration files
- `utils/`: Utility functions
- `weights/`: Pre-trained model weights
- `input_wavs/`: Directory for input audio files
- `output_wavs/`: Directory for output audio files

## Model Architecture

XY Tokenizer uses a dual-channel architecture that simultaneously models:
1. **Semantic Channel**: Captures high-level semantic information and linguistic content
2. **Acoustic Channel**: Preserves detailed acoustic features including speaker characteristics and prosody

The model processes audio through several stages:
1. Feature extraction (mel-spectrogram)
2. Parallel semantic and acoustic encoding
3. Residual Vector Quantization (RVQ8) at 12.5Hz frame rate (1kbps)
4. Decoding and waveform generation
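
To make the quantization stage more concrete, here is a toy illustration of generic residual vector quantization: each stage picks the codeword nearest to what the previous stages left unexplained, and decoding sums the chosen codewords. This is not the model's implementation. The codebooks are hand-written, only two stages are used instead of eight, and all function names are illustrative.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage, returning one code index per codebook."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # The next stage only sees what this stage failed to represent.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct a vector by summing the selected codeword from each stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks (assumed values): a coarse stage followed by a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]
frame = [1.2, 0.9]                     # one encoded frame (2-D for readability)
tokens = rvq_encode(codebooks, frame)  # one index per stage
recon = rvq_decode(codebooks, tokens)
print(tokens, recon)  # [1, 1] [1.25, 1.0]
```

With RVQ8, the same loop would run over eight codebooks, so every frame at 12.5Hz yields eight code indices.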

## License

[Specify your license here]