# XY Tokenizer
XY Tokenizer is a speech codec that models the semantic and acoustic aspects of speech at the same time, converting audio into discrete tokens and decoding those tokens back into high-quality audio. It represents speech at only 1 kbps, using 8-layer residual vector quantization (RVQ8) at a 12.5 Hz frame rate.
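As a quick sanity check on the bitrate figure, the arithmetic can be sketched as follows. The codebook size of 1024 entries (10 bits per code) is an assumption, not something this README states; it is simply the size at which 8 equal codebooks at 12.5 Hz come out to 1 kbps. The function name `rvq_bitrate_bps` is hypothetical.

```python
import math


def rvq_bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second of an RVQ token stream whose codebooks all have the same size."""
    bits_per_code = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_code


# codebook_size=1024 is an assumption chosen to match the stated 1 kbps figure.
bitrate = rvq_bitrate_bps(frame_rate_hz=12.5, num_codebooks=8, codebook_size=1024)
print(f"{bitrate:.0f} bps")
```

Under the same assumption, one second of speech corresponds to 12.5 frames of 8 codes each, or 100 tokens.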
## Features
- **Dual-channel modeling**: Simultaneously captures semantic meaning and acoustic details
- **Efficient representation**: 1kbps bitrate with RVQ8 quantization at 12.5Hz
- **High-quality audio tokenization**: Convert speech to discrete tokens and back with minimal quality loss
- **Long audio support**: Process audio files longer than 30 seconds using chunking with overlap
- **Batch processing**: Efficiently process multiple audio files in batches
- **24kHz output**: Reconstruct audio at a 24kHz sample rate
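The long audio support listed above can be pictured as splitting a recording into overlapping windows. The sketch below only computes window boundaries. The 30-second window comes from the feature list, while the 1-second overlap, the 16 kHz input rate, and the name `chunk_spans` are illustrative assumptions rather than values taken from the repository.

```python
def chunk_spans(num_samples, sample_rate, chunk_seconds=30.0, overlap_seconds=1.0):
    """Return (start, end) sample index pairs that cover the audio with overlap."""
    if num_samples <= 0:
        return []
    chunk = int(chunk_seconds * sample_rate)
    overlap = int(overlap_seconds * sample_rate)
    if overlap >= chunk:
        raise ValueError("overlap must be shorter than the chunk length")
    hop = chunk - overlap  # distance between consecutive chunk starts
    spans = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        spans.append((start, end))
        if end == num_samples:
            break
        start += hop
    return spans


# Hypothetical 70-second clip at an assumed 16 kHz input sample rate.
spans = chunk_spans(num_samples=70 * 16000, sample_rate=16000)
for start, end in spans:
    print(start, end)
```

Each chunk would then be tokenized on its own, with the overlapping region used to stitch neighbouring chunks together; how the actual implementation merges the overlap is not described here.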
## Installation
```bash
# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer
# Install dependencies
pip install -r requirements.txt
```
## Usage
### Basic Inference
To tokenize audio files and reconstruct them:
```bash
python inference.py \
--config_path ./config/xy_tokenizer_config.yaml \
--checkpoint_path ./weights/xy_tokenizer.ckpt \
--input_dir ./input_wavs/ \
--output_dir ./output_wavs/
```
### Parameters
- `--config_path`: Path to the model configuration file
- `--checkpoint_path`: Path to the pre-trained model checkpoint
- `--input_dir`: Directory containing input WAV files
- `--output_dir`: Directory to save reconstructed audio files
- `--device`: Device to run inference on (default: "cuda")
- `--debug`, `--debug_ip`, `--debug_port`: Debugging options (disabled by default)
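For orientation, the flags above could be declared with `argparse` roughly as follows. This is a sketch, not the contents of `inference.py`: which arguments are required, and the types and defaults of the debug options, are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (sketch, not the real script)."""
    parser = argparse.ArgumentParser(description="Tokenize and reconstruct WAV files")
    parser.add_argument("--config_path", required=True, help="Model configuration file")
    parser.add_argument("--checkpoint_path", required=True, help="Pre-trained checkpoint")
    parser.add_argument("--input_dir", required=True, help="Directory of input WAV files")
    parser.add_argument("--output_dir", required=True, help="Directory for reconstructed audio")
    parser.add_argument("--device", default="cuda", help="Device to run inference on")
    # Assumed shapes for the debug options; the README only says they are off by default.
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--debug_ip", default=None)
    parser.add_argument("--debug_port", type=int, default=None)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args([
    "--config_path", "./config/xy_tokenizer_config.yaml",
    "--checkpoint_path", "./weights/xy_tokenizer.ckpt",
    "--input_dir", "./input_wavs/",
    "--output_dir", "./output_wavs/",
])
print(args.device, args.debug)
```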
## Project Structure
- `xy_tokenizer/`: Core model implementation
  - `model.py`: Main XY_Tokenizer model class
  - `nn/`: Neural network components
- `config/`: Configuration files
- `utils/`: Utility functions
- `weights/`: Pre-trained model weights
- `input_wavs/`: Directory for input audio files
- `output_wavs/`: Directory for output audio files
## Model Architecture
XY Tokenizer uses a dual-channel architecture that simultaneously models:
1. **Semantic Channel**: Captures high-level semantic information and linguistic content
2. **Acoustic Channel**: Preserves detailed acoustic features including speaker characteristics and prosody
The model processes audio through several stages:
1. Feature extraction (mel-spectrogram)
2. Parallel semantic and acoustic encoding
3. Residual Vector Quantization (RVQ8) at 12.5Hz frame rate (1kbps)
4. Decoding and waveform generation
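To make the quantization stage concrete, here is a toy residual vector quantizer in plain Python. It illustrates the general RVQ idea only: the codebooks are random and untrained, the dimensions are tiny, and the function names are hypothetical, so it should not be read as the model's actual quantizer.

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


rng = random.Random(0)
dim, num_stages, codebook_size = 4, 8, 16  # toy sizes; only the 8 stages mirror RVQ8
codebooks = [
    [[rng.gauss(0.0, 1.0 / 2 ** stage) for _ in range(dim)] for _ in range(codebook_size)]
    for stage in range(num_stages)
]
frame = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stands in for one encoder frame
codes = rvq_encode(frame, codebooks)
recon = rvq_decode(codes, codebooks)
print(codes)
```

Each frame is thus represented by one index per stage, which is why a 12.5Hz frame rate with 8 stages yields 100 tokens per second.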
## License
[Specify your license here]