# XY Tokenizer

XY Tokenizer is a speech codec that models the semantic and acoustic aspects of speech together: it converts audio into discrete tokens and decodes those tokens back into high-quality audio. It represents speech at only 1 kbps, using 8-layer residual vector quantization (RVQ8) at a 12.5 Hz frame rate.
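The bitrate figure can be sanity-checked with a little arithmetic. The sketch below assumes the 1 kbps figure is exact and that bits are split evenly across the eight codebooks; under those assumptions each token carries 10 bits, which would imply 1024 entries per codebook. The codebook size is an inference, not something stated here, and `token_count` is a hypothetical helper that rounds the frame count up.

```python
import math

FRAME_RATE_HZ = 12.5   # frames per second, from the description above
NUM_QUANTIZERS = 8     # RVQ8: eight residual codebooks per frame
BITRATE_BPS = 1000     # 1 kbps, assumed to be exact

tokens_per_second = FRAME_RATE_HZ * NUM_QUANTIZERS   # 100.0 tokens/s
bits_per_token = BITRATE_BPS / tokens_per_second     # 10.0 bits per token
implied_codebook_size = 2 ** round(bits_per_token)   # 1024 (inferred, not documented)


def token_count(duration_s: float) -> int:
    """Tokens needed for a clip, assuming a partial final frame is rounded up."""
    frames = math.ceil(duration_s * FRAME_RATE_HZ)
    return frames * NUM_QUANTIZERS


print(token_count(30.0))  # 375 frames * 8 codebooks = 3000 tokens
```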

## Features

- **Dual-channel modeling**: Simultaneously captures semantic meaning and acoustic details
- **Efficient representation**: 1 kbps bitrate with RVQ8 quantization at a 12.5 Hz frame rate
- **High-quality audio tokenization**: Convert speech to discrete tokens and back with minimal quality loss
- **Long audio support**: Process audio files longer than 30 seconds using chunking with overlap
- **Batch processing**: Efficiently process multiple audio files in batches
- **24kHz output**: Generate high-quality 24kHz audio output
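Long-audio support relies on chunking with overlap. The following is a minimal sketch of that idea, not the repository's implementation: `split_with_overlap` is a hypothetical helper, and the 1-second overlap default is an assumption, since only the 30-second threshold is stated above.

```python
import numpy as np


def split_with_overlap(wav: np.ndarray, sample_rate: int,
                       chunk_s: float = 30.0, overlap_s: float = 1.0):
    """Return (start_sample, chunk) pairs that cover `wav`.

    Consecutive chunks share `overlap_s` seconds of audio; the last chunk
    may be shorter than `chunk_s`.
    """
    chunk = int(chunk_s * sample_rate)
    hop = chunk - int(overlap_s * sample_rate)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    pieces = []
    start = 0
    while True:
        pieces.append((start, wav[start:start + chunk]))
        if start + chunk >= len(wav):
            break  # this chunk already reaches the end of the signal
        start += hop
    return pieces


# Example: 70 s of silence at a toy sample rate of 100 Hz.
demo = split_with_overlap(np.zeros(7000), sample_rate=100)
print([start for start, _ in demo])
```

How the overlapping regions are merged after decoding (for example by cross-fading or by discarding one side) is not specified here and would need to be checked against `inference.py`.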

## Installation

```bash
# Create and activate conda environment
conda create -n xy_tokenizer python=3.10 -y && conda activate xy_tokenizer

# Install dependencies
pip install -r requirements.txt
```

## Usage

### Basic Inference

To tokenize audio files and reconstruct them:

```bash
python inference.py \
  --config_path ./config/xy_tokenizer_config.yaml \
  --checkpoint_path ./weights/xy_tokenizer.ckpt \
  --input_dir ./input_wavs/ \
  --output_dir ./output_wavs/
```

### Parameters

- `--config_path`: Path to the model configuration file
- `--checkpoint_path`: Path to the pre-trained model checkpoint
- `--input_dir`: Directory containing input WAV files
- `--output_dir`: Directory to save reconstructed audio files
- `--device`: Device to run inference on (default: "cuda")
- `--debug`, `--debug_ip`, `--debug_port`: Debugging options (disabled by default)
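For readers wiring these flags into their own scripts, here is one way such a command-line interface could be declared with `argparse`. This is an illustrative sketch, not the contents of `inference.py`: which arguments are required, and the types and defaults of the three debug options, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed above (required-ness and debug types are assumed)."""
    parser = argparse.ArgumentParser(description="Tokenize and reconstruct WAV files")
    parser.add_argument("--config_path", required=True, help="Model configuration file")
    parser.add_argument("--checkpoint_path", required=True, help="Pre-trained checkpoint")
    parser.add_argument("--input_dir", required=True, help="Directory of input WAV files")
    parser.add_argument("--output_dir", required=True, help="Directory for reconstructed audio")
    parser.add_argument("--device", default="cuda", help="Inference device")
    # Assumption: --debug is a boolean switch; the IP and port are optional values.
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--debug_ip", default=None)
    parser.add_argument("--debug_port", type=int, default=None)
    return parser


args = build_parser().parse_args([
    "--config_path", "./config/xy_tokenizer_config.yaml",
    "--checkpoint_path", "./weights/xy_tokenizer.ckpt",
    "--input_dir", "./input_wavs/",
    "--output_dir", "./output_wavs/",
])
print(args.device, args.debug)
```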

## Project Structure

- `xy_tokenizer/`: Core model implementation
  - `model.py`: Main XY_Tokenizer model class
  - `nn/`: Neural network components
- `config/`: Configuration files
- `utils/`: Utility functions
- `weights/`: Pre-trained model weights
- `input_wavs/`: Directory for input audio files
- `output_wavs/`: Directory for output audio files

## Model Architecture

XY Tokenizer uses a dual-channel architecture that simultaneously models:
1. **Semantic Channel**: Captures high-level semantic information and linguistic content
2. **Acoustic Channel**: Preserves detailed acoustic features including speaker characteristics and prosody

The model processes audio through several stages:
1. Feature extraction (mel-spectrogram)
2. Parallel semantic and acoustic encoding
3. Residual vector quantization with 8 codebooks (RVQ8) at a 12.5 Hz frame rate (1 kbps)
4. Decoding and waveform generation
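The quantization stage can be illustrated with a generic residual vector quantizer. The sketch below is a textbook version in NumPy with random, untrained codebooks; it is not the XY Tokenizer implementation. The function names, the 16-dimensional features, and the 1024-entry codebooks are all assumptions chosen for illustration.

```python
import numpy as np


def rvq_encode(x: np.ndarray, codebooks: list) -> np.ndarray:
    """Quantize frames `x` (frames, dim) into indices of shape (frames, n_codebooks).

    Each stage picks the nearest codeword to the current residual, then
    subtracts it so the next stage quantizes what is left over.
    """
    residual = x.copy()
    indices = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        residual = residual - cb[idx]
    return np.stack(indices, axis=1)


def rvq_decode(indices: np.ndarray, codebooks: list) -> np.ndarray:
    """Reconstruct frames by summing the selected codeword from every stage."""
    return sum(cb[indices[:, q]] for q, cb in enumerate(codebooks))


rng = np.random.default_rng(0)
# Eight stages (RVQ8); later codebooks are scaled down to mimic finer corrections.
codebooks = [rng.normal(size=(1024, 16)) * 0.5 ** q for q in range(8)]
frames = rng.normal(size=(25, 16))  # 25 frames = 2 s at 12.5 Hz

codes = rvq_encode(frames, codebooks)
reconstructed = rvq_decode(codes, codebooks)
print(codes.shape, reconstructed.shape)  # (25, 8) (25, 16)
```

In a trained codec the codebooks are learned jointly with the encoder and decoder, so reconstruction quality there is far better than with the random codebooks used in this toy example.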

## License

[Specify your license here]