# Speech2C
> [**Speech2C**](https://arxiv.org/abs/2203.17113) (`INTERSPEECH 2022`): **Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data**
## Pre-Trained and Fine-tuned Models
| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| :------: | :----------------------------------------------: | :-----------------: | :------: |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | - | [Google Drive](https://drive.google.com/file/d/1nGZ0LWEwlLq2pz7o805YALsMr9irV0Za/view?usp=sharing) |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | [10 hrs LibriSpeech](http://www.openslr.org/12) | [Google Drive](https://drive.google.com/file/d/1nWSAc-33LmcDQHzH8IjXVJsuk0JZTWgN/view?usp=sharing) |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | [100 hrs LibriSpeech](http://www.openslr.org/12) | [Google Drive](https://drive.google.com/file/d/1LwbQ5Y3tKZoK3s1ayLQgsfLTFnmkKNZs/view?usp=sharing) |
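If you prefer the command line, the checkpoints can be fetched with [gdown](https://github.com/wkentaro/gdown) using the file IDs from the Google Drive links above; the output filenames below are just suggestions.

```
# requires gdown: pip install gdown
# file IDs are taken from the Google Drive links above; output names are arbitrary
gdown 1nGZ0LWEwlLq2pz7o805YALsMr9irV0Za -O speech2c_base.pt       # pre-trained, 960h
gdown 1nWSAc-33LmcDQHzH8IjXVJsuk0JZTWgN -O speech2c_base_10h.pt   # fine-tuned, 10h
gdown 1LwbQ5Y3tKZoK3s1ayLQgsfLTFnmkKNZs -O speech2c_base_100h.pt  # fine-tuned, 100h
```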
## Language Model and Vocabulary
| Model | Dataset | Download | Vocabulary |
| :------: | :------: | :------: | :--------: |
| LM | [LibriSpeech LM Dataset](https://www.openslr.org/11/) | [Model](https://drive.google.com/file/d/1UDCcNJT1DlquSRw0iRAXH6GHlf6zK6-8/view?usp=sharing) | [Vocabulary](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt) |
## Setup
```
git submodule update --init Speech2C/fairseq
cd Speech2C/
pip install --editable fairseq/
```
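A quick way to check that the editable install succeeded (the printed version will depend on the submodule commit):

```
# sanity check: the editable install should make fairseq importable
python -c "import fairseq; print(fairseq.__version__)"
```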
## Data Preparation
Please follow the data preparation steps for HuBERT described [here](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#data-preparation); a sketch of the resulting file layout follows.
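For orientation, here is a minimal sketch of the prepared files under the HuBERT conventions; all paths, sample counts, and cluster IDs below are made up:

```
# ${DATA_DIR}/train.tsv: first line is the audio root directory, then one
# "<relative_path><TAB><num_samples>" line per utterance
/path/to/LibriSpeech
train-clean-100/103/1240/103-1240-0000.flac	225360
train-clean-100/103/1240/103-1240-0001.flac	255120

# ${LABEL_DIR}/train.km: one line per utterance with space-separated k-means
# cluster IDs (50 per second of audio); dict.km.txt lists the label inventory
320 320 451 12 12 12 7 ...
```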
## Pre-Training
```
DATA_DIR=       # directory with the .tsv manifests from data preparation
LABEL_DIR=      # directory with the .km labels and dict.km.txt
FAIRSEQ_PATH=   # path to the fairseq submodule

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name speech2c_base_librispeech \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} task.labels='["km"]' \
  model.label_rate=50 common.user_dir=SpeechT5/Speech2C/speech2c
```
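`model.label_rate=50` matches the 50 Hz frame rate of the k-means labels (one label per 20 ms of audio). As a concrete illustration, the variables might be filled in as follows; all paths are placeholders, not values from the repo:

```
# illustrative values only; adjust to your local layout
DATA_DIR=/data/librispeech/manifests   # .tsv manifests from data preparation
LABEL_DIR=/data/librispeech/labels     # .km files and dict.km.txt
FAIRSEQ_PATH=fairseq                   # the submodule checked out during Setup
```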
## Fine-Tuning
```
DATA_DIR=       # directory with the .tsv manifests
LABEL_DIR=      # directory with the fine-tuning labels
FAIRSEQ_PATH=   # path to the fairseq submodule
W2V_PATH=       # path to the pre-trained Speech2C checkpoint
CONFIG_NAME=    # a fine-tuning config from speech2c/config

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name ${CONFIG_NAME} \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} \
  model.w2v_path=${W2V_PATH} common.user_dir=SpeechT5/Speech2C/speech2c
```
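Continuing the illustration, the variables might look like this; the config name below is hypothetical, so check `speech2c/config` for the fine-tuning configs actually shipped with the repo:

```
# hypothetical values for illustration
W2V_PATH=/models/speech2c_base.pt   # pre-trained checkpoint from the table above
CONFIG_NAME=base_100h               # hypothetical name; list speech2c/config for real options
```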
## Inference
Note that joint CTC and decoder inference is only supported with a batch size of 1.
```
FAIRSEQ_PATH=      # path to the fairseq submodule
DATA_DIR=          # directory with the .tsv manifests
LABEL_DIR=         # directory with the letter (.ltr) labels
BEAM_SIZE=         # beam size for decoding
CTC_WEIGHT=        # relative weight of the CTC score in joint decoding
TEST_SET=          # name of the subset to decode
CHECKPOINT_PATH=   # path to the fine-tuned checkpoint
W2V_PATH=          # path to the pre-trained Speech2C checkpoint

python ${FAIRSEQ_PATH}/fairseq_cli/generate.py ${DATA_DIR} \
  --label-dir ${LABEL_DIR} \
  --path ${CHECKPOINT_PATH} \
  --user-dir SpeechT5/Speech2C/speech2c \
  --model-overrides "{'w2v_path': '${W2V_PATH}'}" \
  --gen-subset ${TEST_SET} \
  --task speech2c_pretraining \
  --post-process letter \
  --add-decoder \
  --labels '["ltr"]' \
  --fine-tuning \
  --scoring wer \
  --max-len-a 0 \
  --max-len-b 620 \
  --pad-audio \
  --random-crop \
  --ctc-weight ${CTC_WEIGHT} \
  --max-tokens 8000000 \
  --beam ${BEAM_SIZE} \
  --single-target
```
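For example, decoding a test split might use settings along these lines; the beam size and CTC weight here are illustrative choices, not the paper's reported configuration:

```
# illustrative settings; tune BEAM_SIZE and CTC_WEIGHT on the dev sets
BEAM_SIZE=10
CTC_WEIGHT=0.3        # 0 disables the CTC branch in joint decoding
TEST_SET=test_other   # assumes a manifest named test_other.tsv in ${DATA_DIR}
```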
## Results on LibriSpeech
All numbers are word error rates (WER, %); lower is better.
### Evaluation on the [LibriSpeech](http://www.openslr.org/12) 10hr subset

| Model | LM | test-clean | test-other |
| :------------- | :-----: | :----: | :----: |
| wav2vec 2.0 Base | - | 11.1 | 17.6 |
| HuBERT Base | - | 10.1 | 16.8 |
| **Speech2C** | - | **7.8** | **13.1** |
| wav2vec 2.0 Base | 4-gram | 4.3 | 9.5 |
| wav2vec 2.0 Base | Transf. | 3.2 | 7.8 |
| HuBERT Base | 4-gram | 4.3 | 9.4 |
| **Speech2C** | **Transf.** | **3.1** | **7.0** |
### Evaluation on the [LibriSpeech](http://www.openslr.org/12) 100hr subset

| Model | LM | test-clean | test-other |
| :------------- | :-----: | :----: | :----: |
| wav2vec 2.0 Base | - | 6.1 | 13.3 |
| wav2vec 2.0 Large | - | 4.7 | 9.0 |
| HuBERT Base | - | 6.3 | 13.2 |
| SpeechT5 | - | 4.4 | 10.4 |
| Baseline | - | 5.0 | 11.9 |
| **Speech2C** | - | **4.3** | **9.0** |
| wav2vec 2.0 Base | 4-gram | 3.4 | 8.0 |
| wav2vec 2.0 Base | Transf. | 2.6 | 6.3 |
| HuBERT Base | 4-gram | 3.4 | 8.1 |
| SpeechT5 | Transf. | 2.4 | 5.8 |
| Baseline | Transf. | 2.5 | 6.3 |
| **Speech2C** | **Transf.** | **2.4** | **5.2** |
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the [FAIRSEQ](https://github.com/pytorch/fairseq) project.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
## Reference
If you find our work useful in your research, please cite the following paper:
```bibtex
@article{Ao2022Speech2C,
  title         = {Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data},
  author        = {Junyi Ao and Ziqiang Zhang and Long Zhou and Shujie Liu and Haizhou Li and Tom Ko and Lirong Dai and Jinyu Li and Yao Qian and Furu Wei},
  eprint        = {2203.17113},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  year          = {2022}
}
```