# Speech2C

> [**Speech2C**](https://arxiv.org/abs/2203.17113) (```INTERSPEECH 2022```): **Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data**
## Pre-Trained and Fine-tuned Models

| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| :------: | :----------------------------------------------: | :-----------------: | :-----: |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | - | [Google Drive](https://drive.google.com/file/d/1nGZ0LWEwlLq2pz7o805YALsMr9irV0Za/view?usp=sharing) |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | [10 hrs LibriSpeech](http://www.openslr.org/12) | [Google Drive](https://drive.google.com/file/d/1nWSAc-33LmcDQHzH8IjXVJsuk0JZTWgN/view?usp=sharing) |
| Speech2C | [960 hrs LibriSpeech](http://www.openslr.org/12) | [100 hrs LibriSpeech](http://www.openslr.org/12) | [Google Drive](https://drive.google.com/file/d/1LwbQ5Y3tKZoK3s1ayLQgsfLTFnmkKNZs/view?usp=sharing) |
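On a headless machine you can also fetch the checkpoints from the command line with a tool such as [gdown](https://github.com/wkentaro/gdown). This is just a convenience sketch, not part of the original instructions; the file ID comes from the Google Drive link above and the output file name is only an example:

```
# assumes gdown is available: pip install gdown
pip install gdown

# download the pre-trained Speech2C base checkpoint (file ID taken from the table above)
gdown 1nGZ0LWEwlLq2pz7o805YALsMr9irV0Za -O speech2c_base.pt
```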
## Language Model and Vocabulary

| Model | Dataset | Download | Vocabulary |
| :------: | :------: | :---: | :--------: |
| LM | [LibriSpeech LM Dataset](https://www.openslr.org/11/) | [Model](https://drive.google.com/file/d/1UDCcNJT1DlquSRw0iRAXH6GHlf6zK6-8/view?usp=sharing) | [Vocabulary](https://dl.fbaipublicfiles.com/fairseq/wav2vec/dict.ltr.txt) |
## Setup

```
git submodule update --init Speech2C/fairseq
cd Speech2C/
pip install --editable fairseq/
```
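As a quick sanity check (optional, not part of the original setup steps), you can confirm that the editable fairseq install is importable:

```
# should print the installed fairseq version without errors
python -c "import fairseq; print(fairseq.__version__)"
```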
## Data Preparation

Please follow the HuBERT data preparation steps described [here](https://github.com/facebookresearch/fairseq/tree/main/examples/hubert#data-preparation).
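After those steps you should end up with roughly the layout sketched below. The file names are illustrative only; what matters is that `DATA_DIR` holds the tsv wav manifests and `LABEL_DIR` holds the k-means label files plus their dictionary:

```
# illustrative layout; actual names depend on how you ran the HuBERT preparation scripts
DATA_DIR/
  train.tsv       # wav manifest: first line is the audio root, then <relative_path>\t<num_samples>
  valid.tsv
LABEL_DIR/
  train.km        # one line of space-separated k-means cluster IDs per utterance
  valid.km
  dict.km.txt     # label dictionary used by the "km" label set
```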
## Pre-Training

```
DATA_DIR=       # directory containing the tsv manifests from data preparation
LABEL_DIR=      # directory containing the k-means label files and dictionary
FAIRSEQ_PATH=   # path to the fairseq checkout installed above

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name speech2c_base_librispeech \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} task.labels='["km"]' \
  model.label_rate=50 common.user_dir=SpeechT5/Speech2C/speech2c
```
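If you train on fewer GPUs than the recipe assumes, fairseq's standard Hydra overrides can be appended to the command above. This sketch assumes the Speech2C config exposes fairseq's usual `distributed_training` and `optimization` groups (the common layout for HuBERT-style recipes); the values are illustrative, not the paper's settings:

```
# example: pre-training on 8 GPUs with gradient accumulation (illustrative values only)
python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name speech2c_base_librispeech \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} task.labels='["km"]' \
  model.label_rate=50 common.user_dir=SpeechT5/Speech2C/speech2c \
  distributed_training.distributed_world_size=8 \
  optimization.update_freq='[4]'
```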
## Fine-tuning

```
DATA_DIR=       # directory containing the fine-tuning tsv manifests
LABEL_DIR=      # directory containing the letter ("ltr") label files and dictionary
FAIRSEQ_PATH=   # path to the fairseq checkout installed above
W2V_PATH=       # path to the pre-trained Speech2C checkpoint
CONFIG_NAME=    # fine-tuning config under speech2c/config (e.g. for the 10 hr or 100 hr setup)

python ${FAIRSEQ_PATH}/fairseq_cli/hydra_train.py \
  --config-dir speech2c/config \
  --config-name ${CONFIG_NAME} \
  task.data=${DATA_DIR} task.label_dir=${LABEL_DIR} \
  model.w2v_path=${W2V_PATH} common.user_dir=SpeechT5/Speech2C/speech2c
```
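The exact fine-tuning config names are repo-specific, so rather than guessing, list the configs bundled with Speech2C and pass one of them as `CONFIG_NAME` (Hydra expects the name without the `.yaml` extension):

```
# show the available Hydra configs; pick the one matching your fine-tuning setup
ls speech2c/config/
```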
## Inference

Note that joint CTC and decoder inference is only supported with a batch size of 1.

```
FAIRSEQ_PATH=       # path to the fairseq checkout installed above
DATA_DIR=           # directory containing the test tsv manifests
LABEL_DIR=          # directory containing the letter ("ltr") label files and dictionary
BEAM_SIZE=          # beam size for decoding
CTC_WEIGHT=         # weight of the CTC score in joint CTC/decoder scoring
TEST_SET=           # name of the test manifest to decode (without the .tsv extension)
CHECKPOINT_PATH=    # path to the fine-tuned Speech2C checkpoint
W2V_PATH=           # path to the pre-trained Speech2C checkpoint

python ${FAIRSEQ_PATH}/fairseq_cli/generate.py ${DATA_DIR} \
  --label-dir ${LABEL_DIR} \
  --path ${CHECKPOINT_PATH} \
  --user-dir SpeechT5/Speech2C/speech2c \
  --model-overrides "{'w2v_path': '${W2V_PATH}'}" \
  --gen-subset ${TEST_SET} \
  --task speech2c_pretraining \
  --post-process letter \
  --add-decoder \
  --labels '["ltr"]' \
  --fine-tuning \
  --scoring wer \
  --max-len-a 0 \
  --max-len-b 620 \
  --pad-audio \
  --random-crop \
  --ctc-weight ${CTC_WEIGHT} \
  --max-tokens 8000000 \
  --beam ${BEAM_SIZE} \
  --single-target
```
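For illustration, a decoding run might set the placeholders as below before invoking the command above. The beam size and CTC weight are example values to start from, not the settings reported in the paper; `fairseq_cli/generate.py` prints the result of the selected scorer (here, WER) at the end of its log:

```
# illustrative values only; tune BEAM_SIZE and CTC_WEIGHT for your setup
TEST_SET=test-clean     # assumes a matching test-clean.tsv manifest exists in DATA_DIR
BEAM_SIZE=10
CTC_WEIGHT=0.3
```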
## Results on LibriSpeech

### Evaluation on the [LibriSpeech](http://www.openslr.org/12) 10 hr fine-tuning subset (WER %)

| Model | LM | test-clean | test-other |
| ------------- | ------------- | ---- | ---- |
| wav2vec 2.0 Base | - | 11.1 | 17.6 |
| HuBERT Base | - | 10.1 | 16.8 |
| **Speech2C** | - | **7.8** | **13.1** |
| wav2vec 2.0 Base | 4-gram | 4.3 | 9.5 |
| wav2vec 2.0 Base | Transf. | 3.2 | 7.8 |
| HuBERT Base | 4-gram | 4.3 | 9.4 |
| **Speech2C** | **Transf.** | **3.1** | **7.0** |
### Evaluation on the [LibriSpeech](http://www.openslr.org/12) 100 hr fine-tuning subset (WER %)

| Model | LM | test-clean | test-other |
| ------------- | ------------- | ---- | ---- |
| wav2vec 2.0 Base | - | 6.1 | 13.3 |
| wav2vec 2.0 Large | - | 4.7 | 9.0 |
| HuBERT Base | - | 6.3 | 13.2 |
| SpeechT5 | - | 4.4 | 10.4 |
| Baseline | - | 5.0 | 11.9 |
| **Speech2C** | - | **4.3** | **9.0** |
| wav2vec 2.0 Base | 4-gram | 3.4 | 8.0 |
| wav2vec 2.0 Base | Transf. | 2.6 | 6.3 |
| HuBERT Base | 4-gram | 3.4 | 8.1 |
| SpeechT5 | Transf. | 2.4 | 5.8 |
| Baseline | Transf. | 2.5 | 6.3 |
| **Speech2C** | **Transf.** | **2.4** | **5.2** |
## License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [FAIRSEQ](https://github.com/pytorch/fairseq) project.

[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
## Reference

If you find our work useful in your research, please cite the following paper:
```bibtex
@article{Ao2022Speech2C,
  title         = {Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data},
  author        = {Junyi Ao and Ziqiang Zhang and Long Zhou and Shujie Liu and Haizhou Li and Tom Ko and Lirong Dai and Jinyu Li and Yao Qian and Furu Wei},
  eprint        = {2203.17113},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SD},
  year          = {2022}
}
```