---
language: de
datasets:
- common_voice
inference: false
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: wav2vec 2.0 XLS-R 1B + TEVR tokens + 5-gram LM by Hajo Nils Krabbenhöft
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice de
      type: common_voice
      args: de
    metrics:
    - name: Test WER
      type: wer
      value: 3.6433399042523233
    - name: Test CER
      type: cer
      value: 1.5398893560981173
---
|
|
|
|
## Overview

This folder contains a fully trained German speech recognition pipeline
consisting of an acoustic model using the new wav2vec 2.0 XLS-R 1B **TEVR** architecture
and a 5-gram KenLM language model.
For an explanation of the TEVR enhancements and their motivation, please see our paper:
[TEVR: Improving Speech Recognition by Token Entropy Variance Reduction](https://arxiv.org/abs/2206.12693).

This pipeline achieves a very competitive (as of June 2022) **word error rate of 3.64%** on CommonVoice German
([Papers with Code leaderboard](https://paperswithcode.com/sota/speech-recognition-on-common-voice-german?p=tevr-improving-speech-recognition-by-token)).
The character error rate was 1.54%.
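For readers unfamiliar with the two metrics, the following is a minimal sketch of how a word error rate and a character error rate can be computed from an edit distance. It is an illustration only, not the implementation behind the reported numbers; the helper names `edit_distance` and `error_rate` and the two sample sentences are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def error_rate(references, predictions, unit):
    """unit='word' gives WER, unit='char' gives CER, pooled over all utterances."""
    split = (lambda s: s.split()) if unit == "word" else list
    errors = sum(edit_distance(split(r), split(p))
                 for r, p in zip(references, predictions))
    total = sum(len(split(r)) for r in references)
    return errors / total


# Hypothetical example data: one wrong word out of six, one wrong character out of 28.
refs = ["das ist ein test", "guten morgen"]
hyps = ["das ist ein fest", "guten morgen"]
print(f"WER {100 * error_rate(refs, hyps, 'word'):.2f} %")  # WER 16.67 %
print(f"CER {100 * error_rate(refs, hyps, 'char'):.2f} %")  # CER 3.57 %
```

Because a single wrong character invalidates the whole word, the WER is typically higher than the CER, which matches the relation between the two figures reported above.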
|
|
## Citation

If you use this ASR pipeline for research, please cite:
```bibtex
@misc{https://doi.org/10.48550/arxiv.2206.12693,
  doi = {10.48550/ARXIV.2206.12693},
  url = {https://arxiv.org/abs/2206.12693},
  author = {Krabbenhöft, Hajo Nils and Barth, Erhardt},
  keywords = {Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, F.2.1; I.2.6; I.2.7},
  title = {TEVR: Improving Speech Recognition by Token Entropy Variance Reduction},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```
|
|
## TEVR Tokenizer Creation / Testing

See https://huggingface.co/fxtentacle/tevr-token-entropy-predictor-de for:
- our trained ByT5 model used to calculate the entropies in the paper
- a Jupyter Notebook to generate a TEVR Tokenizer from a text corpus
- a Jupyter Notebook to generate the illustration image in the paper
|
|
## Evaluation

To evaluate this pipeline yourself and/or on your own data, see the `HF Eval Script.ipynb` Jupyter Notebook
or use the following Python script (the `!pip` lines are notebook syntax; in a plain shell, drop the leading `!`):
|
|
|
|
|
|
```python
!pip install --quiet --root-user-action=ignore --upgrade pip
!pip install --quiet --root-user-action=ignore "datasets>=1.18.3" "transformers==4.11.3" librosa jiwer huggingface_hub
!pip install --quiet --root-user-action=ignore https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
!pip install --quiet --root-user-action=ignore --upgrade transformers
!pip install --quiet --root-user-action=ignore torch_audiomentations audiomentations
```
|
|
|
|
```python
from datasets import load_dataset, Audio, load_metric
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
import torchaudio.transforms as T
import torch
import unicodedata
import numpy as np
import re

# load testing dataset
testing_dataset = load_dataset("common_voice", "de", split="test")

# replace invisible characters with space
allchars = list(set([c for t in testing_dataset['sentence'] for c in list(t)]))
map_to_space = [c for c in allchars if unicodedata.category(c)[0] in 'PSZ' and c not in 'ʻ-']
replacements = ''.maketrans(''.join(map_to_space), ''.join(' ' for i in range(len(map_to_space))), '\'ʻ')

def text_fix(text):
    # change ß to ss
    text = text.replace('ß','ss')
    # convert dash to space and remove double-space
    text = text.replace('-',' ').replace('  ',' ').replace('  ',' ')
    # make lowercase
    text = text.lower()
    # remap all invisible characters to space
    text = text.translate(replacements).strip()
    # for easier comparison to Zimmermeister, replace unrepresentable characters with ?
    text = re.sub("[âşěýňעảנźțãòàǔł̇æồאắîשðșęūāñë生בøúıśžçćńřğ]+","?",text)
    # remove multiple spaces (again)
    text = ' '.join([w for w in text.split(' ') if w != ''])
    return text

# load model
model = AutoModelForCTC.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
model.to('cuda')
# load processor
# (the override makes the processor's alphabet check report no missing tokens)
class HajoProcessor(Wav2Vec2ProcessorWithLM):
    @staticmethod
    def get_missing_alphabet_tokens(decoder, tokenizer):
        return []
processor = HajoProcessor.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")

# this function will be called for each WAV file
def predict_single_audio(batch, image=False):
    audio = batch['audio']['array']
    # resample, if needed
    if batch['audio']['sampling_rate'] != 16000:
        audio = T.Resample(orig_freq=batch['audio']['sampling_rate'], new_freq=16000)(torch.from_numpy(audio)).numpy()
    # normalize to zero mean and unit variance
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
    # ask HF processor to prepare audio for GPU eval
    input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values
    # call model on GPU
    with torch.no_grad():
        logits = model(input_values.to('cuda')).logits.cpu().numpy()[0]
    # ask HF processor to decode logits
    decoded = processor.decode(logits, beam_width=500)
    # return as dictionary
    return { 'groundtruth': text_fix(batch['sentence']), 'prediction': decoded.text }

# process all audio files
all_predictions = testing_dataset.map(predict_single_audio, remove_columns=testing_dataset.column_names)

# print results
print('WER', load_metric("wer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
print('CER', load_metric("cer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
```
|
|
    WER 3.6433399042523233 %
    CER 1.5398893560981173 %
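The reference transcripts are normalized by `text_fix` before scoring, so the reported numbers depend on that normalization. The following standalone sketch mirrors its main steps on two made-up sentences, so the effect can be inspected without downloading the dataset or the model. The helper names `build_replacements` and `normalize` and the sample sentences are hypothetical, and the step that replaces unrepresentable characters with `?` is left out for brevity.

```python
import unicodedata


def build_replacements(sentences):
    # Collect every distinct character that occurs in the corpus.
    chars = {c for s in sentences for c in s}
    # Characters whose Unicode major category is P (punctuation), S (symbol)
    # or Z (separator) are mapped to a space, excluding '-' and 'ʻ'.
    to_space = [c for c in chars
                if unicodedata.category(c)[0] in "PSZ" and c not in "ʻ-"]
    # The third argument lists characters to delete; it overrides the space mapping,
    # so the apostrophe is removed rather than replaced by a space.
    return str.maketrans("".join(to_space), " " * len(to_space), "'ʻ")


def normalize(text, table):
    # ß -> ss, dash -> space, lowercase (same order as in text_fix)
    text = text.replace("ß", "ss").replace("-", " ").lower()
    # apply the punctuation/symbol/separator mapping
    text = text.translate(table)
    # collapse runs of whitespace and trim both ends
    return " ".join(text.split())


# Hypothetical sample sentences, not taken from Common Voice.
samples = ["Wie geht's, Herr Müller-Weiß?", "„Guten Tag!“, sagte sie."]
table = build_replacements(samples)
for s in samples:
    print(normalize(s, table))
# wie gehts herr müller weiss
# guten tag sagte sie
```

Note that the translation table is built from the characters that actually occur in the corpus, so punctuation that never appears in the input sentences is left untouched by `translate`.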