	---
	language: de
	datasets:
	- common_voice
	inference: false
	metrics:
	- wer
	- cer
	tags:
	- audio
	- automatic-speech-recognition
	- speech
	- hf-asr-leaderboard
	license: apache-2.0
	model-index:
	- name: wav2vec 2.0 XLS-R 1B + TEVR tokens + 5-gram LM by Hajo Nils Krabbenhöft
	  results:
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice de
	      type: common_voice
	      args: de
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 3.6433399042523233
	    - name: Test CER
	      type: cer
	      value: 1.5398893560981173
	---


	## Overview

	This folder contains a fully trained German speech recognition pipeline
	consisting of an acoustic model using the new wav2vec 2.0 XLS-R 1B TEVR architecture
	and a 5-gram KenLM language model.
	For an explanation of the TEVR enhancements and their motivation, please see our paper:
	[TEVR: Improving Speech Recognition by Token Entropy Variance Reduction](https://arxiv.org/abs/2206.12693).


	[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/tevr-improving-speech-recognition-by-token/speech-recognition-on-common-voice-german)](https://paperswithcode.com/sota/speech-recognition-on-common-voice-german?p=tevr-improving-speech-recognition-by-token)
	As of June 2022, this pipeline achieves a very competitive word error rate (WER) of 3.64% on the CommonVoice German test set.
	The character error rate (CER) is 1.54%.
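
For readers unfamiliar with the two metrics: both are Levenshtein edit distances divided by the reference length, counted over words for WER and over characters for CER. The following is a minimal sketch of that definition, not the scoring code behind the numbers above (the evaluation script below uses the `wer` and `cer` metrics loaded via `datasets`). The helper names `edit_distance` and `error_rate` and the example sentences are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(references, predictions, unit="word"):
    """Total number of edits divided by total reference length, over all sentences."""
    split = (lambda s: s.split()) if unit == "word" else list
    edits = sum(edit_distance(split(r), split(p)) for r, p in zip(references, predictions))
    length = sum(len(split(r)) for r in references)
    return edits / length

# illustrative sentences, not taken from the test set
refs = ["das ist ein test", "guten morgen"]
hyps = ["das ist ein fest", "guten morgen"]
wer = error_rate(refs, hyps, unit="word")
cer = error_rate(refs, hyps, unit="char")
print(f"WER {wer * 100:.2f} %  CER {cer * 100:.2f} %")
```

In this toy example a single wrong letter costs one of six reference words but only one of 28 reference characters, which is why CER is typically well below WER.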

	## Citation

	If you use this ASR pipeline for research, please cite:
	```bibtex
	@misc{https://doi.org/10.48550/arxiv.2206.12693,
	doi = {10.48550/ARXIV.2206.12693},
	url = {https://arxiv.org/abs/2206.12693},
	author = {Krabbenhöft, Hajo Nils and Barth, Erhardt},
	keywords = {Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, F.2.1; I.2.6; I.2.7},
	title = {TEVR: Improving Speech Recognition by Token Entropy Variance Reduction},
	publisher = {arXiv},
	year = {2022},
	copyright = {Creative Commons Attribution 4.0 International}
	}
	```

	## TEVR Tokenizer Creation / Testing

	See https://huggingface.co/fxtentacle/tevr-token-entropy-predictor-de for:
	- our trained ByT5 model used to calculate the entropies in the paper
	- a Jupyter Notebook to generate a TEVR Tokenizer from a text corpus
	- a Jupyter Notebook to generate the illustration image in the paper

	## Evaluation

	To evaluate this pipeline yourself and/or on your own data, see the `HF Eval Script.ipynb` Jupyter Notebook
	or use the following Python script:



	```python
	!pip install --quiet --root-user-action=ignore --upgrade pip
	!pip install --quiet --root-user-action=ignore "datasets>=1.18.3" "transformers==4.11.3" librosa jiwer huggingface_hub
	!pip install --quiet --root-user-action=ignore https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
	!pip install --quiet --root-user-action=ignore --upgrade transformers
	!pip install --quiet --root-user-action=ignore torch_audiomentations audiomentations
	```


	```python
	from datasets import load_dataset, Audio, load_metric
	from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
	import torchaudio.transforms as T
	import torch
	import unicodedata
	import numpy as np
	import re

	# load testing dataset
	testing_dataset = load_dataset("common_voice", "de", split="test")

	# replace invisible characters with space
	allchars = list(set([c for t in testing_dataset['sentence'] for c in list(t)]))
	map_to_space = [c for c in allchars if unicodedata.category(c)[0] in 'PSZ' and c not in 'ʻ-']
	replacements = ''.maketrans(''.join(map_to_space), ''.join(' ' for i in range(len(map_to_space))), '\'ʻ')

	def text_fix(text):
	    # change ß to ss
	    text = text.replace('ß','ss')
	    # convert dash to space and remove double-space
	    text = text.replace('-',' ').replace('  ',' ').replace('  ',' ')
	    # make lowercase
	    text = text.lower()
	    # remap all invisible characters to space
	    text = text.translate(replacements).strip()
	    # for easier comparison to Zimmermeister, replace unrepresentable characters with ?
	    text = re.sub("[âşěýňעảנźțãòàǔł̇æồאắîשðșęūāñë生בøúıśžçćńřğ]+","?",text)
	    # remove multiple spaces (again)
	    text = ' '.join([w for w in text.split(' ') if w != ''])
	    return text

	# load model
	model = AutoModelForCTC.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
	model.to('cuda')
	# load processor
	class HajoProcessor(Wav2Vec2ProcessorWithLM):
	    @staticmethod
	    def get_missing_alphabet_tokens(decoder, tokenizer):
	        return []
	processor = HajoProcessor.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")

	# this function will be called for each WAV file
	def predict_single_audio(batch, image=False):
	    audio = batch['audio']['array']
	    # resample, if needed
	    if batch['audio']['sampling_rate'] != 16000:
	        audio = T.Resample(orig_freq=batch['audio']['sampling_rate'], new_freq=16000)(torch.from_numpy(audio)).numpy()
	    # normalize
	    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
	    # ask HF processor to prepare audio for GPU eval
	    input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values
	    # call model on GPU
	    with torch.no_grad():
	        logits = model(input_values.to('cuda')).logits.cpu().numpy()[0]
	    # ask HF processor to decode logits
	    decoded = processor.decode(logits, beam_width=500)
	    # return as dictionary
	    return { 'groundtruth': text_fix(batch['sentence']), 'prediction': decoded.text }

	# process all audio files
	all_predictions = testing_dataset.map(predict_single_audio, remove_columns=testing_dataset.column_names)

	# print results
	print('WER', load_metric("wer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
	print('CER', load_metric("cer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
	```

	WER 3.6433399042523233 %
	CER 1.5398893560981173 %
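
Because the reference transcripts are normalized before scoring, the error rates above depend on that normalization. The following is a reduced sketch of the main steps of `text_fix` applied to a single sentence. It assumes a per-character check instead of the translation table built from the dataset, and it leaves out the regex for unrepresentable characters. The name `normalize_reference` and the example sentences are hypothetical.

```python
import unicodedata

def normalize_reference(text):
    # change ß to ss and treat dashes as word separators
    text = text.replace('ß', 'ss').replace('-', ' ')
    # make lowercase
    text = text.lower()
    # drop apostrophes; map punctuation, symbols and separators to spaces
    # (simplification: decided per character, not via a precomputed table)
    chars = []
    for c in text:
        if c in "'ʻ":
            continue
        chars.append(' ' if unicodedata.category(c)[0] in 'PSZ' else c)
    # collapse runs of whitespace into single spaces
    return ' '.join(''.join(chars).split())

print(normalize_reference("Die Straße ist nass – wegen des Regen-Wetters!"))
# prints: die strasse ist nass wegen des regen wetters
```

When evaluating on your own data, applying the same normalization to your reference transcripts keeps the resulting WER and CER comparable to the figures reported here.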