	---
	license: apache-2.0
	language:
	- sl
	- hr
	- sr
	- cs
	- pl
	base_model:
	- facebook/w2v-bert-2.0
	pipeline_tag: audio-classification
	metrics:
	- f1
	- recall
	- precision
	---


	# Frame classification for filled pauses

	This model classifies individual 20 ms frames of audio according to whether
	they contain a filled pause ("eee", "errm", ...).


	# Training data

	The model was trained on the human-annotated Slovenian speech corpus
	[ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into
	chunks of at most 30 s.

	# Evaluation

	Although the model outputs a sequence of 0s and 1s, one label per 20 ms frame,
	the evaluation was done at the event level: spans of consecutive 1s were
	merged into a single event. When a true and a predicted
	event overlap, even partially, this is counted as a true positive.
	We report precision, recall, and F1-score for the positive class.
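
The event-level scoring described above can be sketched as follows. This is a minimal illustration, not the evaluation script used for the reported numbers: the helper names `frames_to_events`, `overlaps`, and `event_level_scores` are hypothetical, and the matching rule (any overlap with any event on the other side counts as a match) is an assumption, since the exact rule is not spelled out here.

```python
def frames_to_events(frames):
    """Group consecutive 1-frames into (start_frame, end_frame) spans, end exclusive."""
    events, start = [], None
    for i, label in enumerate(frames):
        if label == 1 and start is None:
            start = i  # a new event begins
        elif label != 1 and start is not None:
            events.append((start, i))  # the event ended on the previous frame
            start = None
    if start is not None:  # event runs until the last frame
        events.append((start, len(frames)))
    return events


def overlaps(a, b):
    """True if two half-open spans share at least one frame."""
    return a[0] < b[1] and b[0] < a[1]


def event_level_scores(y_true, y_pred):
    """Return (precision, recall, F1) of the positive class at event level.

    Assumption: a predicted event is correct if it overlaps any true event,
    and a true event is found if it overlaps any predicted event.
    """
    true_events = frames_to_events(y_true)
    pred_events = frames_to_events(y_pred)
    correct_pred = sum(any(overlaps(p, t) for t in true_events) for p in pred_events)
    found_true = sum(any(overlaps(t, p) for p in pred_events) for t in true_events)
    precision = correct_pred / len(pred_events) if pred_events else 0.0
    recall = found_true / len(true_events) if true_events else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: one of two true events is found, one of two predictions is correct.
y_true = [0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
y_pred = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
print(event_level_scores(y_true, y_pred))  # (0.5, 0.5, 0.5)
```

Because only overlap is required, a prediction that is shifted by a frame or two relative to the annotation still counts as a hit, which is why event-level scores can differ noticeably from frame-level ones.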

	## Evaluation on ROG corpus



	| postprocessing | recall | precision | F1 |
	|------:|---------:|------------:|------:|
	| none | 0.981 | 0.955 | 0.968 |


	## Evaluation on ParlaSpeech corpora

	For every language in the
	[ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795),
	400 instances were sampled and annotated by human annotators.


	Because the ParlaSpeech corpora are too big to be segmented manually as ROG is,
	we observed a few failure modes during inference, which post-processing
	can mitigate. Some false positives
	were caused by improper audio segmentation, which is
	why discarding predictions that start at the very start of the audio or
	end at the very end of the audio can be beneficial. Another failure mode
	is the prediction of very short events, which
	can be safely discarded.
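
As a rough illustration of these filters, the sketch below applies them to a toy list of predicted intervals. The helper `filter_intervals` and the example values are hypothetical; the 0.08 s cutoff mirrors the default of `short_cutoff_s` in this model card's example code, and the interval format (start and end in seconds) is assumed to match its output.

```python
def filter_intervals(intervals, audio_duration_s, short_cutoff_s=0.08):
    """Keep only intervals that are long enough and do not touch an audio edge."""
    kept = []
    for start_s, end_s in intervals:
        touches_start = start_s <= 0.0  # likely a segmentation artifact
        touches_end = end_s >= audio_duration_s  # likely a segmentation artifact
        too_short = (end_s - start_s) < short_cutoff_s  # likely a spurious blip
        if not (touches_start or touches_end or too_short):
            kept.append((start_s, end_s))
    return kept


# Toy predictions for a 30 s chunk: only the third interval survives.
predicted = [(0.0, 0.30), (1.20, 1.24), (2.50, 3.10), (29.60, 30.0)]
print(filter_intervals(predicted, audio_duration_s=30.0))  # [(2.5, 3.1)]
```

The edge filters trade a little recall for precision: a genuine filled pause cut in half by a chunk boundary is dropped along with the artifacts.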

	With added postprocessing, the model achieves the following metrics:


	| lang | postprocessing | recall | precision | F1 |
	|:-------|:-----------------------|---------:|------------:|------:|
	| CZ | drop_short_initial_and_final | 0.889 | 0.859 | 0.874 |
	| HR | drop_short_initial_and_final | 0.94 | 0.887 | 0.913 |
	| PL | drop_short_initial_and_final | 0.903 | 0.947 | 0.924 |
	| RS | drop_short_initial_and_final | 0.966 | 0.915 | 0.94 |

	For details on postprocessing, see the function `frames_to_intervals` in the code snippet below.

	# Example use:
	```python

	from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification
	from datasets import Dataset, Audio
	import torch
	import numpy as np
	from pathlib import Path

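	# Use the GPU when available, otherwise fall back to the CPU
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")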
	model_name = "classla/wav2vecbert2-filledPause"
	feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
	model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

	# Replace the path below with the path(s) to your own audio file(s)
	ds = Dataset.from_dict(
	    {
	        "audio": [
	            "/cache/peterr/mezzanine_resources/filled_pauses/data/dev/Iriss-J-Gvecg-P500001-avd_2082.293_2112.194.wav"
	        ],
	    }
	).cast_column("audio", Audio(sampling_rate=16_000, mono=True))


	def frames_to_intervals(
	    frames: list[int],
	    drop_short=True,
	    drop_initial=True,
	    drop_final=True,
	    short_cutoff_s=0.08,
	) -> list[tuple[float, float]]:
	    """Transforms a list of ones or zeros, corresponding to annotations on frame
	    levels, to a list of intervals ([start second, end second]).

	    Allows for additional filtering on duration (false positives are often
	    short) and start times (false positives starting at 0.0 are often an
	    artifact of poor segmentation).

	    :param list[int] frames: Input frame labels
	    :param bool drop_short: Drop everything shorter than short_cutoff_s,
	        defaults to True
	    :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
	    :param bool drop_final: Drop predictions ending at audio end, defaults to True
	    :param float short_cutoff_s: Duration in seconds of shortest allowable
	        prediction, defaults to 0.08

	    :return list[tuple[float, float]]: List of intervals [start_s, end_s]
	    """
	    from itertools import pairwise  # requires Python 3.10+
	    import pandas as pd

	    results = []
	    ndf = pd.DataFrame(
	        data={
	            "time_s": [0.020 * i for i in range(len(frames))],
	            "frames": frames,
	        }
	    )
	    ndf = ndf.dropna()
	    # Indices where the label changes (index 0 is always included)
	    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
	    # Append the frame count as a sentinel so that a run lasting until the end
	    # of the audio also becomes an interval (which drop_final can then remove)
	    boundaries = [*indices_of_change, len(frames)]
	    audio_end_s = round(0.020 * len(frames), 3)
	    for si, ei in pairwise(boundaries):
	        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
	            continue  # run of zeros: no filled pause
	        results.append(
	            (
	                round(float(ndf.loc[si, "time_s"]), 3),
	                round(0.020 * float(ei), 3),  # end is exclusive
	            )
	        )
	    if drop_short and (len(results) > 0):
	        results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
	    if drop_initial and (len(results) > 0):
	        results = [i for i in results if i[0] != 0.0]
	    if drop_final and (len(results) > 0):
	        results = [i for i in results if i[1] != audio_end_s]
	    return results


	def evaluator(chunks):
	    sampling_rate = chunks["audio"][0]["sampling_rate"]
	    with torch.no_grad():
	        inputs = feature_extractor(
	            [i["array"] for i in chunks["audio"]],
	            return_tensors="pt",
	            sampling_rate=sampling_rate,
	        ).to(device)
	        logits = model(**inputs).logits
	    y_pred = np.array(logits.cpu()).argmax(axis=-1)
	    intervals = [frames_to_intervals(i) for i in y_pred]
	    return {"y_pred": y_pred.tolist(), "intervals": intervals}


	ds = ds.map(evaluator, batched=True)
	print(ds["y_pred"][0])
	# Prints a list of 20ms frames: [0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0....]
	# with 0 indicating no filled pause detected in that frame

	print(ds["intervals"][0])
	# Prints the identified intervals as a list of [start_s, end_s]:
	# [[0.08, 0.28], ...]
	```



	# Citation
	Coming soon.