# Frame classification for filled pauses

This model classifies individual 20ms frames of audio based on the presence of filled pauses ("eee", "errm", ...).
# Training data

The model was trained on the human-annotated Slovenian speech corpus [ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into chunks at most 30s long.
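The chunking step can be sketched as follows. This is a minimal illustration, assuming 16 kHz mono audio held as a flat sequence of samples; the actual segmentation of ROG-Artur may differ (e.g. cutting at utterance boundaries rather than fixed offsets):

```python
def chunk_audio(samples, sampling_rate=16_000, max_seconds=30):
    """Split a waveform into consecutive chunks of at most `max_seconds`."""
    max_len = int(max_seconds * sampling_rate)
    return [samples[i:i + max_len] for i in range(0, len(samples), max_len)]

# 70 s of audio at 16 kHz -> chunks of 30 s, 30 s, and 10 s
chunks = chunk_audio([0.0] * (70 * 16_000))
print([len(c) / 16_000 for c in chunks])  # [30.0, 30.0, 10.0]
```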
# Evaluation

Although the model outputs a series of 0s and 1s, one per 20ms frame, the evaluation was done on the event level: spans of consecutive 1s were bundled together into one event. When a true and a predicted event partially overlap, this is counted as a true positive. We report precision, recall, and F1 of the positive class.
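The event-level matching described above can be sketched in plain Python. This is an illustrative reimplementation, not the authors' evaluation code; `frames_to_events` bundles runs of consecutive 1s into `(start, end)` index spans, and any partial overlap between a true and a predicted span counts as a true positive:

```python
def frames_to_events(frames):
    """Bundle runs of consecutive 1s into (start, end) events (end exclusive)."""
    events, start = [], None
    for i, f in enumerate(frames):
        if f == 1 and start is None:
            start = i
        elif f == 0 and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(frames)))
    return events

def overlaps(a, b):
    """Two half-open spans overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def event_prf(true_frames, pred_frames):
    """Event-level precision/recall/F1; partial overlap counts as a hit."""
    true_ev = frames_to_events(true_frames)
    pred_ev = frames_to_events(pred_frames)
    tp_pred = sum(any(overlaps(p, t) for t in true_ev) for p in pred_ev)
    tp_true = sum(any(overlaps(t, p) for p in pred_ev) for t in true_ev)
    precision = tp_pred / len(pred_ev) if pred_ev else 0.0
    recall = tp_true / len(true_ev) if true_ev else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, a prediction that partially overlaps one of two true events scores precision 1.0, recall 0.5:

```python
event_prf([0, 1, 1, 0, 0, 1, 1, 0],
          [0, 0, 1, 1, 0, 0, 0, 0])  # (1.0, 0.5, 0.666...)
```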
## Evaluation on ROG corpus

Evaluation was performed on the test split of the same dataset.

| postprocessing | recall | precision |    F1 |
|---------------:|-------:|----------:|------:|
|           none |  0.981 |     0.955 | 0.968 |
## Evaluation on ParlaSpeech corpora

For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795), 400 instances were sampled and annotated by human annotators.
Since the ParlaSpeech corpora are too big to be manually segmented as ROG was, we observed a few failure modes when inferring, and found that post-processing can improve the results. False positives were often caused by improper audio segmentation, which is why discarding predictions that start at the very start or end at the very end of the audio can be beneficial. Another failure mode is predicting very short events, which is why very short predictions can also be safely discarded.
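Both heuristics amount to a simple filter over the predicted events. A hypothetical sketch, where events are `(start, end)` frame-index spans and the minimum-length threshold is an assumed parameter, not a value stated in this model card:

```python
def postprocess(events, n_frames, min_len=3):
    """Drop predicted events that touch the audio boundaries or are very short.

    `events` are (start, end) frame-index spans with `end` exclusive;
    `n_frames` is the total number of 20ms frames in the clip.
    """
    return [
        (s, e) for s, e in events
        if s > 0              # drop events starting at the start of the audio
        and e < n_frames      # drop events ending at the end of the audio
        and e - s >= min_len  # drop very short events (< min_len frames)
    ]

events = [(0, 4), (10, 11), (20, 30), (95, 100)]
print(postprocess(events, n_frames=100))  # [(20, 30)]
```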
With added postprocessing, the model achieves the following metrics: