Commit a34474e (verified) by 5roop, parent a816954: Update README.md

Files changed (1): README.md (+22 −11)
# Frame classification for filled pauses

This model classifies individual 20ms frames of audio based on the presence of filled pauses ("eee", "errm", ...).
# Training data

The model was trained on the human-annotated Slovenian speech corpus [ROG-Artur](http://hdl.handle.net/11356/1992). Recordings from the train split were segmented into chunks of at most 30 seconds.
 
# Evaluation

Although the model outputs a series of 0s and 1s, one per 20ms frame, the evaluation was done at the event level: spans of consecutive 1 outputs were bundled together into one event. When a true and a predicted event partially overlap, this is counted as a true positive. We report precision, recall, and F1-score of the positive class.

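The event-level scoring described above can be sketched as follows. The helper names are ours, and any partial overlap counts as a hit, per the description; how one-to-many overlaps are resolved is not specified in this card, so this is an approximation, not the exact evaluation script.

```python
def to_events(frame_labels):
    """Bundle runs of consecutive 1 labels (20 ms frames) into half-open (start, end) spans."""
    events, start = [], None
    for i, label in enumerate(frame_labels):
        if label == 1 and start is None:
            start = i                     # a filled-pause event opens here
        elif label != 1 and start is not None:
            events.append((start, i))
            start = None
    if start is not None:                 # event runs to the end of the audio
        events.append((start, len(frame_labels)))
    return events

def overlaps(a, b):
    """True if the half-open spans a and b share at least one frame."""
    return a[0] < b[1] and b[0] < a[1]

def event_prf(gold_frames, pred_frames):
    """Event-level precision/recall/F1 where any partial overlap counts as a true positive."""
    gold, pred = to_events(gold_frames), to_events(pred_frames)
    tp_pred = sum(any(overlaps(p, g) for g in gold) for p in pred)
    tp_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

gold = [0, 1, 1, 0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0, 0, 0, 0]       # hits the first event, misses the second
p, r, f = event_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")  # P=1.000 R=0.500 F1=0.667
```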
  ## Evaluation on ROG corpus

| postprocessing | recall | precision | F1 |
|------:|---------:|------------:|------:|
| none | 0.981 | 0.955 | 0.968 |
 
## Evaluation on ParlaSpeech corpora

For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795), 400 instances were sampled and annotated by human annotators.

Since the ParlaSpeech corpora are too big to be manually segmented the way ROG was, we observed a few failure modes at inference time, and found that post-processing improves results. False positives were often caused by improper audio segmentation, so discarding predictions that start at the very beginning or end at the very end of the audio can be beneficial. Another failure mode is predicting very short events, so very short predictions can also be safely discarded.
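Both filters can be sketched in a few lines. The function name and the 100 ms minimum duration are our assumptions; the model card does not state the exact threshold used.

```python
def postprocess(events, audio_s, min_s=0.1):
    """Drop predicted events that touch the audio boundaries or are shorter than min_s.

    events are (start_s, end_s) pairs in seconds; min_s = 100 ms is an assumed threshold.
    """
    kept = []
    for start, end in events:
        if start <= 0.0 or end >= audio_s:
            continue  # likely an artefact of improper audio segmentation
        if end - start < min_s:
            continue  # too short to be a plausible filled pause
        kept.append((start, end))
    return kept

events = [(0.0, 0.3), (1.0, 1.04), (2.0, 2.5), (9.8, 10.0)]
print(postprocess(events, audio_s=10.0))  # [(2.0, 2.5)]
```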

With added postprocessing, the model achieves the following metrics: