5roop committed on
Commit 424f11f · verified · 1 Parent(s): 7b97465

Update README.md

Files changed (1)
  1. README.md +11 -33
README.md CHANGED

@@ -28,21 +28,13 @@ Although the output of the model is a series 0 or 1, describing their 20ms frames
  event level; spans of consecutive outputs 1 were bundled together into one event. When the true and predicted
  events partially overlap, this is counted as a true positive. We report precisions, recalls, and f1-scores of the positive class.
 
- We observed several failure modes of the automatic inferrence process and designed post-processing steps to mitigate them.
- False positives were observed to be caused by improper audio segmentation, which is why disabling predictions that start at the start of the audio or
- end at the end of the audio can be beneficial. Another failure mode is predicting very short events, which is why ignoring very short predictions
- can be safely discarded.
 
  ## Evaluation on ROG corpus
 
 
- | postprocessing | recall | precision | F1 |
- |:-----------------------|---------:|------------:|------:|
- | raw | 0.981 | 0.955 | 0.968 |
- | drop_short | 0.981 | 0.957 | 0.969 |
- | drop_short_initial_and_final | 0.964 | 0.966 | 0.965 |
- | drop_short_and_initial | 0.964 | 0.966 | 0.965 |
- | drop_initial | 0.964 | 0.963 | 0.963 |
+ | recall | precision | F1 |
+ |---------:|------------:|------:|
+ | 0.981 | 0.955 | 0.968 |
 
 
  ## Evaluation on ParlaSpeech corpora
@@ -50,35 +42,21 @@ can be safely discarded.
  For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795),
  400 instances were sampled and annotated by human annotators.
 
- Evaluation on human-annotated instances produced the following metrics:
+
+ Since the ParlaSpeech corpora are too large to be segmented manually the way the ROG corpus was, we observed several failure modes during inference
+ and found that post-processing improves the results. False positives were often caused by improper audio segmentation, which is why discarding
+ predictions that start at the very beginning of the audio or end at the very end of the audio can be beneficial. Another failure mode is predicting
+ very short events, which is why very short predictions can be safely discarded.
+
+ With added post-processing, the model achieves the following metrics:
 
 
  | lang | postprocessing | recall | precision | F1 |
  |:-------|:-----------------------|---------:|------------:|------:|
  | CZ | drop_short_initial_and_final | 0.889 | 0.859 | 0.874 |
- | CZ | drop_short_and_initial | 0.889 | 0.859 | 0.874 |
- | CZ | drop_short | 0.905 | 0.833 | 0.868 |
- | CZ | drop_initial | 0.889 | 0.846 | 0.867 |
- | CZ | raw | 0.905 | 0.814 | 0.857 |
  | HR | drop_short_initial_and_final | 0.94 | 0.887 | 0.913 |
- | HR | drop_short_and_initial | 0.94 | 0.887 | 0.913 |
- | HR | drop_short | 0.94 | 0.884 | 0.911 |
- | HR | drop_initial | 0.94 | 0.875 | 0.906 |
- | HR | raw | 0.94 | 0.872 | 0.905 |
- | PL | drop_short | 0.906 | 0.947 | 0.926 |
  | PL | drop_short_initial_and_final | 0.903 | 0.947 | 0.924 |
- | PL | drop_short_and_initial | 0.903 | 0.947 | 0.924 |
- | PL | raw | 0.91 | 0.924 | 0.917 |
- | PL | drop_initial | 0.908 | 0.924 | 0.916 |
- | RS | drop_short | 0.966 | 0.915 | 0.94 |
  | RS | drop_short_initial_and_final | 0.966 | 0.915 | 0.94 |
- | RS | drop_short_and_initial | 0.966 | 0.915 | 0.94 |
- | RS | drop_initial | 0.974 | 0.9 | 0.936 |
- | RS | raw | 0.974 | 0.9 | 0.936 |
-
- The metrics reported are on event level, which means that if true and
- predicted filled pauses at least partially overlap, we count them as a
- True Positive event.
 
 
 
@@ -109,7 +87,7 @@ def frames_to_intervals(
      frames: list[int],
      drop_short=True,
      drop_initial=True,
-     drop_final=False,
+     drop_final=True,
      short_cutoff_s=0.08,
  ) -> list[tuple[float]]:
      """Transforms a list of ones or zeros, corresponding to annotations on frame