File size: 5,840 Bytes
f47dcce de1360a 62b1587 f47dcce f07b7e9 f47dcce b5fc3ba f47dcce d13b9c6 53c5a64 b428b99 a2c2fab f47dcce b5fc3ba c540094 b5fc3ba c540094 2dbbe55 f47dcce 9640072 b5fc3ba b30abee b5fc3ba 96b36e3 b5fc3ba 5702069 2dbbe55 c540094 2dbbe55 b5fc3ba c540094 2dbbe55 c540094 2dbbe55 c540094 2dbbe55 f47dcce c540094 b5fc3ba c540094 2dbbe55 f47dcce c540094 f47dcce c540094 b5fc3ba c540094 db94671 f47dcce df706b9 2dbbe55 b5fc3ba 2dbbe55 b5fc3ba db94671 49b252b f47dcce db94671 f47dcce |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 |
---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-model
- audio
- voice
- speech
- speaker
- speaker-segmentation
- voice-activity-detection
- overlapped-speech-detection
- resegmentation
license: mit
inference: false
extra_gated_prompt: "The collected information will help acquire a better knowledge of pyannote.audio userbase and help its maintainers apply for grants to improve it further. If you are an academic researcher, please cite the relevant papers in your own publications using the model. If you work for a company, please consider contributing back to pyannote.audio development (e.g. through unrestricted gifts). We also provide scientific consulting services around speaker diarization and machine listening."
extra_gated_fields:
Company/university: text
Website: text
I plan to use this model for (task, type of audio data, etc): text
---
Using this open-source model in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.
# 🎹 Speaker segmentation
[Paper](http://arxiv.org/abs/2104.04045) | [Demo](https://huggingface.co/spaces/pyannote/pretrained-pipelines) | [Blog post](https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)

## Usage
Relies on pyannote.audio 2.1.1: see [installation instructions](https://github.com/pyannote/pyannote-audio).
```python
# 1. visit hf.co/pyannote/segmentation and accept user conditions
# 2. visit hf.co/settings/tokens to create an access token
# 3. instantiate pretrained model
from pyannote.audio import Model
model = Model.from_pretrained("pyannote/segmentation",
use_auth_token="ACCESS_TOKEN_GOES_HERE")
```
### Voice activity detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
# onset/offset activation thresholds
"onset": 0.5, "offset": 0.5,
# remove speech regions shorter than that many seconds.
"min_duration_on": 0.0,
# fill non-speech regions shorter than that many seconds.
"min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
# `vad` is a pyannote.core.Annotation instance containing speech regions
```
### Overlapped speech detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection
pipeline = OverlappedSpeechDetection(segmentation=model)
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
# `osd` is a pyannote.core.Annotation instance containing overlapped speech regions
```
### Resegmentation
```python
from pyannote.audio.pipelines import Resegmentation
pipeline = Resegmentation(segmentation=model,
diarization="baseline")
pipeline.instantiate(HYPER_PARAMETERS)
resegmented_baseline = pipeline({"audio": "audio.wav", "baseline": baseline})
# where `baseline` should be provided as a pyannote.core.Annotation instance
```
### Raw scores
```python
from pyannote.audio import Inference
inference = Inference(model)
segmentation = inference("audio.wav")
# `segmentation` is a pyannote.core.SlidingWindowFeature
# instance containing raw segmentation scores like the
# one pictured above (output)
```
## Citation
```bibtex
@inproceedings{Bredin2021,
Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
Booktitle = {Proc. Interspeech 2021},
Address = {Brno, Czech Republic},
Month = {August},
Year = {2021},
```
```bibtex
@inproceedings{Bredin2020,
Title = {{pyannote.audio: neural building blocks for speaker diarization}},
Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
Address = {Barcelona, Spain},
Month = {May},
Year = {2020},
}
```
## Reproducible research
In order to reproduce the results of the paper ["End-to-end speaker segmentation for overlap-aware resegmentation
"](https://arxiv.org/abs/2104.04045), use `pyannote/segmentation@Interspeech2021` with the following hyper-parameters:
| Voice activity detection | `onset` | `offset` | `min_duration_on` | `min_duration_off` |
| ------------------------ | ------- | -------- | ----------------- | ------------------ |
| AMI Mix-Headset | 0.684 | 0.577 | 0.181 | 0.037 |
| DIHARD3 | 0.767 | 0.377 | 0.136 | 0.067 |
| VoxConverse | 0.767 | 0.713 | 0.182 | 0.501 |
| Overlapped speech detection | `onset` | `offset` | `min_duration_on` | `min_duration_off` |
| --------------------------- | ------- | -------- | ----------------- | ------------------ |
| AMI Mix-Headset | 0.448 | 0.362 | 0.116 | 0.187 |
| DIHARD3 | 0.430 | 0.320 | 0.091 | 0.144 |
| VoxConverse | 0.587 | 0.426 | 0.337 | 0.112 |
| Resegmentation of VBx | `onset` | `offset` | `min_duration_on` | `min_duration_off` |
| --------------------- | ------- | -------- | ----------------- | ------------------ |
| AMI Mix-Headset | 0.542 | 0.527 | 0.044 | 0.705 |
| DIHARD3 | 0.592 | 0.489 | 0.163 | 0.182 |
| VoxConverse | 0.537 | 0.724 | 0.410 | 0.563 |
Expected outputs (and VBx baseline) are also provided in the `/reproducible_research` sub-directories.
|