---
language:
- "fr"
tags:
- "audio"
- "speech"
- "speaker-diarization"
- "medkit"
- "pyannote-audio"
datasets:
- "common_voice"
- "pxcorpus"
- "simsamu"
metrics:
- "der"
---

# Simsamu diarization pipeline

This repository contains a pretrained
[pyannote-audio](https://github.com/pyannote/pyannote-audio) diarization
pipeline that was fine-tuned on the
[Simsamu](https://huggingface.co/datasets/medkit/simsamu) dataset.

The pipeline uses a fine-tuned segmentation model based on
https://huggingface.co/pyannote/segmentation-3.0 and pretrained embeddings from
https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM. The pipeline
hyperparameters were optimized.
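The metadata of this card lists `der` (diarization error rate) as its metric. As a rough illustration of what that metric measures, here is a minimal frame-based sketch in plain Python. It is not how this pipeline was evaluated: it assumes non-overlapping turns given as hypothetical `(start, end, speaker)` tuples in seconds, applies no collar, and finds the speaker mapping by brute force, which is only practical for a handful of speakers. The function names are illustrative.

```python
from itertools import permutations


def label_frames(turns, n_frames, step):
    """Return one speaker label per frame, or None where nobody speaks."""
    labels = [None] * n_frames
    for start, end, speaker in turns:
        first = round(start / step)
        last = min(round(end / step), n_frames)
        for i in range(first, last):
            labels[i] = speaker
    return labels


def diarization_error_rate(reference, hypothesis, step=0.01):
    """Simplified frame-based DER for non-overlapping (start, end, speaker) turns."""
    all_turns = list(reference) + list(hypothesis)
    if not all_turns:
        raise ValueError("no speech turns given")
    n_frames = round(max(end for _, end, _ in all_turns) / step)
    ref = label_frames(reference, n_frames, step)
    hyp = label_frames(hypothesis, n_frames, step)
    pairs = list(zip(ref, hyp))

    total = sum(r is not None for r, _ in pairs)
    if total == 0:
        raise ValueError("reference contains no speech")
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    ref_speakers = sorted({r for r in ref if r is not None})
    hyp_speakers = sorted({h for h in hyp if h is not None})
    # Pad with None so that surplus hypothesis speakers can stay unmatched.
    padding = max(0, len(hyp_speakers) - len(ref_speakers))
    targets = ref_speakers + [None] * padding

    def confusion_for(mapping):
        # Frames where both sides have speech but the mapped speaker differs.
        return sum(
            r is not None and h is not None and mapping[h] != r
            for r, h in pairs
        )

    # Try every one-to-one speaker mapping and keep the most favourable one.
    confusion = min(
        confusion_for(dict(zip(hyp_speakers, perm)))
        for perm in permutations(targets, len(hyp_speakers))
    )
    return (missed + false_alarm + confusion) / total


# Hypothetical example: 1 s of confusion and 1 s of missed speech out of 10 s.
reference = [(0.0, 5.0, "A"), (5.0, 10.0, "B")]
hypothesis = [(0.0, 4.0, "spk0"), (4.0, 9.0, "spk1")]
print(diarization_error_rate(reference, hypothesis))  # 0.2
```

Dedicated evaluation tools typically use an optimal assignment algorithm instead of brute force and handle overlapping speech, so their numbers can differ from this sketch.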

The pipeline can be used with [medkit](https://github.com/medkit-lib/medkit/)
in the following way:

```
from medkit.core.audio import AudioDocument
from medkit.audio.segmentation.pa_speaker_detector import PASpeakerDetector

# init speaker detector operation
speaker_detector = PASpeakerDetector(
    model="medkit/simsamu-diarization",
    device=0,
    segmentation_batch_size=10,
    embedding_batch_size=10,
)

# create audio document
audio_doc = AudioDocument.from_file("path/to/audio.wav")

# apply operation on audio document
speech_segments = speaker_detector.run([audio_doc.raw_segment])

# display each speech turn and corresponding speaker
for speech_seg in speech_segments:
    speaker_attr = speech_seg.attrs.get(label="speaker")[0]
    print(speech_seg.span.start, speech_seg.span.end, speaker_attr.value)
```
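Once the speech turns are available, a common next step is to summarize them per speaker. The sketch below works on plain `(start, end, speaker)` tuples, which could be built from the `span.start`, `span.end` and speaker attribute values printed in the loop above. The function name, the sample turns and the `SPEAKER_00`-style labels are hypothetical, and times are assumed to be in seconds.

```python
from collections import defaultdict


def speaking_time_per_speaker(turns):
    """Sum the duration of all turns for each speaker label."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical turns, shaped like the values printed by the loop above.
turns = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.5, 4.0, "SPEAKER_01"),
    (4.5, 6.0, "SPEAKER_00"),
]

for speaker, seconds in sorted(speaking_time_per_speaker(turns).items()):
    print(f"{speaker}: {seconds:.1f} s")
```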

More info at https://medkit.readthedocs.io/

See also: [Simsamu transcription model](https://huggingface.co/medkit/simsamu-transcription)