---
license: apache-2.0
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- 'no'
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
tags:
- audio
- automatic-speech-recognition
base_model: openai/whisper-small
pipeline_tag: automatic-speech-recognition
---
# Whisper-small OpenVINO IR
Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours
of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains **without** the need
for fine-tuning.
Whisper was proposed in the paper [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356)
by Alec Radford et al. from OpenAI. The original code repository can be found [here](https://github.com/openai/whisper).
**Disclaimer**: Parts of this model card have been copied from [this model card](https://huggingface.co/openai/whisper-small).
# Model details
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model.

| Model Type | Parameters | n_audio_ctx | n_audio_state | n_audio_head | n_audio_layer | n_text_ctx | n_text_state | n_text_head | n_text_layer | n_mels | n_vocab |
|---------------------------|------------|-------------|---------------|--------------|---------------|------------|--------------|-------------|--------------|--------|---------|
| whisper-tiny | 39 M | 1500 | 384 | 6 | 4 | 224 | 384 | 6 | 4 | 80 | 51865 |
| whisper-base | 74 M | 1500 | 512 | 8 | 6 | 224 | 512 | 8 | 6 | 80 | 51865 |
| **whisper-small** | 244 M | 1500 | 768 | 12 | 12 | 224 | 768 | 12 | 12 | 80 | 51865 |
| whisper-medium | 769 M | 1500 | 1024 | 16 | 24 | 224 | 1024 | 16 | 24 | 80 | 51865 |
| whisper-large-v1 | 1550 M | 1500 | 1280 | 20 | 32 | 224 | 1280 | 20 | 32 | 80 | 51865 |
| whisper-large-v2 | 1550 M | 1500 | 1280 | 20 | 32 | 224 | 1280 | 20 | 32 | 80 | 51865 |
| distil-whisper-large-v2 | 756 M | 1500 | 1280 | 20 | 32 | 224 | 1280 | 20 | 2 | 80 | 51865 |
| whisper-large-v3 | 1550 M | 1500 | 1280 | 20 | 32 | 224 | 1280 | 20 | 32 | 128 | 51866 |
| distil-whisper-large-v3 | 756 M | 1500 | 1280 | 20 | 32 | 224 | 1280 | 20 | 2 | 128 | 51866 |
| whisper-large-v3-turbo | 809 M | 1500 | 1280 | 20 | 32 | 224 | 1280 | 20 | 4 | 128 | 51866 |
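
To make the columns above more concrete, here is a minimal sketch that relates a few of the whisper-small values to each other. The audio constants (16 kHz sampling, a 160-sample hop, 30-second windows, and a stride-2 convolution in front of the encoder) are not stated in this card; they are assumptions taken from the reference Whisper implementation. The `WhisperDims` class and the helper functions are hypothetical names used only for illustration.

```python
from dataclasses import dataclass

# Assumptions from the reference Whisper implementation, not from this card.
SAMPLE_RATE = 16_000   # audio samples per second
HOP_LENGTH = 160       # samples between consecutive mel frames (10 ms)
CHUNK_SECONDS = 30     # length of one audio window
CONV_STRIDE = 2        # stride of the second encoder convolution


@dataclass(frozen=True)
class WhisperDims:
    """Hypothetical container for a subset of the table columns."""
    n_mels: int
    n_audio_ctx: int
    n_audio_state: int
    n_audio_head: int
    n_audio_layer: int
    n_text_state: int
    n_text_head: int
    n_text_layer: int
    n_vocab: int


# Values copied from the whisper-small row of the table.
WHISPER_SMALL = WhisperDims(
    n_mels=80, n_audio_ctx=1500,
    n_audio_state=768, n_audio_head=12, n_audio_layer=12,
    n_text_state=768, n_text_head=12, n_text_layer=12,
    n_vocab=51865,
)


def mel_frames(seconds: int = CHUNK_SECONDS) -> int:
    """Number of mel spectrogram frames in one audio window."""
    return seconds * SAMPLE_RATE // HOP_LENGTH


def encoder_positions(seconds: int = CHUNK_SECONDS) -> int:
    """Sequence length seen by the encoder after the strided convolution."""
    return mel_frames(seconds) // CONV_STRIDE


def head_dim(state: int, heads: int) -> int:
    """Width of a single attention head."""
    if state % heads != 0:
        raise ValueError("state width must be divisible by the head count")
    return state // heads


if __name__ == "__main__":
    dims = WHISPER_SMALL
    print("encoder input shape:", (dims.n_mels, mel_frames()))
    print("encoder positions:", encoder_positions())
    print("audio head dim:", head_dim(dims.n_audio_state, dims.n_audio_head))
    print("text head dim:", head_dim(dims.n_text_state, dims.n_text_head))
```

Under these assumptions, one 30-second window yields 3000 mel frames, which the strided convolution halves to the 1500 positions listed as `n_audio_ctx`. Dividing `n_audio_state` by `n_audio_head` gives a per-head width of 64 for every row in the table.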