---
library_name: keras-hub
license: mit
language:
- en
tags:
- automatic-speech-recognition
- keras
pipeline_tag: automatic-speech-recognition
---
### Model Overview
⚠️ Whisper is currently only available via the `keras-hub-nightly` package. Use `pip install keras-hub-nightly` to try this model.

A Whisper encoder-decoder network for speech.

This class implements a Transformer-based encoder-decoder model as
described in
["Robust Speech Recognition via Large-Scale Weak Supervision"](https://arxiv.org/abs/2212.04356).
It includes the embedding lookups and transformer layers, but not the head
for predicting the next token.
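Because the backbone stops at the final hidden states, any task head has to be added on top of it. The sketch below is purely illustrative (the small random config, the untied `Dense` head, and the variable names are assumptions, not part of this model's code); it shows one way to project `decoder_sequence_output` onto vocabulary logits, whereas Whisper itself reuses the token embedding matrix for that projection.

```python
import keras
import keras_hub
import numpy as np

# Small, randomly initialized backbone used only for illustration.
backbone = keras_hub.models.WhisperBackbone(
    vocabulary_size=51864,
    num_layers=2,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_encoder_sequence_length=128,
    max_decoder_sequence_length=128,
)

outputs = backbone(
    {
        "encoder_features": np.ones((1, 12, 80), dtype="float32"),
        "decoder_token_ids": np.ones((1, 12), dtype="int32"),
        "decoder_padding_mask": np.ones((1, 12), dtype="int32"),
    }
)

# Project each decoder position onto the vocabulary: shape (1, 12, 51864).
logits_head = keras.layers.Dense(51864)
logits = logits_head(outputs["decoder_sequence_output"])
```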

The default constructor gives a fully customizable, randomly initialized Whisper
model with any number of layers, heads, and embedding dimensions. To load
preset architectures and weights, use the `from_preset()` constructor.

Disclaimer: Pre-trained models are provided on an "as is" basis, without
warranties or conditions of any kind. The underlying model is provided by a
third party and subject to a separate license, available
[here](https://github.com/openai/whisper).

## Links
* Whisper Quickstart Notebook (coming soon)
* [Whisper API Documentation](https://keras.io/keras_hub/api/models/whisper/)
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/)
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/)

## Installation

Keras and KerasHub can be installed with:

```
pip install -U -q keras-hub
pip install -U -q keras
```

JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
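
KerasHub models run on Keras 3, which can execute on any of these backends. A minimal sketch of selecting the backend (the `"jax"` value here is just an example; the environment variable must be set before Keras is imported):

```python
import os

# Choose the Keras 3 backend before importing keras or keras_hub.
# Valid values are "jax", "tensorflow", and "torch".
os.environ["KERAS_BACKEND"] = "jax"

import keras
import keras_hub

print(keras.backend.backend())  # prints the active backend, e.g. "jax"
```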

__Arguments__


- __vocabulary_size__: int. The size of the token vocabulary.
- __num_layers__: int. The number of transformer encoder layers and
    transformer decoder layers.
- __num_heads__: int. The number of attention heads for each transformer.
    The hidden size must be divisible by the number of attention heads.
- __hidden_dim__: int. The size of the transformer hidden states and embedding layers.
- __intermediate_dim__: int. The output dimension of the first Dense layer in
    a two-layer feedforward network for each transformer.
- __num_mels__: int. The number of mel-frequency filters. Defaults to `80`.
- __dropout__: float. Dropout probability for the Transformer encoder.
- __max_encoder_sequence_length__: int. The maximum sequence length that the
    audio encoder can consume. Since the second convolutional layer in
    the encoder reduces the sequence length by half (stride of 2), we
    use `max_encoder_sequence_length // 2` as the sequence length for the
    positional embedding layer (see the sketch after this list).
- __max_decoder_sequence_length__: int. The maximum sequence length that the
    text decoder can consume.
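
To make the halving concrete: the original Whisper checkpoints consume 30 seconds of audio as 3,000 log-mel frames (10 ms hop), so their encoder positional embedding covers 1,500 positions. A minimal sketch of the relationship (the numbers describe the original Whisper setup, not necessarily this preset):

```python
# Original Whisper setup: 30 s of audio -> 3000 log-mel frames (10 ms hop).
max_encoder_sequence_length = 3000

# The stride-2 convolution halves the frame count, so the encoder's
# positional embedding only needs to cover half as many positions.
encoder_positions = max_encoder_sequence_length // 2
assert encoder_positions == 1500
```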

## Example Usage
```python
import keras_hub
import keras
import numpy as np
```



```python
input_data = {
    # Dummy log-mel audio features for the encoder: (batch, frames, num_mels).
    "encoder_features": np.ones(shape=(1, 12, 80), dtype="float32"),
    # Dummy token ids and padding mask for the text decoder.
    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
    "decoder_padding_mask": np.array(
        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
    ),
}

# Randomly initialized Whisper encoder-decoder model with a custom config.
model = keras_hub.models.WhisperBackbone(
    vocabulary_size=51864,
    num_layers=4,
    num_heads=4,
    hidden_dim=256,
    intermediate_dim=512,
    max_encoder_sequence_length=128,
    max_decoder_sequence_length=128,
)
model(input_data)
```
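
Continuing from the snippet above, the backbone returns a dictionary of final hidden states rather than logits (the key names below follow KerasHub's encoder-decoder backbone convention). Note how the 12 encoder input frames are halved by the stride-2 convolution described under `max_encoder_sequence_length`:

```python
outputs = model(input_data)

# Final hidden states from the audio encoder and the text decoder.
print(outputs["encoder_sequence_output"].shape)  # (1, 6, 256)
print(outputs["decoder_sequence_output"].shape)  # (1, 12, 256)
```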

## Example Usage with Hugging Face URI

```python
import keras_hub
import keras
import numpy as np
```



```python
input_data = {
    # Dummy log-mel audio features for the encoder: (batch, frames, num_mels).
    "encoder_features": np.ones(shape=(1, 12, 80), dtype="float32"),
    # Dummy token ids and padding mask for the text decoder.
    "decoder_token_ids": np.ones(shape=(1, 12), dtype="int32"),
    "decoder_padding_mask": np.array(
        [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
    ),
}

# Load a pretrained Whisper backbone directly from a Hugging Face URI.
# The path below is a placeholder; substitute this model's actual repo ID.
model = keras_hub.models.WhisperBackbone.from_preset(
    "hf://<namespace>/<model-name>"
)
model(input_data)
```