# Dolly-V2

Transformer architecture: GPT-NeoX

Based on examples/stablelm

Ref: https://github.com/databrickslabs/dolly

Ref: https://github.com/stability-AI/stableLM/#stablelm-alpha
## Usage
```bash
# get the repo and build it
git clone https://github.com/ggerganov/ggml
cd ggml
mkdir build && cd build
cmake ..
make -j
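# the example binaries end up in ./bin (assumption: standard ggml CMake layout, matching the paths used below)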
# get the Dolly-V2 3B model
git clone https://huggingface.co/databricks/dolly-v2-3b
# install Python dependencies
python3 -m pip install -r ../requirements.txt
# convert model to FP16
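# (the trailing 1 selects FP16 output; 0 would write FP32 - assumption based on
#  the ggml conversion scripts, consistent with "ftype = 1" in the log below)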
python3 ../examples/dolly-v2/convert-h5-to-ggml.py ./dolly-v2-3b/ 1
# run inference using FP16 precision
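# -m model path, -p prompt, -t number of threads, -n number of tokens to generate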
./bin/dollyv2 -m ./dolly-v2-3b/ggml-model-f16.bin -p "State the meaning of life." -t 6 -n 64
main: seed = 1683218142
dollyv2_model_load: loading model from './dolly-v2-3b/ggml-model-f16.bin' - please wait ...
dollyv2_model_load: n_vocab = 50280
dollyv2_model_load: n_ctx = 2048
dollyv2_model_load: n_embd = 2560
dollyv2_model_load: n_head = 32
dollyv2_model_load: n_layer = 32
dollyv2_model_load: n_rot = 20
dollyv2_model_load: ftype = 1
dollyv2_model_load: ggml ctx size = 7374.91 MB
dollyv2_model_load: memory_size = 640.00 MB, n_mem = 65536
dollyv2_model_load: ................................................ done
dollyv2_model_load: model size = 5295.10 MB / num tensors = 388
main: number of tokens in prompt = 32
main: token[0] = 30003, Below
main: token[1] = 310, is
main: token[2] = 271, an
main: token[3] = 9775, instruction
main: token[4] = 326, that
main: token[5] = 8631, describes
main: token[6] = 247, a
main: token[7] = 4836, task
main: token[8] = 964, .
main: token[9] = 19566, Write
main: token[10] = 247, a
main: token[11] = 2380, response
main: token[12] = 326, that
main: token[13] = 20420, appropriately
main: token[14] = 29141, completes
main: token[15] = 253, the
main: token[16] = 2748, request
main: token[17] = 964, .
main: token[18] = 187,
main: token[19] = 187,
main: token[20] = 50278, ### Instruction:
main: token[21] = 187,
main: token[22] = 5443, State
main: token[23] = 253, the
main: token[24] = 4495, meaning
main: token[25] = 273, of
main: token[26] = 1495, life
main: token[27] = 964, .
main: token[28] = 187,
main: token[29] = 187,
main: token[30] = 50279, ### Response:
main: token[31] = 187,
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
State the meaning of life.
### Response:
The meaning of life is to love and be loved.
### End
main: mem per token = 16136720 bytes
main: load time = 2202.58 ms
main: sample time = 2.57 ms
main: predict time = 1497.14 ms / 33.27 ms per token
main: total time = 6187.27 ms
```
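As the token dump shows, the example does not send the `-p` string to the model verbatim: it first wraps it in Dolly's instruction template, and generation stops at the `### End` marker. Below is a minimal Python sketch of that template, reconstructed from the tokens printed above (the exact whitespace is inferred, not taken from the source):

```python
# Dolly-V2 instruction template, reconstructed from the token dump above.
# Tokens 50278 and 50279 are the special "### Instruction:" and "### Response:"
# tokens, and 187 is a newline.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction the same way the dollyv2 example does."""
    return TEMPLATE.format(instruction=instruction)

print(build_prompt("State the meaning of life."))
```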
## 5-bit integer quantization mode
```bash
# quantize the model to 5-bits using Q5_0 quantization
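# other common ggml types (q4_0, q4_1, q5_1, q8_0) should be accepted here as
# well - assumption, not verified; "ftype = 8" in the log below corresponds to q5_0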
./bin/dollyv2-quantize ./dolly-v2-3b/ggml-model-f16.bin ./dolly-v2-3b/ggml-model-q5_0.bin q5_0
# run the quantized model
./bin/dollyv2 -m ./dolly-v2-3b/ggml-model-q5_0.bin -p "State the meaning of life." -t 6 -n 64
main: seed = 1683218518
dollyv2_model_load: loading model from './dolly-v2-3b/ggml-model-q5_0.bin' - please wait ...
dollyv2_model_load: n_vocab = 50280
dollyv2_model_load: n_ctx = 2048
dollyv2_model_load: n_embd = 2560
dollyv2_model_load: n_head = 32
dollyv2_model_load: n_layer = 32
dollyv2_model_load: n_rot = 20
dollyv2_model_load: ftype = 8
dollyv2_model_load: ggml ctx size = 3902.68 MB
dollyv2_model_load: memory_size = 640.00 MB, n_mem = 65536
dollyv2_model_load: ................................................ done
dollyv2_model_load: model size = 1822.87 MB / num tensors = 388
main: number of tokens in prompt = 32
main: token[0] = 30003, Below
main: token[1] = 310, is
main: token[2] = 271, an
main: token[3] = 9775, instruction
main: token[4] = 326, that
main: token[5] = 8631, describes
main: token[6] = 247, a
main: token[7] = 4836, task
main: token[8] = 964, .
main: token[9] = 19566, Write
main: token[10] = 247, a
main: token[11] = 2380, response
main: token[12] = 326, that
main: token[13] = 20420, appropriately
main: token[14] = 29141, completes
main: token[15] = 253, the
main: token[16] = 2748, request
main: token[17] = 964, .
main: token[18] = 187,
main: token[19] = 187,
main: token[20] = 50278, ### Instruction:
main: token[21] = 187,
main: token[22] = 5443, State
main: token[23] = 253, the
main: token[24] = 4495, meaning
main: token[25] = 273, of
main: token[26] = 1495, life
main: token[27] = 964, .
main: token[28] = 187,
main: token[29] = 187,
main: token[30] = 50279, ### Response:
main: token[31] = 187,
Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
State the meaning of life.
### Response:
The meaning of life is the discovery of the true self.
### End
main: mem per token = 16127760 bytes
main: load time = 1011.09 ms
main: sample time = 2.79 ms
main: predict time = 1271.62 ms / 27.64 ms per token
main: total time = 2802.51 ms
```
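Quantizing to Q5_0 shrinks the model from 5295.10 MB to 1822.87 MB and roughly halves the load time in the runs above, typically at some cost in generation quality.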
## Notes
- No guarantees for correctness
- The tokenizer is currently a quick hack and probably works only for English
- Models that use the non-parallel residual configuration are not supported
- Contributions and improvements are welcome