Update README.md
README.md
CHANGED
@@ -1,223 +1,11 @@

## install

Dependencies:

- [pytorch](https://pytorch.org) <3
- [numpy](https://numpy.org/install/) <3
- `pip install transformers` for huggingface transformers <3 (to load GPT-2 checkpoints)
- `pip install datasets` for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
- `pip install tiktoken` for OpenAI's fast BPE code <3
- `pip install wandb` for optional logging <3
- `pip install tqdm` <3

## quick start

If you are not a deep learning professional and you just want to feel the magic and get your feet wet, the fastest way to get started is to train a character-level GPT on the works of Shakespeare. First, we download it as a single (1MB) file and turn it from raw text into one large stream of integers:

```
$ python data/shakespeare_char/prepare.py
```
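
If you want to sanity-check what that produced, here is a minimal sketch (it assumes the token ids are written as uint16 and that a `meta.pkl` with the character vocab sits next to them, which is what the prepare script does at the time of writing):

```python
import pickle
import numpy as np

data_dir = "data/shakespeare_char"
# the prepare script writes the token ids as raw uint16 values
ids = np.memmap(f"{data_dir}/train.bin", dtype=np.uint16, mode="r")
with open(f"{data_dir}/meta.pkl", "rb") as f:
    meta = pickle.load(f)  # holds the character-level vocab (itos/stoi)

print(f"{len(ids):,} tokens in train.bin, vocab size {meta['vocab_size']}")
print("".join(meta["itos"][int(i)] for i in ids[:250]))  # decode the first few hundred characters
```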

This creates a `train.bin` and `val.bin` in that data directory. Now it is time to train your GPT. The size of it very much depends on the computational resources of your system:

**I have a GPU**. Great, we can quickly train a baby GPT with the settings provided in the [config/train_shakespeare_char.py](config/train_shakespeare_char.py) config file:

```
$ python train.py config/train_shakespeare_char.py
```
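
For orientation, the handful of settings called out in the next paragraph look roughly like this inside that config file (a sketch of a few values, not the full file):

```python
# config/train_shakespeare_char.py, sketched; the real file sets more options
out_dir = "out-shakespeare-char"  # where checkpoints get written
block_size = 256                  # context of up to 256 characters
n_layer = 6                       # 6-layer Transformer
n_head = 6                        # 6 attention heads per layer
n_embd = 384                      # 384 feature channels
```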

If you peek inside it, you'll see that we're training a GPT with a context size of up to 256 characters and 384 feature channels, and that it is a 6-layer Transformer with 6 heads in each layer. On one A100 GPU this training run takes about 3 minutes and the best validation loss is 1.4697. Based on the configuration, the model checkpoints are being written into the `--out_dir` directory `out-shakespeare-char`. So once the training finishes we can sample from the best model by pointing the sampling script at this directory:

```
$ python sample.py --out_dir=out-shakespeare-char
```

This generates a few samples, for example:

```
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

DUKE VINCENTIO:
I thank your eyes against it.

DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?

DUKE VINCENTIO:
If you have done evils of all disposition
To end his power, the day of thrust for a common men
That I leave, to fight with over-liking
Hasting in a roseman.
```

lol `¯\_(ツ)_/¯`. Not bad for a character-level model after 3 minutes of training on a GPU. Better results are quite likely obtainable by instead finetuning a pretrained GPT-2 model on this dataset (see the finetuning section later).

**I only have a macbook** (or other cheap computer). No worries, we can still train a GPT but we want to dial things down a notch. I recommend getting the bleeding edge PyTorch nightly ([select it here](https://pytorch.org/get-started/locally/) when installing) as it is currently quite likely to make your code more efficient. But even without it, a simple train run could look as follows:

```
$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0
```

Here, since we are running on CPU instead of GPU we must set both `--device=cpu` and also turn off PyTorch 2.0 compile with `--compile=False`. Then when we evaluate we get a noisier but faster estimate (`--eval_iters=20`, down from 200), our context size is only 64 characters instead of 256, and the batch size is only 12 examples per iteration, not 64. We'll also use a much smaller Transformer (4 layers, 4 heads, 128 embedding size), and decrease the number of iterations to 2000 (and correspondingly usually decay the learning rate to around max_iters with `--lr_decay_iters`). Because our network is so small we also ease down on regularization (`--dropout=0.0`). This still runs in about 3 minutes, but gets us to a loss of only 1.88 and therefore also worse samples; it's still good fun though:

```
$ python sample.py --out_dir=out-shakespeare-char --device=cpu
```

Generates samples like this:

```
GLEORKEN VINGHARD III:
Whell's the couse, the came light gacks,
And the for mought you in Aut fries the not high shee
bot thou the sought bechive in that to doth groan you,
No relving thee post mose the wear
```

Not bad for ~3 minutes on a CPU, for a hint of the right character gestalt. If you're willing to wait longer, feel free to tune the hyperparameters, increase the size of the network, the context length (`--block_size`), the length of training, etc.
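
As an aside, all of those `--key=value` flags work because `train.py` lets config files and the command line override its default globals. A minimal sketch of the idea (not the exact code in the repo):

```python
import sys
from ast import literal_eval

# defaults, normally defined at the top of train.py
device = "cuda"
batch_size = 64

for arg in sys.argv[1:]:
    if not arg.startswith("--"):
        exec(open(arg).read())            # a config file is just Python that reassigns globals
    else:
        key, val = arg[2:].split("=", 1)  # e.g. --batch_size=12
        try:
            val = literal_eval(val)       # turn "12" into 12, "False" into False, ...
        except (SyntaxError, ValueError):
            pass                          # keep it as a string (e.g. --device=cpu)
        assert key in globals(), f"unknown config key: {key}"
        globals()[key] = val

print(device, batch_size)
```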

Finally, on Apple Silicon Macbooks and with a recent PyTorch version make sure to add `--device=mps` (short for "Metal Performance Shaders"); PyTorch then uses the on-chip GPU that can *significantly* accelerate training (2-3X) and allow you to use larger networks. See [Issue 28](https://github.com/karpathy/nanoGPT/issues/28) for more.
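
If you're not sure whether your PyTorch build can actually use `mps`, a quick check (an aside, not code from the repo):

```python
import torch

# True only if PyTorch was built with MPS support and the Apple Metal stack is available
print(torch.backends.mps.is_available())
```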

## reproducing GPT-2

A more serious deep learning professional may be more interested in reproducing GPT-2 results. So here we go - we first tokenize the dataset, in this case [OpenWebText](https://openwebtext2.readthedocs.io/en/latest/), an open reproduction of OpenAI's (private) WebText:

```
$ python data/openwebtext/prepare.py
```

This downloads and tokenizes the [OpenWebText](https://huggingface.co/datasets/openwebtext) dataset. It will create a `train.bin` and `val.bin` which hold the GPT-2 BPE token ids in one sequence, stored as raw uint16 bytes. Then we're ready to kick off training. To reproduce GPT-2 (124M) you'll want at least an 8X A100 40GB node and run:

```
$ torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
```

This will run for about 4 days using PyTorch Distributed Data Parallel (DDP) and go down to a loss of ~2.85. Now, a GPT-2 model just evaluated on OWT gets a val loss of about 3.11, but if you finetune it, it will come down to ~2.85 territory (due to an apparent domain gap), making the two models ~match.
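
Those uint16 ids use the GPT-2 BPE vocabulary, so you can round-trip them with `tiktoken` whenever you want to inspect the data (a quick sketch):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")             # the GPT-2 BPE tokenizer
ids = enc.encode_ordinary("Hello OpenWebText")  # encode without special-token handling
print(ids)                                      # a short list of integer token ids
print(enc.decode(ids))                          # -> "Hello OpenWebText"
```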

If you're in a cluster environment and you are blessed with multiple GPU nodes you can make GPU go brrrr e.g. across 2 nodes like:

```
Run on the first (master) node with example IP 123.456.123.456:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
Run on the worker node:
$ torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
```

It is a good idea to benchmark your interconnect (e.g. with iperf3). In particular, if you don't have Infiniband then also prepend `NCCL_IB_DISABLE=1` to the above launches. Your multinode training will work, but most likely _crawl_. By default checkpoints are periodically written to the `--out_dir`. We can sample from the model by simply running `$ python sample.py`.

Finally, to train on a single GPU simply run the `$ python train.py` script. Have a look at all of its args; the script tries to be very readable, hackable and transparent. You'll most likely want to tune a number of those variables depending on your needs.

## baselines

OpenAI GPT-2 checkpoints allow us to get some baselines in place for openwebtext. We can get the numbers as follows:

```
$ python train.py eval_gpt2
$ python train.py eval_gpt2_medium
$ python train.py eval_gpt2_large
$ python train.py eval_gpt2_xl
```

and observe the following losses on train and val:

| model | params | train loss | val loss |
| ------| ------ | ---------- | -------- |
| gpt2 | 124M | 3.11 | 3.12 |
| gpt2-medium | 350M | 2.85 | 2.84 |
| gpt2-large | 774M | 2.66 | 2.67 |
| gpt2-xl | 1558M | 2.56 | 2.54 |

However, we have to note that GPT-2 was trained on (closed, never released) WebText, while OpenWebText is just a best-effort open reproduction of this dataset. This means there is a dataset domain gap. Indeed, taking the GPT-2 (124M) checkpoint and finetuning on OWT directly for a while brings the loss down to ~2.85. This then becomes the more appropriate baseline w.r.t. reproduction.
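
For reference, those `eval_gpt2*` configs are tiny; conceptually they just tell `train.py` to load an OpenAI checkpoint and run evaluation only, roughly along these lines (a sketch with assumed field names; check the actual files in `config/`):

```python
# sketch of an eval-style config (field names assumed; see config/ for the real ones)
init_from = "gpt2"  # load the pretrained 124M OpenAI checkpoint
eval_only = True    # run the evaluation loop once and exit, no training
eval_iters = 500    # average the loss estimate over more batches
batch_size = 8
```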

## finetuning

Finetuning is no different from training; we just make sure to initialize from a pretrained model and train with a smaller learning rate. For an example of how to finetune a GPT on new text, go to `data/shakespeare` and run `prepare.py` to download the tiny shakespeare dataset and render it into a `train.bin` and `val.bin`, using the OpenAI BPE tokenizer from GPT-2. Unlike OpenWebText this will run in seconds. Finetuning can take very little time, e.g. on a single GPU just a few minutes. Run an example finetuning like:

```
$ python train.py config/finetune_shakespeare.py
```

This will load the config parameter overrides in `config/finetune_shakespeare.py` (I didn't tune them much though). Basically, we initialize from a GPT-2 checkpoint with `init_from` and train as normal, except shorter and with a small learning rate. If you're running out of memory try decreasing the model size (the options are `{'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}`) or possibly decreasing the `block_size` (context length). The best checkpoint (lowest validation loss) will be in the `out_dir` directory, e.g. in `out-shakespeare` by default, per the config file. You can then sample with `$ python sample.py --out_dir=out-shakespeare`:

```
THEODORE:
Thou shalt sell me to the highest bidder: if I die,
I sell thee to the first; if I go mad,
I sell thee to the second; if I
lie, I sell thee to the third; if I slay,
I sell thee to the fourth: so buy or sell,
I tell thee again, thou shalt not sell my
possession.

JULIET:
And if thou steal, thou shalt not sell thyself.

THEODORE:
I do not steal; I sell the stolen goods.

THEODORE:
Thou know'st not what thou sell'st; thou, a woman,
Thou art ever a victim, a thing of no worth:
Thou hast no right, no right, but to be sold.
```

Whoa there, GPT, entering some dark place over there. I didn't really tune the hyperparameters in the config too much; feel free to try!
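
For reference, the essential knobs such a finetuning config turns are roughly these (a sketch with illustrative values, not a copy of `config/finetune_shakespeare.py`):

```python
# sketch of a finetuning config (illustrative values, not the actual file)
out_dir = "out-shakespeare"  # where the finetuned checkpoints land
init_from = "gpt2"           # start from an OpenAI checkpoint ('gpt2-medium', ... also work)
dataset = "shakespeare"      # points data loading at data/shakespeare/{train,val}.bin
learning_rate = 3e-5         # much smaller than when training from scratch
max_iters = 2000             # a short run is enough to adapt the model
```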

## sampling / inference

Use the script `sample.py` to sample either from pre-trained GPT-2 models released by OpenAI, or from a model you trained yourself. For example, here is a way to sample from the largest available `gpt2-xl` model:

```
$ python sample.py \
    --init_from=gpt2-xl \
    --start="What is the answer to life, the universe, and everything?" \
    --num_samples=5 --max_new_tokens=100
```

If you'd like to sample from a model you trained, use `--out_dir` to point the script at the right checkpoint directory. You can also prompt the model with some text from a file, e.g. `$ python sample.py --start=FILE:prompt.txt`.
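
If you'd rather sample from Python than from the shell, a minimal sketch using the `GPT` class in `model.py` is below (it assumes the `from_pretrained` helper and the rough signature of `generate` as they appear in that file; double-check there before relying on it):

```python
import torch
import tiktoken
from model import GPT  # nanoGPT's model.py

device = "cuda" if torch.cuda.is_available() else "cpu"
model = GPT.from_pretrained("gpt2")  # pulls the OpenAI weights via huggingface transformers
model.to(device).eval()

enc = tiktoken.get_encoding("gpt2")
prompt = "What is the answer to life, the universe, and everything?"
idx = torch.tensor([enc.encode_ordinary(prompt)], dtype=torch.long, device=device)

with torch.no_grad():
    out = model.generate(idx, max_new_tokens=100, temperature=0.8, top_k=200)
print(enc.decode(out[0].tolist()))
```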

## efficiency notes

For simple model benchmarking and profiling, `bench.py` might be useful. It's identical to what happens in the meat of the training loop of `train.py`, but omits much of the other complexities.

Note that the code by default uses [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). At the time of writing (Dec 29, 2022) this makes `torch.compile()` available in the nightly release. The improvement from the one line of code is noticeable, e.g. cutting down iteration time from ~250ms/iter to 135ms/iter. Nice work PyTorch team!
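
That one line is essentially just wrapping the model (sketched here with a stand-in module; in `train.py` it is guarded by the `compile` config flag):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)     # stand-in for the GPT model that train.py builds
model = torch.compile(model)  # the one line; requires PyTorch 2.0 (see troubleshooting below)
```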

## todos

- Investigate and add FSDP instead of DDP
- Eval zero-shot perplexities on standard evals (e.g. LAMBADA? HELM? etc.)
- Finetune the finetuning script, I think the hyperparams are not great
- Schedule for linear batch size increase during training
- Incorporate other embeddings (rotary, alibi)
- Separate out the optim buffers from model params in checkpoints I think
- Additional logging around network health (e.g. gradient clip events, magnitudes)
- Few more investigations around better init etc.

## troubleshooting

Note that by default this repo uses PyTorch 2.0 (i.e. `torch.compile`). This is fairly new and experimental, and not yet available on all platforms (e.g. Windows). If you're running into related error messages try to disable it by adding the `--compile=False` flag. This will slow down the code but at least it will run.

For some context on this repository, GPT, and language modeling it might be helpful to watch my [Zero To Hero series](https://karpathy.ai/zero-to-hero.html). Specifically, the [GPT video](https://www.youtube.com/watch?v=kCc8FmEb1nY) is popular if you have some prior language modeling context.

For more questions/discussions feel free to stop by **#nanoGPT** on Discord:

[nanoGPT Discord](https://discord.gg/3zy8kqD9Cp)

## acknowledgements

All nanoGPT experiments are powered by GPUs on [Lambda labs](https://lambdalabs.com), my favorite Cloud GPU provider. Thank you Lambda labs for sponsoring nanoGPT!

---
title: nanoGPT
emoji: ;)
colorFrom: gray
colorTo: blue
sdk: gradio
sdk_version: 3.27.0
app_file: app.py
pinned: false
license: cc-by-nc-4.0
---