torinriley committed
Commit d5117be · Parent: 0c79708

readme fix

Files changed (2)
  1. DOCS.md +230 -0
  2. README.md +8 -230
DOCS.md ADDED
@@ -0,0 +1,230 @@
# nanoGPT

![nanoGPT](assets/nanogpt.jpg)

The simplest, fastest repository for training/finetuning medium-sized GPTs. It is a rewrite of [minGPT](https://github.com/karpathy/minGPT) that prioritizes teeth over education. Still under active development, but currently the file `train.py` reproduces GPT-2 (124M) on OpenWebText, running on a single 8XA100 40GB node in about 4 days of training. The code itself is plain and readable: `train.py` is a ~300-line boilerplate training loop and `model.py` a ~300-line GPT model definition, which can optionally load the GPT-2 weights from OpenAI. That's it.

![repro124m](assets/gpt2_124M_loss.png)

Because the code is so simple, it is very easy to hack to your needs, train new models from scratch, or finetune pretrained checkpoints (e.g. the biggest one currently available as a starting point would be the GPT-2 1.5B model from OpenAI).

## install

```
pip install torch numpy transformers datasets tiktoken wandb tqdm
```

Dependencies:

- [pytorch](https://pytorch.org) <3
- [numpy](https://numpy.org/install/) <3
- `transformers` for huggingface transformers <3 (to load GPT-2 checkpoints)
- `datasets` for huggingface datasets <3 (if you want to download + preprocess OpenWebText)
- `tiktoken` for OpenAI's fast BPE code <3
- `wandb` for optional logging <3
- `tqdm` for progress bars <3

## quick start

If you are not a deep learning professional and you just want to feel the magic and get your feet wet, the fastest way to get started is to train a character-level GPT on the works of Shakespeare. First, we download it as a single (1MB) file and turn it from raw text into one large stream of integers:

```sh
python data/shakespeare_char/prepare.py
```

This creates a `train.bin` and `val.bin` in that data directory. Now it is time to train your GPT. Its size very much depends on the computational resources of your system:

**I have a GPU**. Great, we can quickly train a baby GPT with the settings provided in the [config/train_shakespeare_char.py](config/train_shakespeare_char.py) config file:

```sh
python train.py config/train_shakespeare_char.py
```
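
For orientation, here is a hedged sketch of the kind of overrides such a config file sets (nanoGPT configs are plain Python files; the names below mirror the CLI flags used elsewhere in this document, and any value not stated in the next paragraph is an illustrative placeholder, not the actual file contents):

```python
# Illustrative sketch only -- not the actual config/train_shakespeare_char.py.
out_dir = 'out-shakespeare-char'  # checkpoints are written here (see next paragraph)
block_size = 256                  # context of up to 256 characters
batch_size = 64                   # examples per iteration (referenced in the CPU run below)
n_layer = 6                       # 6-layer Transformer
n_head = 6                        # 6 attention heads per layer
n_embd = 384                      # 384 feature channels
eval_iters = 200                  # evaluation batches (referenced in the CPU run below)
dropout = 0.2                     # placeholder value
max_iters = 5000                  # placeholder value
lr_decay_iters = 5000             # placeholder: usually set to roughly max_iters
```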

If you peek inside the actual config file, you'll see that we're training a GPT with a context size of up to 256 characters, 384 feature channels, and 6 Transformer layers with 6 attention heads each. On one A100 GPU this training run takes about 3 minutes and the best validation loss is 1.4697. Based on the configuration, the model checkpoints are written into the `--out_dir` directory `out-shakespeare-char`. So once the training finishes we can sample from the best model by pointing the sampling script at this directory:

```sh
python sample.py --out_dir=out-shakespeare-char
```

This generates a few samples, for example:

```
ANGELO:
And cowards it be strawn to my bed,
And thrust the gates of my threats,
Because he that ale away, and hang'd
An one with him.

DUKE VINCENTIO:
I thank your eyes against it.

DUKE VINCENTIO:
Then will answer him to save the malm:
And what have you tyrannous shall do this?

DUKE VINCENTIO:
If you have done evils of all disposition
To end his power, the day of thrust for a common men
That I leave, to fight with over-liking
Hasting in a roseman.
```

lol `¯\_(ツ)_/¯`. Not bad for a character-level model after 3 minutes of training on a GPU. Better results are quite likely obtainable by instead finetuning a pretrained GPT-2 model on this dataset (see the finetuning section later).

**I only have a macbook** (or other cheap computer). No worries, we can still train a GPT, but we want to dial things down a notch. I recommend getting the bleeding-edge PyTorch nightly ([select it here](https://pytorch.org/get-started/locally/) when installing), as it is currently quite likely to make your code more efficient. But even without it, a simple training run could look as follows:

```sh
python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0
```

Here, since we are running on CPU instead of GPU, we must set both `--device=cpu` and turn off PyTorch 2.0 compile with `--compile=False`. Then when we evaluate we get a somewhat noisier but faster estimate (`--eval_iters=20`, down from 200), our context size is only 64 characters instead of 256, and the batch size is only 12 examples per iteration, not 64. We'll also use a much smaller Transformer (4 layers, 4 heads, 128 embedding size), and decrease the number of iterations to 2000 (and correspondingly decay the learning rate over roughly the same horizon with `--lr_decay_iters`). Because our network is so small, we also ease down on regularization (`--dropout=0.0`). This still runs in about 3 minutes, but gets us a loss of only 1.88 and therefore also worse samples, but it's still good fun:

```sh
python sample.py --out_dir=out-shakespeare-char --device=cpu
```

This generates samples like:

```
GLEORKEN VINGHARD III:
Whell's the couse, the came light gacks,
And the for mought you in Aut fries the not high shee
bot thou the sought bechive in that to doth groan you,
No relving thee post mose the wear
```

Not bad for ~3 minutes on a CPU, for a hint of the right character gestalt. If you're willing to wait longer, feel free to tune the hyperparameters, increase the size of the network, the context length (`--block_size`), the length of training, and so on.
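
For example, a slightly more ambitious (and slower) CPU run might look like this; the specific values are illustrative, not tuned recommendations:

```sh
# Illustrative only: a bigger character-level model and a longer schedule on CPU.
python train.py config/train_shakespeare_char.py --device=cpu --compile=False \
    --eval_iters=20 --log_interval=1 --block_size=128 --batch_size=12 \
    --n_layer=6 --n_head=6 --n_embd=256 --max_iters=5000 --lr_decay_iters=5000 --dropout=0.1
```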

Finally, on Apple Silicon MacBooks with a recent PyTorch version, make sure to add `--device=mps` (short for "Metal Performance Shaders"); PyTorch then uses the on-chip GPU, which can *significantly* accelerate training (2-3X) and allow you to use larger networks. See [Issue 28](https://github.com/karpathy/nanoGPT/issues/28) for more.
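
If you're unsure which `--device` value applies on your machine, a quick check with standard PyTorch calls looks like this (a minimal sketch, nothing nanoGPT-specific):

```python
import torch

# Pick the best available device to pass via the --device flag.
if torch.cuda.is_available():
    device = "cuda"   # NVIDIA GPU
elif torch.backends.mps.is_available():
    device = "mps"    # Apple Silicon GPU (Metal Performance Shaders)
else:
    device = "cpu"    # fallback; pair with --compile=False as shown above
print(f"suggested flag: --device={device}")
```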

## reproducing GPT-2

A more serious deep learning professional may be more interested in reproducing GPT-2 results. So here we go - we first tokenize the dataset, in this case [OpenWebText](https://openwebtext2.readthedocs.io/en/latest/), an open reproduction of OpenAI's (private) WebText:

```sh
python data/openwebtext/prepare.py
```

This downloads and tokenizes the [OpenWebText](https://huggingface.co/datasets/openwebtext) dataset. It will create a `train.bin` and `val.bin` which hold the GPT-2 BPE token ids in one long sequence, stored as raw uint16 values. Then we're ready to kick off training. To reproduce GPT-2 (124M) you'll want at least an 8X A100 40GB node, and run:

```sh
torchrun --standalone --nproc_per_node=8 train.py config/train_gpt2.py
```

This will run for about 4 days using PyTorch Distributed Data Parallel (DDP) and go down to a loss of ~2.85. Now, a GPT-2 model just evaluated on OWT gets a val loss of about 3.11, but if you finetune it, it will come down to ~2.85 territory (due to an apparent domain gap), making the two models roughly match.
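
As an aside, before committing to a multi-day run it can be reassuring to sanity-check the tokenized data. A minimal sketch, assuming the `.bin` files live in the data directory used above and are flat arrays of uint16 GPT-2 token ids as just described (decoding additionally assumes `tiktoken`'s `gpt2` encoding):

```python
import numpy as np
import tiktoken

# train.bin / val.bin: one long sequence of GPT-2 BPE token ids, stored as uint16.
data = np.memmap("data/openwebtext/val.bin", dtype=np.uint16, mode="r")
print(f"{len(data):,} tokens")

# Decode a small slice back to text to eyeball that tokenization looks sane.
enc = tiktoken.get_encoding("gpt2")
print(enc.decode(data[:256].tolist()))
```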

If you're in a cluster environment and you are blessed with multiple GPU nodes, you can make GPU go brrrr, e.g. across 2 nodes like:

```sh
# Run on the first (master) node with example IP 123.456.123.456:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
# Run on the worker node:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=123.456.123.456 --master_port=1234 train.py
```

It is a good idea to benchmark your interconnect (e.g. with iperf3). In particular, if you don't have Infiniband, then also prepend `NCCL_IB_DISABLE=1` to the above launches. Your multinode training will work, but most likely _crawl_. By default, checkpoints are periodically written to the `--out_dir`. We can sample from the model by simply running `python sample.py`.
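
For example, the master-node launch above without Infiniband would become (the worker launch is prefixed the same way):

```sh
# No Infiniband: disable NCCL's IB transport so it falls back to regular TCP.
NCCL_IB_DISABLE=1 torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=123.456.123.456 --master_port=1234 train.py
```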

Finally, to train on a single GPU, simply run the `python train.py` script. Have a look at all of its args; the script tries to be very readable, hackable, and transparent. You'll most likely want to tune a number of those variables depending on your needs.
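
For instance, an illustrative single-GPU invocation with a couple of overrides (flag names as used throughout this document; the values are not tuned recommendations):

```sh
# Single GPU: no torchrun needed; override a few config values directly on the CLI.
# Smaller batch and context are shown only as an illustration of the override mechanism.
python train.py config/train_gpt2.py --batch_size=8 --block_size=512
```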

## baselines

OpenAI GPT-2 checkpoints allow us to get some baselines in place for OpenWebText. We can get the numbers as follows:

```sh
$ python train.py config/eval_gpt2.py
$ python train.py config/eval_gpt2_medium.py
$ python train.py config/eval_gpt2_large.py
$ python train.py config/eval_gpt2_xl.py
```

and observe the following losses on train and val:

| model | params | train loss | val loss |
| ----------- | ------ | ---------- | -------- |
| gpt2 | 124M | 3.11 | 3.12 |
| gpt2-medium | 350M | 2.85 | 2.84 |
| gpt2-large | 774M | 2.66 | 2.67 |
| gpt2-xl | 1558M | 2.56 | 2.54 |

However, we have to note that GPT-2 was trained on (closed, never released) WebText, while OpenWebText is just a best-effort open reproduction of this dataset. This means there is a dataset domain gap. Indeed, taking the GPT-2 (124M) checkpoint and finetuning on OWT directly for a while brings the loss down to ~2.85. This then becomes the more appropriate baseline w.r.t. reproduction.

## finetuning

Finetuning is no different from training; we just make sure to initialize from a pretrained model and train with a smaller learning rate. For an example of how to finetune a GPT on new text, go to `data/shakespeare` and run `prepare.py` to download the tiny Shakespeare dataset and render it into a `train.bin` and `val.bin`, using the OpenAI BPE tokenizer from GPT-2. Unlike OpenWebText, this will run in seconds. Finetuning can take very little time, e.g. on a single GPU just a few minutes. Run an example finetune like:

```sh
python train.py config/finetune_shakespeare.py
```

This will load the config parameter overrides in `config/finetune_shakespeare.py` (I didn't tune them much though). Basically, we initialize from a GPT-2 checkpoint with `init_from` and train as normal, except shorter and with a small learning rate. If you're running out of memory, try decreasing the model size (the options are `{'gpt2', 'gpt2-medium', 'gpt2-large', 'gpt2-xl'}`) or possibly decreasing the `block_size` (context length). The best checkpoint (lowest validation loss) will be in the `out_dir` directory, e.g. in `out-shakespeare` by default, per the config file. You can then sample from it with `python sample.py --out_dir=out-shakespeare`:

```
THEODORE:
Thou shalt sell me to the highest bidder: if I die,
I sell thee to the first; if I go mad,
I sell thee to the second; if I
lie, I sell thee to the third; if I slay,
I sell thee to the fourth: so buy or sell,
I tell thee again, thou shalt not sell my
possession.

JULIET:
And if thou steal, thou shalt not sell thyself.

THEODORE:
I do not steal; I sell the stolen goods.

THEODORE:
Thou know'st not what thou sell'st; thou, a woman,
Thou art ever a victim, a thing of no worth:
Thou hast no right, no right, but to be sold.
```

Whoa there, GPT, entering some dark place over there. I didn't really tune the hyperparameters in the config too much; feel free to try!
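
If you do want to experiment, the kinds of parameters such a finetuning config overrides look roughly like this (a hedged sketch, not the actual `config/finetune_shakespeare.py`; the names follow the flags and variables mentioned above, and the values are placeholders):

```python
# Illustrative sketch only -- not the actual config/finetune_shakespeare.py.
out_dir = 'out-shakespeare'   # per the paragraph above, the best checkpoint lands here
init_from = 'gpt2'            # start from an OpenAI GPT-2 checkpoint ('gpt2-medium', ... also work)
learning_rate = 3e-5          # assumed variable name; the point is simply "smaller than for pretraining"
max_iters = 200               # placeholder: finetuning runs are short
lr_decay_iters = 200          # placeholder: decay over roughly the same horizon
block_size = 512              # placeholder: reduce this if you run out of memory
dropout = 0.1                 # placeholder value
```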
187
+
188
+ ## sampling / inference
189
+
190
+ Use the script `sample.py` to sample either from pre-trained GPT-2 models released by OpenAI, or from a model you trained yourself. For example, here is a way to sample from the largest available `gpt2-xl` model:
191
+
192
+ ```sh
193
+ python sample.py \
194
+ --init_from=gpt2-xl \
195
+ --start="What is the answer to life, the universe, and everything?" \
196
+ --num_samples=5 --max_new_tokens=100
197
+ ```
198
+
199
+ If you'd like to sample from a model you trained, use the `--out_dir` to point the code appropriately. You can also prompt the model with some text from a file, e.g. ```python sample.py --start=FILE:prompt.txt```.
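
For instance, to sample from the character-level model trained in the quick start, with a short prompt (the flag values here are just an illustration):

```sh
# Sample from your own checkpoint directory instead of an OpenAI model.
# Add --device=cpu if you trained on CPU, as in the quick start above.
python sample.py --out_dir=out-shakespeare-char \
    --start="ROMEO:" --num_samples=3 --max_new_tokens=200
```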

## efficiency notes

For simple model benchmarking and profiling, `bench.py` might be useful. It's identical to what happens in the meat of the training loop of `train.py`, but omits much of the other complexity.

Note that the code by default uses [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/). At the time of writing (Dec 29, 2022) this makes `torch.compile()` available in the nightly release. The improvement from that one line of code is noticeable, e.g. cutting iteration time from ~250 ms/iter down to ~135 ms/iter. Nice work, PyTorch team!
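
That one line is the standard PyTorch 2.0 usage, roughly as below (a minimal sketch; the small `nn.Sequential` stands in for the GPT defined in `model.py`):

```python
import torch
import torch.nn as nn

# Stand-in module; in nanoGPT this would be the GPT built in model.py.
model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))

# torch.compile (PyTorch 2.0+) returns an optimized module with the same interface.
model = torch.compile(model)

x = torch.randn(4, 128)
y = model(x)  # the first call triggers compilation; later calls run the optimized code
```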

## todos

- Investigate and add FSDP instead of DDP
- Eval zero-shot perplexities on standard evals (e.g. LAMBADA? HELM? etc.)
- Finetune the finetuning script; I think the hyperparams are not great
- Schedule for linear batch size increase during training
- Incorporate other embeddings (rotary, alibi)
- Separate out the optim buffers from model params in checkpoints, I think
- Additional logging around network health (e.g. gradient clip events, magnitudes)
- A few more investigations around better init, etc.

## troubleshooting

Note that by default this repo uses PyTorch 2.0 (i.e. `torch.compile`). This is fairly new and experimental, and not yet available on all platforms (e.g. Windows). If you're running into related error messages, try disabling it by adding the `--compile=False` flag. This will slow down the code, but at least it will run.

For some context on this repository, GPT, and language modeling, it might be helpful to watch my [Zero To Hero series](https://karpathy.ai/zero-to-hero.html). Specifically, the [GPT video](https://www.youtube.com/watch?v=kCc8FmEb1nY) is popular if you have some prior language modeling context.

For more questions/discussions feel free to stop by **#nanoGPT** on Discord:

[![](https://dcbadge.vercel.app/api/server/3zy8kqD9Cp?compact=true&style=flat)](https://discord.gg/3zy8kqD9Cp)

## acknowledgements

All nanoGPT experiments are powered by GPUs on [Lambda Labs](https://lambdalabs.com), my favorite cloud GPU provider. Thank you Lambda Labs for sponsoring nanoGPT!
README.md CHANGED
@@ -1,230 +1,8 @@
(230 lines removed: the previous README content, identical to the new DOCS.md above.)
 
---
title: ARC125m
emoji: 🌖
colorFrom: indigo
colorTo: gray
sdk: docker
pinned: false
---