Updated the README based on our current strategy
README.md CHANGED

```diff
@@ -1,5 +1,22 @@
 ## DALL-E Mini - Generate image from text
 
+## Tentative Strategy of training (proposed by Luke and Suraj)
+
+### Data:
+* [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m) Dataset (already loaded and preprocessed in TPU VM by Luke).
+* [YFCC100M Subset](https://github.com/openai/CLIP/blob/main/data/yfcc100m.md)
+* [Conceptual Captions 3M](https://github.com/google-research-datasets/conceptual-captions)
+
+### Architecture:
+* Use the Taming Transformers VQ-GAN (with a 16384-token codebook).
+* Use a seq2seq (language encoder --> image decoder) model with a pretrained non-autoregressive encoder (e.g. BERT) and an autoregressive decoder (like GPT).
+
+### Remaining Architecture Questions:
+* Whether to freeze the text encoder?
+* Whether to fine-tune the VQ-GAN?
+* Which text encoder to use (e.g. BERT, RoBERTa, etc.)?
+* Hyperparameter choices for the decoder (e.g. positional embedding, initialization, etc.)
+
 ## TODO
 
 * experiment with flax/jax and set up the TPU instance that we should get shortly
```
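
To make the proposed architecture concrete: with the f16 variant of the Taming Transformers VQ-GAN and its 16384-entry codebook, a 256×256 image quantizes to a 16×16 grid, i.e. a sequence of 256 discrete codes, so the decoder is just an autoregressive model over those codes that cross-attends to the text encoder's hidden states. Below is a minimal Flax sketch of that shape — not the project's actual code; the dimensions, class names, and the random stand-ins for BERT outputs and VQ-GAN codes are all illustrative assumptions.

```python
# Minimal sketch of the proposed seq2seq setup in Flax (illustrative only):
# a causal decoder over VQ-GAN token ids, cross-attending to precomputed
# text-encoder states (e.g. BERT's last hidden states).
import jax
import jax.numpy as jnp
import flax.linen as nn

VQGAN_CODES = 16384   # VQ-GAN codebook size from the strategy above
IMAGE_TOKENS = 256    # f16 VQ-GAN on 256x256 images -> 16x16 = 256 tokens (assumption)
TEXT_DIM = 768        # e.g. BERT-base hidden size (assumption)


class DecoderBlock(nn.Module):
    dim: int
    heads: int

    @nn.compact
    def __call__(self, x, text_states):
        # Causal self-attention over the image tokens generated so far.
        mask = nn.make_causal_mask(jnp.ones(x.shape[:2]))
        x = x + nn.SelfAttention(num_heads=self.heads)(nn.LayerNorm()(x), mask=mask)
        # Cross-attention: image tokens attend to the text-encoder states
        # (keys and values both come from the text side).
        h = nn.LayerNorm()(x)
        x = x + nn.MultiHeadDotProductAttention(num_heads=self.heads)(h, text_states)
        # Position-wise feed-forward block.
        h = nn.LayerNorm()(x)
        x = x + nn.Dense(self.dim)(nn.gelu(nn.Dense(4 * self.dim)(h)))
        return x


class ImageDecoder(nn.Module):
    dim: int = 512    # illustrative sizes, not tuned choices
    heads: int = 8
    layers: int = 4

    @nn.compact
    def __call__(self, image_tokens, text_states):
        # Token embeddings plus learned positional embeddings
        # (one of the open hyperparameter questions above).
        x = nn.Embed(VQGAN_CODES, self.dim)(image_tokens)
        pos = self.param("pos", nn.initializers.normal(0.02),
                         (IMAGE_TOKENS, self.dim))
        x = x + pos[: x.shape[1]]
        text_states = nn.Dense(self.dim)(text_states)  # project BERT dim -> dim
        for _ in range(self.layers):
            x = DecoderBlock(self.dim, self.heads)(x, text_states)
        return nn.Dense(VQGAN_CODES)(nn.LayerNorm()(x))  # next-token logits


# Smoke test with random stand-ins for BERT states and VQ-GAN codes.
rng = jax.random.PRNGKey(0)
tokens = jax.random.randint(rng, (2, IMAGE_TOKENS), 0, VQGAN_CODES)
text = jax.random.normal(rng, (2, 32, TEXT_DIM))
model = ImageDecoder()
params = model.init(rng, tokens, text)
logits = model.apply(params, tokens, text)  # shape (2, 256, 16384)
```

Note that freezing the text encoder (one of the open questions above) amounts to treating `text_states` as fixed inputs, exactly as this sketch does; fine-tuning it would mean backpropagating through the encoder as well.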
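
For the flax/jax TODO item, a first experiment on the TPU VM can be as small as confirming that JAX sees all the cores and that a `pmap`-ed computation runs across them. A minimal sanity-check sketch, assuming a v3-8-style instance:

```python
# Quick sanity check for a fresh TPU VM: list devices and run a trivial
# computation replicated across all local cores with pmap.
import jax
import jax.numpy as jnp

print(jax.devices())        # should list 8 TpuDevice entries on a v3-8
n = jax.local_device_count()

# One value per core; pmap maps the function over the leading axis,
# running each element on its own device in parallel.
out = jax.pmap(lambda x: x * 2)(jnp.arange(n, dtype=jnp.float32))
print(out)                  # [0., 2., 4., ...] computed on the TPU cores
```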