See `unidisc/datasets/preprocessing` for instructions on how to preprocess datasets.

We support the following datasets:
- Cambrian
- CapsFusion
- CC12M
- DataComp1B
- JourneyDB
- LAION400M
- MMC4
- PixelProse

Additionally, we generated our own synthetic dataset, available [on Hugging Face](https://huggingface.co/datasets/aswerdlow/unidisc_hq), and provide the [generation scripts](../unidisc/datasets/preprocessing/unidisc_dataset/README.md) as well as the raw data.
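
As a rough sketch, the synthetic dataset repository can be fetched with the `huggingface_hub` client. This is not part of the repository's own tooling: the helper names `repo_id_from_url` and `download_dataset` are hypothetical, and the file layout inside the dataset repository is not assumed here, so the whole repository is downloaded unless you pass `allow_patterns`.

```python
from urllib.parse import urlparse

# Dataset page linked above.
DATASET_URL = "https://huggingface.co/datasets/aswerdlow/unidisc_hq"


def repo_id_from_url(url: str) -> str:
    """Extract '<owner>/<name>' from a Hugging Face dataset page URL (hypothetical helper)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Hugging Face dataset URL: {url}")
    return "/".join(parts[1:3])


def download_dataset(url: str = DATASET_URL, local_dir=None, allow_patterns=None) -> str:
    """Download the dataset repository and return the local path (hypothetical helper).

    Requires `pip install huggingface_hub`. The import is kept local so the
    URL helper above works without the package installed.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id_from_url(url),
        repo_type="dataset",            # the repo lives under /datasets/
        local_dir=local_dir,            # None -> use the default HF cache
        allow_patterns=allow_patterns,  # None -> download every file
    )


if __name__ == "__main__":
    # Prints the repo id only; call download_dataset() to fetch the files.
    print(repo_id_from_url(DATASET_URL))
```

Calling `download_dataset(local_dir="data/unidisc_hq")` would place the files under that directory (the path is only an example). Once downloaded, follow the preprocessing instructions referenced above.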