timaeus/README · Pile Subsets Details

Thanks for the reminder.

These were created from monology/pile-uncopyrighted by streaming over the rows and filtering by the meta column. I limited each subset to the first 100k matching rows I encountered.
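For anyone who wants to reproduce something similar, here is a rough sketch of that stream-and-filter step. It is not the exact script I ran. It assumes the meta column is a dict with a `pile_set_name` key and that the split is called `train`; the helper names `take_subset` and `stream_pile_subset` are made up for illustration.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def take_subset(rows: Iterable[Row], subset_name: str, limit: int = 100_000) -> Iterator[Row]:
    """Lazily yield the first `limit` rows whose meta matches `subset_name`.

    Assumption: each row looks like {"text": ..., "meta": {"pile_set_name": ...}}.
    """
    matching = (
        row for row in rows
        if row["meta"].get("pile_set_name") == subset_name
    )
    # islice stops pulling from the stream once `limit` rows have been yielded.
    return islice(matching, limit)


def stream_pile_subset(subset_name: str, limit: int = 100_000) -> Iterator[Row]:
    """Stream the source dataset from the Hub and filter it on the fly."""
    # Imported lazily so take_subset can be used without `datasets` installed.
    from datasets import load_dataset

    rows = load_dataset(
        "monology/pile-uncopyrighted",
        split="train",  # assumed split name
        streaming=True,  # iterate without downloading the full dataset
    )
    return take_subset(rows, subset_name, limit)
```

Because everything is lazy, a sparse subset means scanning a large part of the stream before the cap is reached, which is why those subsets were slow to fill.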

Note that a few of the subsets have fewer than 100k rows, because they are especially sparse in the full dataset and I kept hitting rate limits and running into other problems. For these subsets, the rows aren't necessarily the first ones encountered; they're unevenly distributed across the dataset because I parallelized the runs. I may try finishing these runs in the future to increase the number of rows, which could be a breaking change for those subsets. I'll add a README in the meantime. Thanks for the suggestion.