Pile Subsets Details
Hey, thanks for your work!
Are there any plans to release details of the filtering process?
I'm planning to use them, and I (and probably others too) would benefit from transparency about how the subsets were created (is it the first 100k rows, a seeded shuffle, etc.).
Just adding these details to the README would be very helpful, thank you!
Thanks for the reminder.
These were created from monology/pile-uncopyrighted by streaming over the rows and filtering by the meta column. I limited each subset to the first 100k matching rows I encountered.
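For anyone who wants to reproduce something similar, here is a rough sketch of that streaming-and-filtering step. It is not the exact script used for these subsets. It assumes the meta column is a dict with a `pile_set_name` key, and the function names, the `meta_key` parameter, and the `"ArXiv"` example value are illustrative assumptions.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List


def first_n_from_subset(
    rows: Iterable[Dict[str, Any]],
    subset_name: str,
    limit: int = 100_000,
    meta_key: str = "pile_set_name",  # assumed key inside the meta column
) -> List[Dict[str, Any]]:
    """Return the first `limit` rows whose meta[meta_key] equals `subset_name`.

    Rows are consumed lazily, so iteration stops as soon as `limit`
    matching rows have been collected.
    """
    matching = (
        row for row in rows
        if (row.get("meta") or {}).get(meta_key) == subset_name
    )
    return list(islice(matching, limit))


def stream_pile_subset(subset_name: str, limit: int = 100_000) -> List[Dict[str, Any]]:
    """Hypothetical wrapper: stream the source dataset and keep one subset.

    Requires the `datasets` package and network access; not called here.
    """
    from datasets import load_dataset

    stream = load_dataset(
        "monology/pile-uncopyrighted", split="train", streaming=True
    )
    return first_n_from_subset(stream, subset_name, limit)
```

Calling something like `stream_pile_subset("ArXiv")` would walk the stream in order, so the result depends on the order in which the source dataset yields rows, not on any shuffle or seed.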
Note that a few of the subsets have fewer than 100k rows, because those sources are especially sparse in the full dataset and I kept hitting rate limits and running into other problems. The rows in these subsets aren't necessarily the first ones encountered; they're unevenly distributed across the dataset because I parallelized the runs. I may try finishing these runs in the future to increase the number of rows, which could be a breaking change for those subsets. I'll add a README in the meantime. Thanks for the suggestion.