Bridging the Data Provenance Gap Across Text, Speech and Video Paper • 2412.17847 • Published Dec 19, 2024 • 10
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published Jun 5, 2025 • 60