The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25, 2024 • 98
Evaluate & Evaluation on the Hub: Better Best Practices for Data and Model Measurements Paper • 2210.01970 • Published Sep 30, 2022 • 13
Evaluating the Social Impact of Generative AI Systems in Systems and Society Paper • 2306.05949 • Published Jun 9, 2023 • 9
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus Paper • 2104.08758 • Published Apr 18, 2021
SEAL: Interactive Tool for Systematic Error Analysis and Labeling Paper • 2210.05839 • Published Oct 11, 2022
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Paper • 2303.03915 • Published Mar 7, 2023 • 7
Stable Bias: Analyzing Societal Representations in Diffusion Models Paper • 2303.11408 • Published Mar 20, 2023 • 1
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper • 2211.05100 • Published Nov 9, 2022 • 32