The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset • Paper • 2303.03915 • Published Mar 7, 2023
StarCoder 2 and The Stack v2: The Next Generation • Paper • 2402.19173 • Published Feb 29, 2024
PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts • Paper • 2202.01279 • Published Feb 2, 2022
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP • Paper • 2112.10508 • Published Dec 20, 2021
How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts • Paper • 2205.10762 • Published May 22, 2022
Multitask Prompted Training Enables Zero-Shot Task Generalization • Paper • 2110.08207 • Published Oct 15, 2021