Masader: Metadata Sourcing for Arabic Text and Speech Data Resources Paper • 2110.06744 • Published Oct 13, 2021
Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training Paper • 2410.20796 • Published Oct 28, 2024
Ashaar: Automatic Analysis and Generation of Arabic Poetry Using Deep Learning Approaches Paper • 2307.06218 • Published Jul 12, 2023
MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs Paper • 2505.19800 • Published May 26, 2025 • 1
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic Paper • 2402.12840 • Published Feb 20, 2024 • 1
PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts Paper • 2202.01279 • Published Feb 2, 2022
Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP Paper • 2112.10508 • Published Dec 20, 2021
Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources Paper • 2201.10066 • Published Jan 25, 2022
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning Paper • 2402.06619 • Published Feb 9, 2024 • 57
CIDAR: Culturally Relevant Instruction Dataset For Arabic Paper • 2402.03177 • Published Feb 5, 2024 • 7
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset Paper • 2303.03915 • Published Mar 7, 2023 • 7
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper • 2211.05100 • Published Nov 9, 2022 • 32
Multitask Prompted Training Enables Zero-Shot Task Generalization Paper • 2110.08207 • Published Oct 15, 2021 • 2
Crosslingual Generalization through Multitask Finetuning Paper • 2211.01786 • Published Nov 3, 2022 • 2