Zaid Alyafeai

Zaid

https://github.com/zaidalyafeai

AI & ML interests

Arabic Language Modeling

Recent Activity

updated a model 14 days ago

IVUL-KAUST/MeXtract-3B

authored a paper 16 days ago

MeXtract: Light-Weight Metadata Extraction from Scientific Papers

published a model 17 days ago

IVUL-KAUST/MeXtract-0.5B

authored a paper 16 days ago

MeXtract: Light-Weight Metadata Extraction from Scientific Papers

Paper • 2510.06889 • Published 18 days ago • 1

authored 5 papers 5 months ago

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

Paper • 2110.06744 • Published Oct 13, 2021

Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic

Paper • 2412.04277 • Published Dec 5, 2024

Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training

Paper • 2410.20796 • Published Oct 28, 2024

Ashaar: Automatic Analysis and Generation of Arabic Poetry Using Deep Learning Approaches

Paper • 2307.06218 • Published Jul 12, 2023

MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Paper • 2505.19800 • Published May 26 • 2

authored 6 papers over 1 year ago

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic

Paper • 2402.12840 • Published Feb 20, 2024 • 1

PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts

Paper • 2202.01279 • Published Feb 2, 2022

Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP

Paper • 2112.10508 • Published Dec 20, 2021

Documenting Geographically and Contextually Diverse Data Sources: The BigScience Catalogue of Language Data and Resources

Paper • 2201.10066 • Published Jan 25, 2022

Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning

Paper • 2402.06619 • Published Feb 9, 2024 • 56

CIDAR: Culturally Relevant Instruction Dataset For Arabic

Paper • 2402.03177 • Published Feb 5, 2024 • 7

authored 4 papers over 2 years ago

The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset

Paper • 2303.03915 • Published Mar 7, 2023 • 7

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Paper • 2211.05100 • Published Nov 9, 2022 • 34

Multitask Prompted Training Enables Zero-Shot Task Generalization

Paper • 2110.08207 • Published Oct 15, 2021 • 2

Crosslingual Generalization through Multitask Finetuning

Paper • 2211.01786 • Published Nov 3, 2022 • 2