# Should We Still Pretrain Encoders with Masked Language Modeling?
[arXiv](http://arxiv.org/abs/2507.00994)
[GitHub](https://github.com/Nicolas-BZRD/EuroBERT/tree/MLM_vs_CLM)
[Blog](https://huggingface.co/blog/Nicolas-BZRD/encoders-should-not-be-only-pre-trained-with-mlm)

## Abstract
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models.
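
For intuition, the sketch below contrasts the two pretraining objectives and the biphasic CLM-then-MLM schedule in plain PyTorch. It is not the Optimus training code: the model interface, mask-token id, 30% masking ratio, and 50/50 phase split are illustrative assumptions, and attention handling (causal for CLM, bidirectional for MLM) as well as padding are omitted for brevity.

```python
import torch
import torch.nn.functional as F

MASK_ID = 103   # hypothetical [MASK] token id
IGNORE = -100   # label value excluded from the cross-entropy loss

def clm_loss(model, input_ids):
    """Causal LM: each position predicts the next token (labels shifted by one)."""
    logits = model(input_ids)  # assumed to return (batch, seq, vocab) logits
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def mlm_loss(model, input_ids, mask_ratio=0.3):
    """Masked LM: corrupt a random subset of tokens and predict only those."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape, device=input_ids.device) < mask_ratio
    labels[~masked] = IGNORE  # loss is computed on masked positions only
    logits = model(input_ids.masked_fill(masked, MASK_ID))
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=IGNORE,
    )

def biphasic_train(model, batches, total_steps, clm_fraction=0.5, lr=1e-4):
    """Spend the first part of the token budget on CLM, then continue with MLM."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step, input_ids in zip(range(total_steps), batches):
        objective = clm_loss if step < clm_fraction * total_steps else mlm_loss
        loss = objective(model, input_ids)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```

See the preprint for the actual pretraining recipe, masking ratios, and phase splits used in the experiments.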
## Resources
- **[Hugging Face Project Page](https://huggingface.co/MLMvsCLM)**: The Hugging Face page that centralizes everything!
- **[Preprint](https://arxiv.org/abs/2507.00994)**: The paper with all the details!
- **[Blog](https://huggingface.co/blog/Nicolas-BZRD/encoders-should-not-be-only-pre-trained-with-mlm)**: A blog post summarizing the paper in a 5-minute read.
- **[GitHub Codebase](https://github.com/Nicolas-BZRD/EuroBERT/tree/MLM_vs_CLM)**: *Optimus Training Library* — a distributed framework for training encoders at scale.
- **[Model: EuroBERT](https://huggingface.co/EuroBERT)**: The model architecture used in the experiments (see the loading sketch below).
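
The released checkpoints can be used as text encoders with Hugging Face `transformers`. Below is a minimal, illustrative loading sketch: the checkpoint name is a placeholder (browse the Hugging Face project page for the actual model ids), and `trust_remote_code=True` is only needed if the checkpoint ships custom modeling code.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder id: replace with a real checkpoint from https://huggingface.co/MLMvsCLM
model_id = "MLMvsCLM/<checkpoint-name>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

sentences = ["Should we still pretrain encoders with masked language modeling?"]
inputs = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one embedding per sentence.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```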
## Contact of the first authors
- Hippolyte Gisserot-Boukhlef: [email protected]
- Nicolas Boizard: [email protected]
## Citation
If you use this project in your research, please cite the original paper as follows:
```bibtex
@misc{gisserotboukhlef2025pretrainencodersmaskedlanguage,
      title={Should We Still Pretrain Encoders with Masked Language Modeling?},
      author={Hippolyte Gisserot-Boukhlef and Nicolas Boizard and Manuel Faysse and Duarte M. Alves and Emmanuel Malherbe and André F. T. Martins and Céline Hudelot and Pierre Colombo},
      year={2025},
      eprint={2507.00994},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.00994},
}
```