hgissbkh committed · verified
Commit aec6f09 · 1 Parent(s): 5dfe398

Update README.md

Files changed (1): README.md (+18 -11)
README.md CHANGED
@@ -1,8 +1,6 @@
  # Should We Still Pretrain Encoders with Masked Language Modeling?
 
- [![arXiv](https://img.shields.io/badge/arXiv-2503.05500-b31b1b.svg?style=for-the-badge)](http://arxiv.org/abs/2507.00994)
- [![GitHub](https://img.shields.io/badge/Code_Repository-100000?style=for-the-badge&logo=github&logoColor=white)](https://github.com/Nicolas-BZRD/EuroBERT/tree/MLM_vs_CLM)
- [![Blog Post](https://img.shields.io/badge/Blog_Post-018EF5?logo=readme&logoColor=fff&style=for-the-badge)](https://huggingface.co/blog/Nicolas-BZRD/encoders-should-not-be-only-pre-trained-with-mlm)
 
  ![MLMvsCLM](https://raw.githubusercontent.com/Nicolas-BZRD/EuroBERT/refs/heads/MLM_vs_CLM/docs/images/hf_card_without.png)
 
@@ -10,23 +8,32 @@
 
  Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models.
 
- ## Ressources
 
- - **[Hugging Face Project Page](https://huggingface.co/MLMvsCLM)**: The Hugging Face page that centralizes everything!
- - **[Preprint](https://arxiv.org/abs/2507.00994)**: The paper with all the details!
- - **[Blog](https://huggingface.co/blog/Nicolas-BZRD/encoders-should-not-be-only-pre-trained-with-mlm)**: A blog post summarizing the paper in a 5-minute read.
- - **[GitHub Codebase](https://github.com/Nicolas-BZRD/EuroBERT/tree/MLM_vs_CLM)**: *Optimus Training Library* — a scalable distributed training framework for training encoders at scale.
- - **[Model: EuroBERT](https://huggingface.co/EuroBERT)**: The model architecture used in the experiments.
 
- ## Contact of the first-authors
 
  - Hippolyte Gisserot-Boukhlef : [email protected]
  - Nicolas Boizard : [email protected]
 
  ## Citation
 
- If you use this project in your research, please cite the original paper as follows:
 
  ```bibtex
  @misc{gisserotboukhlef2025pretrainencodersmaskedlanguage,
 
  # Should We Still Pretrain Encoders with Masked Language Modeling?
 
+ This page gathers all the artefacts and references related to the paper “Should We Still Pretrain Encoders with Masked Language Modeling?” (Gisserot-Boukhlef et al.).
 
  ![MLMvsCLM](https://raw.githubusercontent.com/Nicolas-BZRD/EuroBERT/refs/heads/MLM_vs_CLM/docs/images/hf_card_without.png)
 
  Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models.
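
To make the biphasic strategy concrete, here is a minimal, self-contained sketch of a CLM-then-MLM schedule. It is purely illustrative: a toy model on random token data with made-up hyperparameters, not the paper's recipe or the Optimus training code. The key moving parts are the switch from next-token prediction under a causal attention mask to masked-token prediction under bidirectional attention.

```python
# Illustrative sketch only: a toy CLM -> MLM biphasic schedule (not the paper's setup).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, MASK_ID, SEQ_LEN, D_MODEL = 1000, 1, 32, 64

class TinyEncoder(nn.Module):
    """A small transformer that can run with either a causal or a bidirectional mask."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, ids, causal: bool):
        # Upper-triangular boolean mask blocks attention to future tokens (CLM phase);
        # no mask means fully bidirectional attention (MLM phase).
        attn_mask = torch.triu(torch.ones(ids.size(1), ids.size(1), dtype=torch.bool), 1) if causal else None
        return self.lm_head(self.backbone(self.emb(ids), mask=attn_mask))

def clm_loss(model, ids):
    # Causal LM: predict token t+1 from tokens up to t.
    logits = model(ids, causal=True)
    return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), ids[:, 1:].reshape(-1))

def mlm_loss(model, ids, mask_ratio=0.4):
    # Masked LM: corrupt a fraction of tokens and predict only those positions.
    ids, labels = ids.clone(), ids.clone()
    masked = torch.rand(ids.shape) < mask_ratio
    labels[~masked] = -100          # ignore unmasked positions in the loss
    ids[masked] = MASK_ID
    logits = model(ids, causal=False)
    return F.cross_entropy(logits.reshape(-1, VOCAB), labels.reshape(-1), ignore_index=-100)

model = TinyEncoder()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
TOTAL_STEPS, CLM_STEPS = 100, 25    # made-up budget split between the two phases
for step in range(TOTAL_STEPS):
    batch = torch.randint(2, VOCAB, (8, SEQ_LEN))   # random "text" as a stand-in for real data
    loss = clm_loss(model, batch) if step < CLM_STEPS else mlm_loss(model, batch, mask_ratio=0.4)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The 40% masking ratio used here simply mirrors the `mlm40` setting that appears in the model names below.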
 
+ ## Resources
 
+ - **[Preprint](https://arxiv.org/abs/2507.00994)**: For the full details of our work
+ - **[Blog post](https://huggingface.co/blog/Nicolas-BZRD/encoders-should-not-be-only-pre-trained-with-mlm)**: A quick overview if you only have 5 minutes
+ - **[EuroBERT](https://huggingface.co/EuroBERT)**: The encoder model architecture used in our experiments
+ - **[Training codebase](https://github.com/Nicolas-BZRD/EuroBERT/tree/MLM_vs_CLM)**: *Optimus*, our distributed framework for training encoders at scale
+ - **[Evaluation codebase](https://github.com/hgissbkh/EncodEval/tree/MLM_vs_CLM)**: *EncodEval*, our framework for evaluating encoder models across a wide range of representation tasks
 
+ ## Models
 
+ We release all the models trained and evaluated in the paper. Model names follow the conventions below, illustrated in the short sketch after the list.
+
+ * Model names follow the format `[model size]-[objective]-[number of steps]`: e.g., `610m-clm-42k` refers to a 610M-parameter model trained with CLM for 42k steps.
+ * For models trained in two stages, names follow the extended format `[model size]-[objective #1]-[number of steps #1]-[objective #2]-[number of steps #2]`, where `[number of steps #2]` indicates the total number of training steps: e.g., `610m-clm-10k-mlm40-42k` is a 610M model first trained with CLM for 10k steps, then further trained with MLM (using a 40% masking ratio) for an additional 32k steps, totaling 42k.
+ * Models that were continued from a decayed checkpoint use the "dec" prefix for the first step count: e.g., `610m-clm-dec42k-mlm40-64k` represents a 610M model first trained and decayed with CLM for 42k steps, then continued with MLM (40% masking ratio) for 22k more steps, totaling 64k.
+ * By default, model names refer to the final checkpoint. Intermediate checkpoints are indicated by appending the step number at the end: e.g., `610m-mlm40-42k-1000` corresponds to checkpoint 1,000 of a 610M model trained with MLM (40% masking) for 42k steps.
+
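
As a reading aid, here is a minimal sketch that splits such a name into its components. The `parse_model_name` helper is hypothetical (not part of the released codebases); to actually load a checkpoint, the standard `transformers` `from_pretrained` route applies, with the exact repository id taken from the model listing on this page.

```python
# Illustrative only: a tiny helper (not part of the release) that splits a model
# name such as "610m-clm-dec42k-mlm40-64k" or "610m-mlm40-42k-1000" into its parts.
import re

def parse_model_name(name: str) -> dict:
    parts = name.split("-")
    info = {"size": parts[0], "phases": [], "checkpoint": None}   # e.g. size "610m"
    # A trailing plain integer (no "k" suffix) is an intermediate checkpoint number.
    if re.fullmatch(r"\d+", parts[-1]):
        info["checkpoint"] = int(parts.pop())
    # The remaining tokens come in (objective, steps) pairs.
    rest = parts[1:]
    for objective, steps in zip(rest[0::2], rest[1::2]):
        info["phases"].append({
            "objective": "mlm" if objective.startswith("mlm") else "clm",
            # "mlm40" encodes a 40% masking ratio; plain "clm" has none.
            "masking_ratio": int(objective[3:]) / 100 if objective.startswith("mlm") and objective[3:] else None,
            # A "dec" prefix marks a phase that started from a decayed checkpoint.
            "from_decayed_checkpoint": steps.startswith("dec"),
            "steps": steps.removeprefix("dec"),   # cumulative total for a second phase
        })
    return info

print(parse_model_name("610m-clm-10k-mlm40-42k"))
print(parse_model_name("610m-mlm40-42k-1000"))
```
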
 
+ ## First authors' contact information
 
  - Hippolyte Gisserot-Boukhlef : [email protected]
  - Nicolas Boizard : [email protected]
 
  ## Citation
 
+ If you found our work useful, please consider citing our paper:
 
  ```bibtex
  @misc{gisserotboukhlef2025pretrainencodersmaskedlanguage,