przvl committed on
Commit
6580e14
·
verified ·
1 Parent(s): da64fe8

Create README.md

Files changed (1)
  1. README.md +192 -0
README.md ADDED
@@ -0,0 +1,192 @@
+ ---
+ library_name: transformers
+ license: apache-2.0
+ language:
+ - de
+ pipeline_tag: text-classification
+ tags:
+ - populism
+ - political-speech
+ - classification
+ - german
+ - Bundestag
+ - NLP
+ base_model:
+ - EuroBERT/EuroBERT-210m
+ ---
+
+ # PopEuroBERT-610m
+
+ ## Binary Populism Classifier for German Bundestag Speeches
+
+ ## Table of Contents
+
+ 1. [Overview](#overview)
+ 2. [Usage](#usage)
+ 3. [Training Data](#training-data)
+ 4. [Training Procedure](#training-procedure)
+ 5. [Evaluation](#evaluation)
+ 6. [Limitations](#limitations)
+ 7. [Ethical Considerations](#ethical-considerations)
+ 8. [License](#license)
+ 9. [Citation](#citation)
+
+ ## Overview
+
+ This model is a fine-tuned version of [EuroBERT-210m](https://huggingface.co/EuroBERT/EuroBERT-210m) on the [PopBERT](https://huggingface.co/luerhard/PopBERT) dataset (German Bundestag speeches) for **populist rhetoric classification**. It predicts whether a given speech excerpt contains populist language.
+
+ **Key Features:**
+
+ - Trained on **German Bundestag speeches** annotated for populism.
+ - Fine-tuned using **5-fold cross-validation**.
+ - Optimized with **decision threshold tuning**.
+
+ ## Usage
+
+ To use the model in Python:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer
+ from transformers import AutoModelForSequenceClassification
+
+
+ model_id = "przvl/PopEuroBERT-binary-610m"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(
+     model_id, trust_remote_code=True
+ )
+
+
+ # define text to be predicted
+ # (the trailing space keeps the two adjacent string literals from fusing words)
+ text = (
+     "Aber Ihnen fehlt eben der Mut, Ihnen fehlen die Visionen, um sich "
+     "gegen die Konzerne und gegen die Lobbygruppen zur Wehr zu setzen."
+ )
+
+ inputs = tokenizer(text, return_tensors="pt")
+ with torch.no_grad():  # inference only, no gradients needed
+     outputs = model(**inputs)
+
+ # get classification probability
+ logits = outputs.logits
+ probs = torch.softmax(logits, dim=-1)  # shape [1, 2]
+ populist_prob = probs[0, 1].item()  # probability of class=1 (populist)
+
+ # use decision threshold 0.43
+ threshold = 0.43
+ label = "Populist" if populist_prob > threshold else "Neutral"
+ print(f"Predicted class: {label} (Confidence: {populist_prob:.2f})")
+ ```
+
+ ```text
+ Predicted class: Populist (Confidence: 0.62)
+ ```
+
+ **Use a decision threshold of `0.43` for balanced [performance](#evaluation).**
+
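The thresholding step from the usage snippet can be isolated into a small helper. The following is a minimal sketch in plain Python, assuming two-class logits ordered as `[neutral, populist]` as in the example above. The names `softmax` and `label_from_logits` are hypothetical and not part of the model's API, and the logits are illustrative values rather than real model outputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits, threshold=0.43):
    """Return (label, populist_prob) for one pair of logits.

    Assumes index 1 is the populist class, as in the usage example.
    """
    populist_prob = softmax(logits)[1]
    label = "Populist" if populist_prob > threshold else "Neutral"
    return label, populist_prob


# Illustrative logits only, not actual model outputs.
examples = [[0.2, 0.7], [1.5, -0.5], [0.1, -0.1]]
for logits in examples:
    label, prob = label_from_logits(logits)
    print(f"{logits} -> {label} ({prob:.2f})")
```

The third example has a populist probability of roughly 0.45, so it is labeled Populist at the tuned threshold of `0.43` but would be labeled Neutral at the default `0.5`.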
+ ## Training Data
+
+ - **Dataset:** [PopBERT](https://github.com/luerhard/PopBERT)
+   - Sentence-level annotated German Bundestag speeches
+   - Train/test split: `7017`/`1758` sentences
+ - **Preprocessing:**
+   - Converted labels to binary format (`populist = 1`, `neutral = 0`).
+   - Tokenized using the **EuroBERT tokenizer** with a max length of `256` tokens.
+
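The card does not state the exact rule used to collapse the PopBERT annotations into a binary label. As a rough sketch only, one plausible rule is to mark a sentence as populist when any populism-related dimension is flagged. The column names and the helper `to_binary_label` below are hypothetical and may not match the actual dataset export or the author's preprocessing.

```python
# Hypothetical column names; the actual PopBERT export may differ.
POPULISM_COLUMNS = ("anti_elitism", "people_centrism")


def to_binary_label(row, columns=POPULISM_COLUMNS):
    """Return 1 (populist) if any listed dimension is flagged, else 0 (neutral)."""
    return int(any(row.get(col, 0) for col in columns))


# Toy rows for illustration, not real dataset entries.
rows = [
    {"text": "Satz A", "anti_elitism": 1, "people_centrism": 0},
    {"text": "Satz B", "anti_elitism": 0, "people_centrism": 0},
]
binary_labels = [to_binary_label(row) for row in rows]
print(binary_labels)  # [1, 0]
```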
+ ## Training Procedure
+
+ - **Base Model:** [EuroBERT-210m](https://huggingface.co/EuroBERT/EuroBERT-210m)
+ - **Fine-tuning Approach:**
+   - Used **Hugging Face Trainer** for training.
+   - Applied **5-fold cross-validation**.
+   - Tuned the **decision threshold** on aggregated predictions.
+
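The cross-validation and threshold-tuning steps listed above could look roughly like the sketch below. It uses plain Python and makes several assumptions the card does not confirm: folds are unstratified, the tuning criterion is F1, and candidate thresholds are swept in steps of 0.01. The helpers `kfold_indices`, `f1_score`, and `tune_threshold` are hypothetical names, and the probabilities are toy values.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Split shuffled indices into n_splits nearly equal validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(y_true, probs, candidates=None):
    """Return the candidate threshold with the highest F1 (lowest one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(
        candidates,
        key=lambda t: f1_score(y_true, [int(p > t) for p in probs]),
    )


# Each fold serves once as the validation set.
folds = kfold_indices(n_samples=10, n_splits=5)
print([len(fold) for fold in folds])

# Toy out-of-fold probabilities, pooled across folds before tuning.
y_true = [0, 0, 1, 1, 1]
oof_probs = [0.10, 0.40, 0.45, 0.60, 0.90]
best_threshold = tune_threshold(y_true, oof_probs)
print(best_threshold)
```

Pooling the out-of-fold predictions before the sweep gives one threshold for the whole dataset rather than one per fold.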
+ ### Hyperparameters
+
+ | Parameter | Value |
+ | --------------------- | ------- |
+ | Learning Rate | `1e-05` |
+ | Weight Decay | `0.1` |
+ | Gradient Accumulation | `1` |
+ | Warmup Ratio | `0.1` |
+ | Epochs | `3` |
+ | Batch Size | `128` |
+ | Max Length | `256` |
+
+ - **Mixed Precision (bf16):** Used for efficiency on GPU.
+
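For readers reproducing the setup, the table above maps onto keyword arguments of `transformers.TrainingArguments` roughly as sketched below. This is an assumption about how the values were passed, not the author's actual training script. In particular, the card does not say whether the batch size of 128 is per device or in total, and the output directory name is hypothetical. The max length of 256 belongs to tokenization and is therefore not part of these arguments.

```python
# Keys follow `transformers.TrainingArguments` parameter names.
training_kwargs = {
    "learning_rate": 1e-5,
    "weight_decay": 0.1,
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.1,
    "num_train_epochs": 3,
    # Assumption: the card does not say whether 128 is per device or total.
    "per_device_train_batch_size": 128,
    "bf16": True,
}

# Effective batch size, assuming a single device.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch_size)  # 128

# With transformers installed, the dictionary could be passed on like this
# (the output directory name is hypothetical):
# from transformers import TrainingArguments
# args = TrainingArguments(output_dir="popeurobert-out", **training_kwargs)
```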
+ ## Evaluation
+
+ ### Test Set Results (Threshold = 0.5)
+
+ | Metric | Score |
+ | --------- | ------ |
+ | Accuracy | 80.26% |
+ | Precision | 78.42% |
+ | Recall | 83.50% |
+ | F1 Score | 80.89% |
+ | Loss | 0.4631 |
+
+ ### Test Set Results (Optimized Threshold = 0.43)
+
+ | Metric | Score |
+ | --------- | ------ |
+ | Accuracy | 79.81% |
+ | Precision | 76.63% |
+ | Recall | 85.78% |
+ | F1 Score | 80.94% |
+
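As a quick consistency check, the F1 scores in both tables can be recomputed from the reported precision and recall using the harmonic mean. The sketch below uses only the numbers printed above. The helper name `f1_from_precision_recall` is introduced here for illustration.

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the two result tables, as fractions.
reported = {
    "threshold_0.50": {"precision": 0.7842, "recall": 0.8350, "f1": 0.8089},
    "threshold_0.43": {"precision": 0.7663, "recall": 0.8578, "f1": 0.8094},
}

for name, metrics in reported.items():
    recomputed = f1_from_precision_recall(metrics["precision"], metrics["recall"])
    print(f"{name}: recomputed F1 = {recomputed:.4f}, reported = {metrics['f1']:.4f}")
```

Both recomputed values agree with the reported F1 scores to within about 0.01 percentage points, which is consistent with rounding in the published precision and recall.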
+ ## Limitations
+
+ - **Domain Specificity:**
+   This model was trained on Bundestag speeches and may not generalize to other kinds of political discourse.
+ - **Threshold Sensitivity:**
+   The decision threshold (`0.43`) was optimized for this dataset and may need adjustment for other corpora.
+ - **Potential Bias:**
+   The model learns from human annotations, so any biases inherent in the dataset labeling carry over to its predictions.
+
+ ## Ethical Considerations
+
+ - **Not suitable for high-stakes decision-making.**
+   This model is meant for **research purposes** in political discourse analysis.
+ - **Bias & Context Dependence:**
+   Populism is a complex concept. Automated detection should **not replace** human interpretation.
+ - **Transparent Use:**
+   Users should document and validate model outputs in their research.
+
+ ## License
+
+ Released under the **Apache 2.0 License**.
+
+ ## Citation
+
+ If you use this model or its methodology, please cite:
+
+ - **The original EuroBERT paper:**
+
+ ```bibtex
+ @misc{boizard2025eurobertscalingmultilingualencoders,
+   title={EuroBERT: Scaling Multilingual Encoders for European Languages},
+   author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
+   year={2025},
+   eprint={2503.05500},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2503.05500}
+ }
+ ```
+
+ - **The PopBERT dataset source:**
+
+ ```bibtex
+ @article{Erhard_Hanke_Remer_Falenska_Heiberger_2025,
+   title={PopBERT. Detecting Populism and Its Host Ideologies in the German Bundestag},
+   volume={33},
+   DOI={10.1017/pan.2024.12},
+   number={1},
+   journal={Political Analysis},
+   author={Erhard, Lukas and Hanke, Sara and Remer, Uwe and Falenska, Agnieszka and Heiberger, Raphael Heiko},
+   year={2025},
+   pages={1--17}
+ }
+ ```