---
language: en
license: mit
library_name: transformers
tags:
- climate-change
- domain-adaptation
- masked-language-modeling
- scientific-nlp
- transformer
- BERT
- ClimateBERT
metrics:
- f1
model-index:
- name: SciClimateBERT
  results:
  - task:
      type: text-classification
      name: Climate NLP Tasks (ClimaBench)
    dataset:
      name: ClimaBench
      type: benchmark
    metrics:
    - type: f1
      name: Macro F1 (avg)
      value: 57.829
---
# SciClimateBERT 🌎🔬
SciClimateBERT is a domain-adapted version of ClimateBERT, further pretrained on peer-reviewed scientific papers focused on climate change. While ClimateBERT is tuned for general climate-related text, SciClimateBERT narrows the focus to high-quality academic content, improving performance in scientific NLP applications.
## 🔍 Overview
- Base Model: ClimateBERT (RoBERTa-based architecture)
- Pretraining Method: Continued pretraining (domain adaptation) with Masked Language Modeling (MLM)
- Corpus: Scientific climate change literature from top-tier journals
- Tokenizer: ClimateBERT tokenizer (unchanged)
- Language: English
- Domain: Scientific climate change research
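For illustration, continued MLM pretraining of this kind can be run with the Hugging Face `Trainer`. The sketch below is an outline under stated assumptions, not the exact recipe used for SciClimateBERT: the base checkpoint ID, corpus file, and hyperparameters are placeholders.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Assumed ClimateBERT starting checkpoint; the tokenizer is kept unchanged
base = "climatebert/distilroberta-base-climate-f"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Illustrative corpus: scientific climate papers, one document per line
dataset = load_dataset("text", data_files={"train": "climate_papers.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked at each training step
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="sciclimatebert-mlm",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```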
## 📊 Performance
Evaluated on ClimaBench, a benchmark suite for climate-focused NLP tasks:
| Metric | Value |
|---|---|
| Macro F1 (avg) | 57.83 |
| Tasks won | 0/7 |
| Avg. std. dev. | 0.01747 |
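ClimaBench tasks are cast as text classification, so evaluation amounts to fine-tuning a sequence-classification head per task. A minimal sketch follows; it assumes a locally stored task with `text` and `label` columns and is not the exact protocol behind the scores above.

```python
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Repository ID as used in the usage example in this card
model_name = "P0L3/clirebert_clirevocab_uncased"

# Placeholder: a ClimaBench-style task stored locally as CSV
data = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})
num_labels = len(set(data["train"]["label"]))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="climabench_run",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```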
While based on ClimateBERT, this model is adapted to structured scientific text, making it suitable for downstream applications in climate science and research automation.
Climate performance model card:
| | SciClimateBERT |
|---|---|
| 1. Model publicly available? | Yes |
| 2. Time to train final model | 300 h |
| 3. Time for all experiments | 1,226 h (~51 days) |
| 4. Power of GPU and CPU | 0.250 kW + 0.013 kW |
| 5. Location for computations | Croatia |
| 6. Energy mix at location | 224.71 gCO₂eq/kWh |
| 7. CO₂eq for final model | 18 kg CO₂eq |
| 8. CO₂eq for all experiments | 74 kg CO₂eq |
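The reported footprint is consistent with the usual energy-times-carbon-intensity estimate. A quick back-of-the-envelope check using only the numbers from the table above (small gaps are presumably rounding or overhead):

```python
# Rough check of the reported CO2eq figures from the table above
power_kw = 0.250 + 0.013           # GPU + CPU draw
carbon_intensity = 224.71          # gCO2eq per kWh (Croatia)

final_model_kwh = power_kw * 300        # ~78.9 kWh
all_experiments_kwh = power_kw * 1226   # ~322.4 kWh

print(final_model_kwh * carbon_intensity / 1000)      # ~17.7 kg CO2eq (reported: 18 kg)
print(all_experiments_kwh * carbon_intensity / 1000)  # ~72.5 kg CO2eq (reported: 74 kg)
```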
## 🧪 Intended Uses
Use for:
- Scientific climate change text classification and extraction
- Knowledge base and knowledge graph construction in climate policy and research domains (see the embedding sketch below)
Not suitable for:
- Non-scientific general-purpose text
- Multilingual applications
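For the knowledge-base and graph-construction use case, one common approach is to use the encoder for sentence embeddings. A minimal sketch with mean pooling follows; the repository ID is the one used in the example below, and the sentences are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Repository ID as used in the usage example in this card
model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)
encoder.eval()

sentences = [
    "Ocean heat content has increased steadily since the 1970s.",
    "Permafrost thaw releases both carbon dioxide and methane.",
]

# Mean-pooled token embeddings as sentence vectors for clustering or entity linking
with torch.no_grad():
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)        # (batch, seq, 1)
    embeddings = (hidden * mask).sum(1) / mask.sum(1)   # average over real tokens only

print(embeddings.shape)
```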
Example:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Run the pipeline on GPU if one is available (device=-1 means CPU)
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the <mask> balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10 * ">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```
Output:
```
The increase in greenhouse gas emissions has significantly affected the <mask> balance of the Earth.
>>>>>>>>>>
The increase in greenhouse gas ... affected the energy balance of the Earth. — 0.7897
The increase in greenhouse gas ... affected the radiation balance of the Earth. — 0.0522
The increase in greenhouse gas ... affected the mass balance of the Earth. — 0.0401
The increase in greenhouse gas ... affected the water balance of the Earth. — 0.0359
The increase in greenhouse gas ... affected the carbon balance of the Earth. — 0.0190
```
## ⚠️ Limitations
- May reflect scientific publication biases
## 🧾 Citation
If you use this model, please cite:
```bibtex
@article{poleksic_etal_2025,
  title   = {Climate Research Domain BERTs: Pretraining, Adaptation, and Evaluation},
  author  = {Poleksić, Andrija and Martinčić-Ipšić, Sanda},
  journal = {PREPRINT (Version 1)},
  year    = {2025},
  doi     = {10.21203/rs.3.rs-6644722/v1}
}
```