---
language: en
license: mit
library_name: transformers
tags:
  - climate-change
  - domain-adaptation
  - masked-language-modeling
  - scientific-nlp
  - transformer
  - BERT
  - ClimateBERT
metrics:
  - f1
model-index:
  - name: SciClimateBERT
    results:
      - task:
          type: text-classification
          name: Climate NLP Tasks (ClimaBench)
        dataset:
          name: ClimaBench
          type: benchmark
        metrics:
          - type: f1
            name: Macro F1 (avg)
            value: 57.829
---

# SciClimateBERT 🌎🔬

SciClimateBERT is a domain-adapted version of ClimateBERT, further pretrained on peer-reviewed scientific papers focused on climate change. While ClimateBERT is tuned for general climate-related text, SciClimateBERT narrows the focus to high-quality academic content, improving performance in scientific NLP applications.

## 🔍 Overview

- Base Model: ClimateBERT (RoBERTa-based architecture)
- Pretraining Method: Continued pretraining (domain adaptation) with Masked Language Modeling (MLM); see the sketch after this list
- Corpus: Scientific climate change literature from top-tier journals
- Tokenizer: ClimateBERT tokenizer (unchanged)
- Language: English
- Domain: Scientific climate change research
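
The pretraining method above amounts to continued MLM training of a ClimateBERT checkpoint on a scientific corpus. Below is a minimal sketch of that setup, assuming a plain-text corpus file; the corpus path (`climate_papers.txt`), the base checkpoint id, and the hyperparameters are illustrative, not the authors' exact configuration.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Assumed ClimateBERT starting checkpoint (illustrative)
base = "climatebert/distilroberta-base-climate-f"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical plain-text file, one paragraph/abstract per line
dataset = load_dataset("text", data_files={"train": "climate_papers.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator masks 15% of tokens at batch time for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="sciclimatebert-mlm",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```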

## 📊 Performance

Evaluated on ClimaBench, a benchmark suite for climate-focused NLP tasks:

| Metric         | Value   |
|----------------|---------|
| Macro F1 (avg) | 57.83   |
| Tasks won      | 0/7     |
| Avg. std. dev. | 0.01747 |
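
Macro F1 averages the per-class F1 scores so every class counts equally, regardless of its frequency. A toy illustration with scikit-learn (the labels are made-up values, not ClimaBench data):

```python
from sklearn.metrics import f1_score

# Per-class F1 is computed first, then averaged with equal weight
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
print(f1_score(y_true, y_pred, average="macro"))  # ≈ 0.822
```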

While based on ClimateBERT, this model is tuned to structured scientific input, making it well suited to downstream applications in climate science and research automation.

Climate performance model card:

|                                 | SciClimateBERT       |
|---------------------------------|----------------------|
| 1. Model publicly available?    | Yes                  |
| 2. Time to train final model    | 300 h                |
| 3. Time for all experiments     | 1,226 h (~51 days)   |
| 4. Power of GPU and CPU         | 0.250 kW + 0.013 kW  |
| 5. Location for computations    | Croatia              |
| 6. Energy mix at location       | 224.71 gCO₂eq/kWh    |
| 7. CO₂eq for final model        | 18 kg                |
| 8. CO₂eq for all experiments    | 74 kg                |
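
The emission figures follow from runtime × hardware power × grid carbon intensity. A quick back-of-the-envelope check against the table's own inputs (the small gaps versus the reported values are rounding):

```python
# Reproduce the table's CO2eq estimates from its own inputs
power_kw = 0.250 + 0.013          # GPU + CPU draw
grid_g_per_kwh = 224.71           # Croatian energy mix

print(300 * power_kw * grid_g_per_kwh / 1000)   # ≈ 17.7 kg (reported: 18 kg)
print(1226 * power_kw * grid_g_per_kwh / 1000)  # ≈ 72.5 kg (reported: 74 kg)
```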

## 🧪 Intended Uses

Use for:

- Scientific climate change text classification and extraction (see the fine-tuning sketch below)
- Knowledge base and graph construction in climate policy and research domains
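
For the classification use case, here is a minimal fine-tuning sketch. The CSV path, column names, and label count are placeholders, and `P0L3/sciclimatebert` is assumed to be this repository's id.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "P0L3/sciclimatebert"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical file with "text" and "label" columns
dataset = load_dataset("csv", data_files={"train": "train.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sciclimatebert-cls", num_train_epochs=3),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer),  # pad dynamically per batch
)
trainer.train()
```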

Not suitable for:

- Non-scientific general-purpose text
- Multilingual applications

Example:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/sciclimatebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Use the GPU if available (the pipeline expects a device index, -1 = CPU)
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the <mask> balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions with their scores
print(text)
print(10 * ">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```

Output:

```
The increase in greenhouse gas emissions has significantly affected the <mask> balance of the Earth.
>>>>>>>>>>
The increase in greenhouse gas ... affected the energy balance of the Earth. — 0.7897
The increase in greenhouse gas ... affected the radiation balance of the Earth. — 0.0522
The increase in greenhouse gas ... affected the mass balance of the Earth. — 0.0401
The increase in greenhouse gas ... affected the water balance of the Earth. — 0.0359
The increase in greenhouse gas ... affected the carbon balance of the Earth. — 0.0190
```

## ⚠️ Limitations

- May reflect scientific publication biases

## 🧾 Citation

If you use this model, please cite:

```bibtex
@article{poleksic_etal_2025,
  title   = {Climate Research Domain BERTs: Pretraining, Adaptation, and Evaluation},
  author  = {Poleksić, Andrija and Martinčić-Ipšić, Sanda},
  journal = {PREPRINT (Version 1)},
  year    = {2025},
  doi     = {10.21203/rs.3.rs-6644722/v1}
}
```