---
language:
- en
library_name: transformers
pipeline_tag: summarization
license: apache-2.0
tags:
- chemistry
- scientific-summarization
- distilbart
- abstractive
- tldr
- knowledge-graphs
datasets:
- Bocklitz-Lab/lit2vec-tldr-bart-dataset
model-index:
- name: lit2vec-tldr-bart
results:
- task:
name: Summarization
type: summarization
dataset:
name: Lit2Vec TL;DR Chemistry Dataset
type: Bocklitz-Lab/lit2vec-tldr-bart-dataset
split: test
size: 1001
metrics:
- type: rouge1
value: 56.11
- type: rouge2
value: 30.78
- type: rougeLsum
value: 45.43
---
# lit2vec-tldr-bart (DistilBART fine-tuned for chemistry TL;DRs)
**lit2vec-tldr-bart** is a DistilBART model fine-tuned on **19,992** CC-BY-licensed chemistry abstracts to produce **concise TL;DR-style summaries** aligned with methods → results → significance. It's designed for scientific **abstractive summarization**, **semantic indexing**, and **knowledge-graph population** in chemistry and related fields.
- **Base model:** `sshleifer/distilbart-cnn-12-6`
- **Training data:** [`Bocklitz-Lab/lit2vec-tldr-bart-dataset`](https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset)
- **Max input length:** 1024 tokens
- **Target length:** ~128 tokens
---
## Evaluation (held-out test)
| Split | ROUGE-1 | ROUGE-2 | ROUGE-Lsum |
|------:|--------:|--------:|-----------:|
| Test | **56.11** | **30.78** | **45.43** |
> Validation ROUGE-Lsum: 46.05
> Metrics computed with `evaluate`'s `rouge` (NLTK sentence segmentation, `use_stemmer=True`).
---
## Quickstart
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

repo = "Bocklitz-Lab/lit2vec-tldr-bart"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)
gen = GenerationConfig.from_pretrained(repo)  # default decoding parameters saved with the model

text = "Proton exchange membrane fuel cells convert chemical energy into electricity..."
inputs = tok(text, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, generation_config=gen)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```
### Batch inference (PyTorch)
```python
texts = [
"Abstract 1 ...",
"Abstract 2 ...",
]
batch = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=1024)
out = model.generate(**batch, generation_config=gen)
summaries = tok.batch_decode(out, skip_special_tokens=True)
```
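For quick experiments, the high-level `pipeline` API should work as well. A minimal sketch, reusing `text` from the Quickstart; the decoding defaults saved with the model are applied automatically:

```python
from transformers import pipeline

# The summarization pipeline loads the tokenizer, model, and generation config together
summarizer = pipeline("summarization", model="Bocklitz-Lab/lit2vec-tldr-bart")
result = summarizer(text, truncation=True)  # truncate inputs longer than the 1024-token limit
print(result[0]["summary_text"])
```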
---
## Default decoding (saved in `generation_config.json`)
These are the defaults saved with the model (you can override at `generate()` time):
```json
{
"max_length": 142,
"min_length": 56,
"early_stopping": true,
"num_beams": 4,
"length_penalty": 2.0,
"no_repeat_ngram_size": 3,
"forced_bos_token_id": 0,
"forced_eos_token_id": 2
}
```
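For example, individual defaults can be overridden per call without editing the file (reusing `model`, `tok`, `gen`, and `inputs` from the Quickstart):

```python
# Arguments passed directly to generate() take precedence over generation_config.json
summary_ids = model.generate(
    **inputs,
    generation_config=gen,
    num_beams=2,    # override the saved num_beams=4 for faster decoding
    max_length=96,  # override the saved max_length=142 for shorter summaries
)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```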
---
## Training details
* **Base:** `sshleifer/distilbart-cnn-12-6` (Distilled BART)
* **Data:** 19,992 CC-BY chemistry abstracts with TL;DR summaries
* **Splits:** train=17,992 / val=999 / test=1,001
* **Max lengths:** input 1024, target 128
* **Optimizer:** AdamW, **lr=2e-5**
* **Batching:** per-device train/eval batch size 4, `gradient_accumulation_steps=4` (effective batch size 16)
* **Epochs:** 5
* **Precision:** fp16 (when CUDA available)
* **Hardware:** single NVIDIA RTX 3090
* **Seed:** 42
* **Libraries:** π€ Transformers + Datasets, `evaluate` for ROUGE, NLTK for sentence splitting
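
The hyperparameters above map onto a standard `Seq2SeqTrainer` setup roughly as follows. This is an illustrative sketch, not the exact training script; in particular, the dataset column names (`abstract`, `summary`) and split names are assumptions that may need adapting to the dataset schema.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments, Seq2SeqTrainer,
)

base = "sshleifer/distilbart-cnn-12-6"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)
ds = load_dataset("Bocklitz-Lab/lit2vec-tldr-bart-dataset")

def preprocess(batch):
    # Column names are assumptions; truncate inputs at 1024 tokens, targets at 128
    model_inputs = tok(batch["abstract"], max_length=1024, truncation=True)
    labels = tok(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="lit2vec-tldr-bart",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=5,
    fp16=True,  # enabled when CUDA is available
    seed=42,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()
```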
---
## Intended use
* TL;DR abstractive summaries for **chemistry** and adjacent domains (materials science, chemical engineering, environmental science).
* **Semantic indexing**, **IR reranking**, and **knowledge graph** ingestion where concise method/result statements are helpful (see the sketch below).
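
As a sketch of that ingestion path (purely illustrative; the document IDs and field names below are hypothetical), summaries can be written out as JSONL records for an index or knowledge graph, reusing `tok`, `model`, and `gen` from the Quickstart:

```python
import json

docs = {
    "doc-001": "Abstract 1 ...",  # hypothetical IDs and abstracts
    "doc-002": "Abstract 2 ...",
}

with open("tldr_records.jsonl", "w") as f:
    for doc_id, abstract in docs.items():
        inputs = tok(abstract, return_tensors="pt", truncation=True, max_length=1024)
        ids = model.generate(**inputs, generation_config=gen)
        tldr = tok.decode(ids[0], skip_special_tokens=True)
        f.write(json.dumps({"id": doc_id, "tldr": tldr}) + "\n")
```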
### Limitations & risks
* May **hallucinate** details not present in the abstract (typical for abstractive models).
* Not a substitute for expert judgment; avoid using summaries as sole evidence for scientific claims.
* Trained on CC-BY English abstracts; performance may degrade on other domains/languages.
---
## Files
This repo should include:
* `config.json`, `pytorch_model.bin` or `model.safetensors`
* `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, merges/vocab as applicable
* `generation_config.json` (decoding defaults)
---
## Reproducibility
* Dataset: [`Bocklitz-Lab/lit2vec-tldr-bart-dataset`](https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset)
* Recommended preprocessing: truncate inputs at 1024 tokens; targets at 128.
* ROUGE evaluation: `evaluate.load("rouge")`, NLTK sentence tokenization, `use_stemmer=True`.
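
A minimal sketch of that evaluation, assuming `predictions` and `references` are lists of summary strings (scores are scaled to percentages to match the table above):

```python
import evaluate
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer used before ROUGE-Lsum
rouge = evaluate.load("rouge")

def split_sentences(texts):
    # ROUGE-Lsum expects sentences separated by newlines
    return ["\n".join(nltk.sent_tokenize(t.strip())) for t in texts]

predictions = ["The study reports improved catalyst stability ..."]  # model outputs (placeholders)
references = ["This work demonstrates a more stable catalyst ..."]   # gold TL;DRs (placeholders)

scores = rouge.compute(
    predictions=split_sentences(predictions),
    references=split_sentences(references),
    use_stemmer=True,
)
print({k: round(v * 100, 2) for k, v in scores.items()})
```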
---
## Citation
If you use this model or dataset, please cite:
```bibtex
@software{lit2vec_tldr_bart_2025,
title = {lit2vec-tldr-bart: DistilBART fine-tuned for chemistry TL;DR summarization},
author = {Bocklitz Lab},
year = {2025},
url = {https://huggingface.co/Bocklitz-Lab/lit2vec-tldr-bart},
note = {Model trained on CC-BY chemistry abstracts; dataset at Bocklitz-Lab/lit2vec-tldr-bart-dataset}
}
```
Dataset:
```bibtex
@dataset{lit2vec_tldr_dataset_2025,
title = {Lit2Vec TL;DR Chemistry Dataset},
author = {Bocklitz Lab},
year = {2025},
url = {https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset}
}
```
---
## License
* **Model weights & code:** Apache-2.0
* **Dataset:** CC BY 4.0 (attribution in per-record metadata)
---
## Acknowledgements
* Base model: DistilBART (`sshleifer/distilbart-cnn-12-6`)
* Licensing and OA links curated from publisher/aggregator sources; dataset restricted to **CC-BY** content.