---

language:
  - en
library_name: transformers
pipeline_tag: summarization
license: apache-2.0
tags:
  - chemistry
  - scientific-summarization
  - distilbart
  - abstractive
  - tldr
  - knowledge-graphs
datasets:
  - Bocklitz-Lab/lit2vec-tldr-bart-dataset
model-index:
- name: lit2vec-tldr-bart
  results:
  - task:
      name: Summarization
      type: summarization
    dataset:
      name: Lit2Vec TL;DR Chemistry Dataset
      type: Bocklitz-Lab/lit2vec-tldr-bart-dataset
      split: test
      size: 1001
    metrics:
      - type: rouge1
        value: 56.11
      - type: rouge2
        value: 30.78
      - type: rougeLsum
        value: 45.43
---


# lit2vec-tldr-bart (DistilBART fine-tuned for chemistry TL;DRs)

**lit2vec-tldr-bart** is a DistilBART model fine-tuned on **19,992** CC-BY licensed chemistry abstracts to produce **concise TL;DR-style summaries** aligned with methods → results → significance. It's designed for scientific **abstractive summarization**, **semantic indexing**, and **knowledge-graph population** in chemistry and related fields.

- **Base model:** `sshleifer/distilbart-cnn-12-6`
- **Training data:** [`Bocklitz-Lab/lit2vec-tldr-bart-dataset`](https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset)
- **Max input length:** 1024 tokens  
- **Target length:** ~128 tokens

---

## 🧪 Evaluation (held-out test)

| Split | ROUGE-1 | ROUGE-2 | ROUGE-Lsum |
|------:|--------:|--------:|-----------:|
| Test  | **56.11** | **30.78** | **45.43** |

> Validation ROUGE-Lsum: 46.05  
> Metrics computed with `evaluate`'s `rouge` (NLTK sentence segmentation, `use_stemmer=True`).



---



## 🚀 Quickstart



```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

repo = "Bocklitz-Lab/lit2vec-tldr-bart"

tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSeq2SeqLM.from_pretrained(repo)

gen = GenerationConfig.from_pretrained(repo)  # loads the default decoding params

text = "Proton exchange membrane fuel cells convert chemical energy into electricity..."
inputs = tok(text, return_tensors="pt", truncation=True, max_length=1024)

summary_ids = model.generate(**inputs, generation_config=gen)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```



### Batch inference (PyTorch)



```python
texts = [
    "Abstract 1 ...",
    "Abstract 2 ...",
]

batch = tok(texts, return_tensors="pt", padding=True, truncation=True, max_length=1024)

out = model.generate(**batch, generation_config=gen)
summaries = tok.batch_decode(out, skip_special_tokens=True)
```



---



## 🔧 Default decoding (saved in `generation_config.json`)

These are the defaults saved with the model; any of them can be overridden at `generate()` time (see the example after the JSON block):

```json
{
  "max_length": 142,
  "min_length": 56,
  "early_stopping": true,
  "num_beams": 4,
  "length_penalty": 2.0,
  "no_repeat_ngram_size": 3,
  "forced_bos_token_id": 0,
  "forced_eos_token_id": 2
}
```
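
For example, reusing `model`, `tok`, and `inputs` from the Quickstart, a single call can trade quality for speed; the values below are illustrative, not tuned:

```python
# Override the saved defaults for one generate() call only.
summary_ids = model.generate(
    **inputs,
    num_beams=2,        # fewer beams -> faster decoding
    max_length=96,      # shorter than the saved default of 142
    length_penalty=1.0,
)
print(tok.decode(summary_ids[0], skip_special_tokens=True))
```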

---

## 📊 Training details

* **Base:** `sshleifer/distilbart-cnn-12-6` (Distilled BART)
* **Data:** 19,992 CC-BY chemistry abstracts with TL;DR summaries
* **Splits:** train=17,992 / val=999 / test=1,001
* **Max lengths:** input 1024, target 128
* **Optimizer:** AdamW, **lr=2e-5**
* **Batching:** per-device train/eval batch size 4, `gradient_accumulation_steps=4` (effective batch size 16)
* **Epochs:** 5
* **Precision:** fp16 (when CUDA available)
* **Hardware:** single NVIDIA RTX 3090
* **Seed:** 42
* **Libraries:** 🤗 Transformers + Datasets, `evaluate` for ROUGE, NLTK for sentence splitting (a training-arguments sketch follows below)
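
The hyperparameters above map roughly to the following `Seq2SeqTrainingArguments` sketch. Only the values listed in the bullets are documented; the output path and `predict_with_generate` flag are assumptions, not a verbatim copy of the training script:

```python
from transformers import Seq2SeqTrainingArguments

# Sketch of the training configuration; only the documented values are known.
training_args = Seq2SeqTrainingArguments(
    output_dir="lit2vec-tldr-bart",   # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size 16
    num_train_epochs=5,
    fp16=True,                        # when CUDA is available
    predict_with_generate=True,       # assumed, needed for ROUGE during eval
    seed=42,
)
```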

---

## ✅ Intended use

* TL;DR abstractive summaries for **chemistry** and adjacent domains (materials science, chemical engineering, environmental science).
* **Semantic indexing**, **IR reranking**, and **knowledge graph** ingestion where concise method/result statements are helpful.

### Limitations & risks

* May **hallucinate** details not present in the abstract (typical for abstractive models).
* Not a substitute for expert judgment; avoid using summaries as sole evidence for scientific claims.
* Trained on CC-BY English abstracts; performance may degrade on other domains/languages.

---

## 📦 Files

This repo should include:

* `config.json`, `pytorch_model.bin` or `model.safetensors`
* `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, merges/vocab as applicable
* `generation_config.json` (decoding defaults)

---

## πŸ” Reproducibility

* Dataset: [`Bocklitz-Lab/lit2vec-tldr-bart-dataset`](https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset)
* Recommended preprocessing: truncate inputs at 1024 tokens; targets at 128.
* ROUGE evaluation: `evaluate.load("rouge")`, NLTK sentence tokenization, `use_stemmer=True` (see the sketch below).
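
A minimal sketch of the scoring step, assuming predictions and references are held as lists of strings; newline-joining NLTK sentence splits before scoring mirrors the usual ROUGE-Lsum post-processing:

```python
import evaluate
import nltk

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # needed for NLTK >= 3.9
rouge = evaluate.load("rouge")

def sent_split(texts):
    # Join sentences with newlines so rougeLsum is computed per sentence.
    return ["\n".join(nltk.sent_tokenize(t)) for t in texts]

predictions = ["Model summary 1 ...", "Model summary 2 ..."]  # placeholder outputs
references = ["Gold TL;DR 1 ...", "Gold TL;DR 2 ..."]         # placeholder targets

scores = rouge.compute(
    predictions=sent_split(predictions),
    references=sent_split(references),
    use_stemmer=True,
)
print({k: round(v * 100, 2) for k, v in scores.items()})
```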

---

## 📚 Citation

If you use this model or dataset, please cite:

```bibtex
@software{lit2vec_tldr_bart_2025,
  title  = {lit2vec-tldr-bart: DistilBART fine-tuned for chemistry TL;DR summarization},
  author = {Bocklitz Lab},
  year   = {2025},
  url    = {https://huggingface.co/Bocklitz-Lab/lit2vec-tldr-bart},
  note   = {Model trained on CC-BY chemistry abstracts; dataset at Bocklitz-Lab/lit2vec-tldr-bart-dataset}
}
```

Dataset:

```bibtex
@dataset{lit2vec_tldr_dataset_2025,
  title  = {Lit2Vec TL;DR Chemistry Dataset},
  author = {Bocklitz Lab},
  year   = {2025},
  url    = {https://huggingface.co/datasets/Bocklitz-Lab/lit2vec-tldr-bart-dataset}
}
```

---

## πŸ“ License

* **Model weights & code:** Apache-2.0
* **Dataset:** CC BY 4.0 (attribution in per-record metadata)

---

## 🙌 Acknowledgements

* Base model: DistilBART (`sshleifer/distilbart-cnn-12-6`)
* Licensing and OA links curated from publisher/aggregator sources; dataset restricted to **CC-BY** content.