NikG100 committed
Commit a21a519 · verified · 1 Parent(s): 73587b0

Upload 4 files

Files changed (4)
  1. README.md +80 -0
  2. special_tokens_map.json +51 -0
  3. tokenizer_config.json +57 -0
  4. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,80 @@
# BART-Based Text Summarization Model for News Aggregation

This repository hosts a BART transformer model fine-tuned for abstractive text summarization of news articles. It is designed to condense lengthy news reports into concise, informative summaries, enhancing user experience for news readers and aggregators.

## Model Details

- **Model Architecture:** BART (Facebook's BART-base)
- **Task:** Abstractive Text Summarization
- **Domain:** News Articles
- **Dataset:** Reddit-TIFU (Hugging Face Datasets)
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```bash
pip install datasets transformers rouge-score evaluate
```

### Loading the Model

```python
from transformers import BartTokenizer, BartForConditionalGeneration
import torch

# Load tokenizer and model; "facebook/bart-base" is the base checkpoint, so point
# from_pretrained at this repository's files instead to use the fine-tuned weights.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name).to(device)
```
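
A minimal usage sketch for generating a summary with the loaded model; the sample article and the generation settings (beam count, length limits) are illustrative assumptions rather than values prescribed by this repository.

```python
# Illustrative example: the article text and generation hyperparameters are
# assumptions, not settings shipped with this model.
article = (
    "The city council approved a new transit plan on Tuesday, allocating funds "
    "for additional bus routes and a light-rail extension expected to open "
    "within five years."
)

inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)
summary_ids = model.generate(
    **inputs,
    max_length=128,      # cap the summary length
    num_beams=4,         # beam search for more fluent summaries
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```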

## Performance Metrics

- **ROUGE-1:** 25.50
- **ROUGE-2:** 7.86
- **ROUGE-L:** 20.64
- **ROUGE-Lsum:** 21.18
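
A rough, hedged sketch of how ROUGE scores like these can be computed with the `evaluate` package from the installation step (the prediction and reference strings below are placeholders, not data from this repository):

```python
import evaluate

# Placeholder predictions/references; substitute model outputs and gold summaries.
rouge = evaluate.load("rouge")
predictions = ["the council approved a transit plan adding bus routes"]
references = ["city council approves a transit plan with new bus routes and light rail"]

# Returns rouge1 / rouge2 / rougeL / rougeLsum in the 0-1 range;
# multiply by 100 to compare with the scores listed above.
print(rouge.compute(predictions=predictions, references=references))
```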

## Fine-Tuning Details

### Dataset

The dataset is sourced from Hugging Face’s Reddit-TIFU dataset, which contains 79,000 Reddit posts and their summaries.
The original training and testing sets were merged, shuffled, and re-split using a 90/10 ratio.
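
A hedged sketch of that preparation with the `datasets` library (the `reddit_tifu` dataset name, `long` configuration, and shuffle seed are assumptions about the original setup):

```python
from datasets import load_dataset, concatenate_datasets

# Assumed configuration; depending on the datasets version, loading this
# script-based dataset may additionally require trust_remote_code=True.
raw = load_dataset("reddit_tifu", "long")

# Merge all available splits, shuffle, and re-split 90/10.
merged = concatenate_datasets([raw[split] for split in raw.keys()])
splits = merged.shuffle(seed=42).train_test_split(test_size=0.1)
train_ds, test_ds = splits["train"], splits["test"]
print(len(train_ds), len(test_ds))
```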

### Training Configuration

- **Epochs:** 3
- **Batch Size:** 8
- **Learning Rate:** 2e-5
- **Evaluation Strategy:** epoch
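
A hedged sketch of how these settings map onto Hugging Face's `TrainingArguments` and `Trainer` (the output directory is a hypothetical name, and `train_ds` / `test_ds` stand for tokenized versions of the splits from the dataset sketch above):

```python
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="bart-summarizer",        # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",         # renamed eval_strategy in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,              # tokenized training split
    eval_dataset=test_ds,                # tokenized evaluation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
# trainer.train()
```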

### Quantization

Post-training quantization was applied using PyTorch's built-in quantization framework to reduce the model size and improve inference efficiency.
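
A minimal sketch of one way this can be done with PyTorch's dynamic quantization API, which quantizes the model's linear layers to int8; the exact scheme used for this checkpoint is not documented here, so treat the snippet as an assumption:

```python
import torch

# Dynamic post-training quantization of the Linear layers to int8.
# Dynamic quantization runs on CPU, hence the .cpu() call.
quantized_model = torch.quantization.quantize_dynamic(
    model.cpu(),
    {torch.nn.Linear},
    dtype=torch.qint8,
)

# quantized_model can then be used for inference like the original model.
```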

## Repository Structure

```
.
├── config.json
├── tokenizer_config.json
├── special_tokens_map.json
├── tokenizer.json
├── model.safetensors        # Fine-tuned model weights
├── README.md                # Model documentation
```

## Limitations

- The model may not generalize well to domains outside the fine-tuning dataset.

- Quantization may result in minor accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "50264": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "errors": "replace",
  "extra_special_tokens": {},
  "mask_token": "<mask>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "BartTokenizer",
  "unk_token": "<unk>"
}
vocab.json ADDED
The diff for this file is too large to render.