|
# T5-Small Transformer Model for News Text Summarization |
|
|
|
This repository hosts a fine-tuned version of the T5-small Transformer model for abstractive text summarization. Trained on the CNN-DailyMail News dataset, this model generates concise and meaningful summaries from long-form news articles. It is well-suited for applications like news digest creation, content summarization engines, and information extraction systems. |
|
|
|
## Model Details |
|
|
|
- **Model Architecture:** T5-small Transformer |
|
- **Task:** Abstractive Text Summarization |
|
- **Dataset:** CNN-DailyMail News Text Summarization Dataset |
|
- **Fine-tuning Framework:** Hugging Face Transformers |
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
```sh |
|
pip install transformers torch |
|
``` |
|
|
|
### Loading the Model |
|
|
|
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load the fine-tuned model and the base T5 tokenizer
model_name = "AventIQ-AI/t5-small-news-text-summarization"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Set model to evaluation mode
model.eval()

# Example input
article_text = """
NASA's Perseverance rover has successfully collected samples from Mars that may contain signs of ancient microbial life.
These samples will eventually be returned to Earth as part of an ambitious mission involving NASA and the European Space Agency.
"""

# Preprocess input: T5 expects the "summarize: " task prefix
input_text = "summarize: " + article_text.strip().replace("\n", " ")
inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate summary with beam search
with torch.no_grad():
    summary_ids = model.generate(
        inputs["input_ids"],
        num_beams=4,
        length_penalty=2.0,
        max_length=150,
        early_stopping=True,
    )

# Decode and print summary
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(f"Summary:\n{summary}")
```
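
### Running on a GPU (Optional)

If a GPU is available, inference is faster when the model and inputs live on the same device. Below is a minimal sketch that wraps the steps above in a hypothetical `summarize` helper (the helper is illustrative, not part of this repository):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Pick a GPU if one is available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = T5ForConditionalGeneration.from_pretrained(
    "AventIQ-AI/t5-small-news-text-summarization"
).to(device)
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model.eval()

def summarize(text: str, max_length: int = 150) -> str:
    """Hypothetical helper using the same generation settings as above."""
    input_text = "summarize: " + text.strip().replace("\n", " ")
    inputs = tokenizer(input_text, return_tensors="pt",
                       max_length=512, truncation=True).to(device)
    with torch.no_grad():
        summary_ids = model.generate(
            inputs["input_ids"],
            num_beams=4,
            length_penalty=2.0,
            max_length=max_length,
            early_stopping=True,
        )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```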
|
|
|
## Performance Metrics |
|
|
|
- **ROUGE-L Score:** 0.35 (on CNN-DailyMail validation set) |
|
- **BLEU Score:** 0.27 |
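
These numbers can be sanity-checked with the Hugging Face `evaluate` library (`pip install evaluate rouge_score datasets`). A minimal sketch, assuming the `summarize` helper from the Usage section; exact scores will vary with generation settings and the slice evaluated:

```python
import evaluate
from datasets import load_dataset

# Small validation slice for a quick, illustrative check (not the full set)
dataset = load_dataset("cnn_dailymail", "3.0.0", split="validation[:100]")

predictions = [summarize(article) for article in dataset["article"]]
references = dataset["highlights"]

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeL"])
```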
|
|
|
## Fine-Tuning Details |
|
|
|
### Dataset |
|
|
|
The model was fine-tuned on the [CNN-DailyMail News dataset](https://huggingface.co/datasets/cnn_dailymail), which contains pairs of news articles and human-written summaries. |
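
The dataset can be loaded directly with the `datasets` library; the `3.0.0` configuration is assumed here, as it is the one commonly used for summarization:

```python
from datasets import load_dataset

# Each example pairs a full news article with human-written "highlights"
dataset = load_dataset("cnn_dailymail", "3.0.0")
print(dataset)  # train / validation / test splits

example = dataset["train"][0]
print(example["article"][:300])
print(example["highlights"])
```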
|
|
|
### Training |
|
|
|
- Number of epochs: 4 |
|
- Batch size: 16 |
|
- Evaluation strategy: epoch |
|
- Learning rate: 3e-4 |
|
- Optimizer: AdamW |
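
The exact training script is not included in this repository; the sketch below shows one way these hyperparameters could map onto `Seq2SeqTrainingArguments` (the preprocessing and column names follow the CNN-DailyMail schema, and AdamW is the `Trainer` default optimizer):

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    T5ForConditionalGeneration,
    T5Tokenizer,
)

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
dataset = load_dataset("cnn_dailymail", "3.0.0")

def preprocess(batch):
    # Prefix source articles and tokenize the "highlights" as labels
    inputs = tokenizer(["summarize: " + a for a in batch["article"]],
                       max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["highlights"],
                       max_length=150, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

args = Seq2SeqTrainingArguments(
    output_dir="t5-small-news-summarization",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=3e-4,
    evaluation_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```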
|
|
|
## Repository Structure |
|
|
|
```
.
├── model/              # Fine-tuned model files
├── tokenizer_config/   # Tokenizer configuration and vocab files
├── model.safetensors   # Model checkpoint (optional)
└── README.md           # Model documentation
```
|
|
|
## Limitations |
|
|
|
- The model may struggle with extremely technical or domain-specific texts outside the news genre. |
|
- Summaries may occasionally lose factual accuracy in favor of fluency and brevity. |
|
|
|
## Contributing |
|
|
|
Contributions are welcome! Feel free to open an issue or submit a pull request with suggestions, improvements, or bug fixes. |
|
|