tags:
- politics
- summarization
- climate change
- political party
- press release
- political communication
- European Union
- Speech
license: afl-3.0
language:
- en
- es
- da
- de
- it
- fr
- nl
- pl
Text Summarization
The model used in this summarization task is a T5 transformer-based language model fine-tuned for abstractive summarization.
This model is intended to summarize political texts. It generates summaries by treating text summarization as a text-to-text problem, where both the input and the output are sequences of text.
The model was fine-tuned on 10k political party press releases from 66 parties in 12 different countries, each paired with an abstractive summary.
Model Details
Pretrained Model: The model uses a pretrained tokenizer and model from the Hugging Face transformers library (e.g., T5ForConditionalGeneration).
Tokenization: Text is tokenized using a subword tokenizer, where long words are split into smaller, meaningful subwords.
Input Processing: The model processes the input sequence by truncating or padding the text to fit within the max_input_length of 512 tokens.
Output Generation: The model generates the summary through a text generation process using beam search with a beam width of 10 to explore multiple possible summary sequences at each step.
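A minimal loading and input-processing sketch using the Hugging Face transformers API, matching the steps described above. The checkpoint name below is a placeholder, not the actual model ID, and the example text is hypothetical.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "user/press-release-t5"  # placeholder, not the real model ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

press_release = "Full text of a party press release ..."

# Subword tokenization with truncation/padding to the 512-token input limit.
inputs = tokenizer(
    press_release,
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
```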
Key Parameters:
Max Input Length: 512 tokens, ensuring the input text is truncated or padded to fit within the model's processing capacity.
Max Target Length: 128 tokens, restricting the length of the generated summary and balancing concise output against content preservation.
Beam Search: Uses a beam width of 10 to explore multiple candidate sequences during generation, helping the model choose the most probable summary.
Early Stopping: The generation process stops early if the model predicts the end of the sequence before reaching the maximum target length.
Generation Process:
Input Tokenization: The input text is tokenized into subword units and passed into the model.
Beam Search: The model generates the next token by considering the top 10 possible sequences at each step, aiming to find the most probable summary sequence.
Output Decoding: The generated summary is decoded from token IDs back into human-readable text using the tokenizer, skipping special tokens like padding or end-of-sequence markers.
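Continuing the loading sketch above, the following shows the generation and decoding steps; the parameter values mirror those listed under Key Parameters.

```python
# Generate the summary with beam search, then decode it back to text.
summary_ids = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_length=128,       # max target length
    num_beams=10,         # beam width
    early_stopping=True,  # stop once all beams predict end-of-sequence
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```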
Training Details
The summarization model was trained on a dataset of press releases scraped from various party websites. These press releases were selected to represent diverse political perspectives and topics, ensuring that the model learned to generate summaries across a wide range of political content.
Data Collection:
Source: Press releases from official party websites, which often contain detailed statements, policy announcements, and responses to current events. These documents were chosen because of their structured format and consistent language use.
Preprocessing: The scraped text was cleaned and preprocessed by removing extraneous HTML tags and irrelevant information and ensuring that the text content was well-formatted for model training.
Text Format: The press releases were processed into suitable text pairs: the original full text as the input and a human-crafted summary (if available) or a custom summary generated by the developers as the target output.
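The exact scraping and cleaning pipeline is not published here; the sketch below only illustrates the kind of preprocessing described (HTML removal, whitespace normalization, and pairing each release with its target summary). The helper name and example records are hypothetical.

```python
import re
from bs4 import BeautifulSoup

def clean_press_release(raw_html: str) -> str:
    """Remove HTML tags and collapse whitespace so the text is model-ready."""
    text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical scraped records: full release text plus a reference summary.
records = [
    {"html": "<p>The party today announced ...</p>", "summary": "Party announces ..."},
]

# Input/target pairs in the format used for fine-tuning.
pairs = [
    {"document": clean_press_release(r["html"]), "summary": r["summary"]}
    for r in records
]
```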
Training Objective:
The model was fine-tuned using these press releases to learn the task of abstractive summarization: generating concise, fluent summaries of longer political texts.
The model was trained to capture key information and context, while avoiding irrelevant details, ensuring that it could produce summaries that accurately reflect the essence of each release.
Training Strategy:
Supervised Learning: The model was trained using supervised learning, where each input (press release) was paired with a corresponding summary.
Optimization: During training, the model's parameters were adjusted using gradient descent and the cross-entropy loss function.
This training process allowed the model to learn not only the specific language patterns commonly found in political press releases but also the broader context of political discourse.
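A minimal sketch of one supervised fine-tuning step under these choices, assuming PyTorch, a t5-base starting checkpoint, and a single document/summary pair; the actual training setup (optimizer settings, batching, number of epochs) is not specified in this card.

```python
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")  # assumed base checkpoint
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

document = "Full text of a press release ..."
summary = "Reference summary of the release."

inputs = tokenizer(
    document, max_length=512, truncation=True, padding="max_length", return_tensors="pt"
)
labels = tokenizer(text_target=summary, max_length=128, truncation=True, return_tensors="pt").input_ids

# The model computes token-level cross-entropy loss when labels are provided.
outputs = model(
    input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, labels=labels
)
outputs.loss.backward()  # gradient step on the cross-entropy objective
optimizer.step()
optimizer.zero_grad()
```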
Citation:
@article{dickson2024going,
  title={Going against the grain: Climate change as a wedge issue for the radical right},
  author={Dickson, Zachary P and Hobolt, Sara B},
  journal={Comparative Political Studies},
  year={2024},
  publisher={SAGE Publications Sage CA: Los Angeles, CA}
}