Mile-stone-3

kya5 committed on Apr 28, 2023

Commit

9047480

0 Parent(s):

Duplicate from kya5/milestone-3

Files changed (20)

.gitattributes +5 -0
.github/workflows/main.yml +0 -0
.github/workflows/sync_to_hf.yml +20 -0
README.md +110 -0
app.py +72 -0
bert/_bert_model/config.json +44 -0
bert/_bert_model/pytorch_model.bin +3 -0
bert/_bert_model/training_args.bin +0 -0
distilbert/_distilbert_model/config.json +41 -0
distilbert/_distilbert_model/pytorch_model.bin +3 -0
distilbert/_distilbert_model/training_args.bin +0 -0
jigsaw-toxic-comment-classification-challenge/sample_submission.csv +0 -0
jigsaw-toxic-comment-classification-challenge/test.csv +3 -0
jigsaw-toxic-comment-classification-challenge/test_labels.csv +0 -0
jigsaw-toxic-comment-classification-challenge/train.csv +3 -0
requirements.txt +5 -0
roberta/_roberta_model/config.json +43 -0
roberta/_roberta_model/pytorch_model.bin +3 -0
roberta/_roberta_model/training_args.bin +0 -0
train.py +156 -0

.gitattributes ADDED Viewed

	@@ -0,0 +1,5 @@

+bert/_bert_model/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
+distilbert/_distilbert_model/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
+roberta/_roberta_model/pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
+jigsaw-toxic-comment-classification-challenge/test.csv filter=lfs diff=lfs merge=lfs -text
+jigsaw-toxic-comment-classification-challenge/train.csv filter=lfs diff=lfs merge=lfs -text

.github/workflows/main.yml ADDED Viewed

File without changes

.github/workflows/sync_to_hf.yml ADDED Viewed

	@@ -0,0 +1,20 @@

+name: Sync to Hugging Face hub
+on:
+  push:
+    branches: [main]
+  # to run this workflow manually from the Actions tab
+  workflow_dispatch:
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push --force https://jjmakes:$HF_TOKEN@huggingface.co/spaces/jjmakes/cs482-toxic-tweets main
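
The push step authenticates by placing the token in the remote URL as the basic-auth password. The sketch below shows how such a URL could be assembled and redacted before logging. It is a minimal illustration, not part of the workflow: `build_push_url`, `redact`, and the fallback token string are hypothetical names and values introduced here.

```python
import os
from urllib.parse import quote


def build_push_url(user: str, token: str, space: str) -> str:
    """Assemble an HTTPS remote with the token as the basic-auth password."""
    return (
        f"https://{quote(user, safe='')}:{quote(token, safe='')}"
        f"@huggingface.co/spaces/{space}"
    )


def redact(url: str, token: str) -> str:
    """Replace the (URL-encoded) token so the URL can be logged safely."""
    return url.replace(quote(token, safe=""), "***")


# Placeholder fallback for illustration only; a real run reads the secret from the environment.
token = os.environ.get("HF_TOKEN", "hf_example_token")
url = build_push_url("jjmakes", token, "jjmakes/cs482-toxic-tweets")
print(redact(url, token))
```

Keeping the token in a secret and out of logs matters here because the URL is passed on the command line of `git push`.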

README.md ADDED Viewed

	@@ -0,0 +1,110 @@

+---
+title: Cs482 Toxic Tweets
+emoji: ⚡
+colorFrom: green
+colorTo: green
+sdk: streamlit
+sdk_version: 1.17.0
+app_file: app.py
+pinned: false
+duplicated_from: kya5/milestone-3
+---
+# Finetuning Language Models - Toxic Tweets
+[![Sync to Hugging Face hub](https://github.com/jjmakes/cs482-project/actions/workflows/sync_to_hf.yml/badge.svg)](https://github.com/jjmakes/cs482-project/actions/workflows/sync_to_hf.yml)
+## [See the deployed App on HuggingFace](https://huggingface.co/spaces/jjmakes/cs482-toxic-tweets)
+CS 482 Project - [Instructions](https://pantelis.github.io/data-mining/aiml-common/projects/nlp/finetuning-language-models-tweets/index.html)
+## Milestone 1 - Development Environment
+## OS Version
+This project was created in Ubuntu 20.04. Thus, steps for installing and developing in Windows are not included.
+```
+Distributor ID: Ubuntu
+Description: Ubuntu 20.04.6 LTS
+Release: 20.04
+Codename: focal
+```
+## Docker Installation
+The instructions below will help install Docker on Ubuntu version 20.04.6
+```
+## Update list of existing packages
+sudo apt update
+## Install prerequisite packages
+sudo apt install apt-transport-https ca-certificates curl software-properties-common
+## Add GPG key for the official Docker repository
+curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
+## Add the Docker repository to APT sources
+sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
+## Confirm that docker-ce will be installed from the Docker repo rather than the default Ubuntu repo
+apt-cache policy docker-ce
+## Install docker
+sudo apt install docker-ce
+## Check if docker is running
+sudo systemctl status docker
+## Allow the current user to run docker without sudo (takes effect after logging out and back in)
+sudo usermod -aG docker ${USER}
+```
+## VS Code Installation
+The instructions below will help install VS Code on Ubuntu version 20.04.6
+[Download the VS Code .deb package (64 bit)](https://code.visualstudio.com/download)
+```
+## Navigate to downloads folder
+cd ~/Downloads
+## Install VS Code (replace <file> with the downloaded package)
+sudo apt install ./<file>.deb
+```
+## Creating a development environment with docker
+[Quick Start Development Container](https://code.visualstudio.com/docs/devcontainers/containers#_quick-start-try-a-development-container)
+1. **F1**, _Dev Containers: Open Folder in Container..._
+2. Select starting image
+Some notable images worth using are:
+- Alpine: Barebones Linux OS
+- Python3: Container for developing Python 3 Applications
+![](./milestone-1.png)
+## Milestone 2
+App is deployed to [HuggingFace](https://huggingface.co/spaces/jjmakes/cs482-toxic-tweets) via GitHub actions following [instructions provided in this tutorial](https://www.youtube.com/watch?v=8hOzsFETm4I). HuggingFace provides documentation for performing [sentiment analysis with python](https://huggingface.co/blog/sentiment-analysis-python).
+### Testing with Streamlit Locally
+To test with streamlit, install the project dependencies locally with:
+```
+pip3 install -r requirements.txt
+```
+To run the project, use:
+```
+streamlit run app.py --server.port 8888
+```
+The page can be set to hot-reload by selecting `Always Rerun` after a change is made.
+Models used are pretrained and provided by [HuggingFace](https://huggingface.co/models?pipeline_tag=text-classification&sort=likes&search=sentiment).

app.py ADDED Viewed

	@@ -0,0 +1,72 @@

+import streamlit as st
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+import torch
+import pandas as pd
+import random
+classifiers = ['toxic', 'severe_toxic', 'obscene',
+               'threat', 'insult', 'identity_hate']
+def reset_scores():
+    global scores_df
+    scores_df = pd.DataFrame(columns=['Comment'] + classifiers)
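+def get_score(model_base, text):
+    # Map the selected option to its fine-tuned checkpoint and to the tokenizer used in train.py.
+    if model_base == "bert-base-cased":
+        model_dir = "./bert/_bert_model"
+        # This checkpoint was fine-tuned from vinai/bertweet-base (see its config.json),
+        # so it needs the BERTweet tokenizer, not the bert-base-cased one.
+        tokenizer_name = "vinai/bertweet-base"
+    elif model_base == "distilbert-base-cased":
+        model_dir = "./distilbert/_distilbert_model"
+        tokenizer_name = model_base
+    else:
+        model_dir = "./roberta/_roberta_model"
+        tokenizer_name = model_base
+    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
+    # train.py tokenizes with model_max_length=128, and BERTweet cannot index more than 128 tokens.
+    inputs = tokenizer.encode_plus(
+        text, max_length=128, truncation=True, padding=True, return_tensors='pt')
+    with torch.no_grad():  # inference only, so skip gradient tracking
+        outputs = model(**inputs)
+    # Multi-label head: one independent sigmoid per label
+    predictions = torch.sigmoid(outputs.logits)
+    return predictions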
+st.title("Toxic Comment Classifier")
+model_base = st.selectbox("Select a pretrained model",
+                          ["roberta-base", "bert-base-cased", "distilbert-base-cased"])
+text_input = st.text_input("Enter text for toxicity classification",
+                           "")
+submit_btn = st.button("Submit")
+if submit_btn and text_input:
+    result = get_score(model_base, text_input)
+    df = pd.DataFrame([result[0].tolist()], columns=classifiers)
+    df = df.round(2)  # Round the values to 2 decimal places
+    df = df.applymap(lambda x: '{:.0%}'.format(x))
+    st.table(df)
+test_df = pd.read_csv(
+    "./jigsaw-toxic-comment-classification-challenge/test.csv")
+sample_df = test_df.sample(n=3)
+reset_scores()
+for index, row in sample_df.iterrows():
+    result = get_score(model_base, row['comment_text'])
+    scores = result[0].tolist()
+    scores_df.loc[len(scores_df)] = [row['comment_text']] + scores
+scores_df = scores_df.round(2)
+st.subheader("Toxicity Scores for Random Comments")
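+if st.button("Refresh"):
+    # The click already reran the script and drew a new sample above;
+    # calling reset_scores() here would empty the table just before it is shown.
+    st.success("New tweets have been loaded!")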
+st.table(scores_df)
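
`get_score` applies a sigmoid to each logit because the six labels are not mutually exclusive. The sketch below contrasts that with a softmax using only the standard library. The logits are made-up values, not model output, and the 0.5 threshold is an assumption for illustration; the app itself only displays the scores.

```python
import math

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one comment (not real model output).
logits = [2.0, -3.0, 1.5, -4.0, 1.0, -2.5]

multi_label = {name: sigmoid(z) for name, z in zip(LABELS, logits)}
single_label = dict(zip(LABELS, softmax(logits)))

# Sigmoid scores are independent, so several labels can pass the threshold at once.
flagged = [name for name, p in multi_label.items() if p >= 0.5]
print(flagged)
# Softmax scores compete with each other and always sum to 1.
print(round(sum(single_label.values()), 6))
```

With a softmax, a comment that is both obscene and insulting would have its probability mass split between the two labels, which is why the sigmoid is the better fit for this dataset.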

bert/_bert_model/config.json ADDED Viewed

	@@ -0,0 +1,44 @@

+{
+  "_name_or_path": "vinai/bertweet-base",
+  "architectures": [
+    "RobertaForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "eos_token_id": 2,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1",
+    "2": "LABEL_2",
+    "3": "LABEL_3",
+    "4": "LABEL_4",
+    "5": "LABEL_5"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1,
+    "LABEL_2": 2,
+    "LABEL_3": 3,
+    "LABEL_4": 4,
+    "LABEL_5": 5
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 130,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "problem_type": "multi_label_classification",
+  "tokenizer_class": "BertweetTokenizer",
+  "transformers_version": "4.8.0",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 64001
+}
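
Note that `max_position_embeddings` is 130 here, not 512. RoBERTa-style models number positions starting at `pad_token_id + 1`, so the usable sequence length is smaller than the table size. The helper below is a sketch of that arithmetic; `usable_sequence_length` is a hypothetical name, and the dictionaries copy only the relevant fields from the three `config.json` files in this commit.

```python
def usable_sequence_length(config: dict) -> int:
    """Longest token sequence the position-embedding table can index."""
    max_positions = config["max_position_embeddings"]
    if config["model_type"] == "roberta":
        # Position ids run from pad_token_id + 1 upward, so the first
        # pad_token_id + 1 rows of the table are unavailable to real tokens.
        return max_positions - config["pad_token_id"] - 1
    return max_positions


bertweet = {"model_type": "roberta", "max_position_embeddings": 130, "pad_token_id": 1}
roberta = {"model_type": "roberta", "max_position_embeddings": 514, "pad_token_id": 1}
distilbert = {"model_type": "distilbert", "max_position_embeddings": 512, "pad_token_id": 0}

for name, cfg in [("bertweet", bertweet), ("roberta", roberta), ("distilbert", distilbert)]:
    print(name, usable_sequence_length(cfg))
```

Under this reading, inputs to the BERTweet-based checkpoint should be truncated to at most 128 tokens, which matches the `model_max_length=128` used in `train.py`.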

bert/_bert_model/pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c1c171ff9ebed4a7224889a84edd1ea084ed01f4bcda6c6a637bb1ed63d3d196
+size 539702389
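
The three lines above are a Git LFS pointer, not the model weights: the real file is stored separately and identified by its SHA-256 digest and byte size. A small parser, sketched below with the hypothetical name `parse_lfs_pointer`, shows how the fields break down.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Diff views prefix added lines with "+"; drop it if present.
        key, _, value = line.lstrip("+").partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:c1c171ff9ebed4a7224889a84edd1ea084ed01f4bcda6c6a637bb1ed63d3d196
size 539702389
"""
info = parse_lfs_pointer(pointer)
print(info["algorithm"], round(info["size_bytes"] / 1e6, 1), "MB")
```

If `from_pretrained` is pointed at a checkout where LFS files were not fetched, it would find this short text file instead of the weights, so checking the file size against `size_bytes` is a quick sanity test.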

bert/_bert_model/training_args.bin ADDED Viewed

Binary file (2.56 kB). View file

distilbert/_distilbert_model/config.json ADDED Viewed

	@@ -0,0 +1,41 @@

+{
+  "_name_or_path": "distilbert-base-cased",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1",
+    "2": "LABEL_2",
+    "3": "LABEL_3",
+    "4": "LABEL_4",
+    "5": "LABEL_5"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1,
+    "LABEL_2": 2,
+    "LABEL_3": 3,
+    "LABEL_4": 4,
+    "LABEL_5": 5
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "output_past": true,
+  "pad_token_id": 0,
+  "problem_type": "multi_label_classification",
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "transformers_version": "4.8.0",
+  "vocab_size": 28996
+}

distilbert/_distilbert_model/pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a4276639fc9c2f4f22680df4f17412ba1cf058f6e3a0b4f77a6df203cea934b9
+size 263185709

distilbert/_distilbert_model/training_args.bin ADDED Viewed

Binary file (2.56 kB). View file

jigsaw-toxic-comment-classification-challenge/sample_submission.csv ADDED Viewed

The diff for this file is too large to render. See raw diff

jigsaw-toxic-comment-classification-challenge/test.csv ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c2513ce4abb98c4d1d216e3ca0d4377d57589a0989aa8c06a840509a16c786e8
+size 60354593

jigsaw-toxic-comment-classification-challenge/test_labels.csv ADDED Viewed

The diff for this file is too large to render. See raw diff

jigsaw-toxic-comment-classification-challenge/train.csv ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bd4084611bd27c939ba98e5e63bc3e5a2c1a4e99477dcba46c829e4c986c429d
+size 68802655

requirements.txt ADDED Viewed

	@@ -0,0 +1,5 @@

+streamlit
+numpy
+transformers
+tensorflow
+torch

roberta/_roberta_model/config.json ADDED Viewed

	@@ -0,0 +1,43 @@

+{
+  "_name_or_path": "roberta-base",
+  "architectures": [
+    "RobertaForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "eos_token_id": 2,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1",
+    "2": "LABEL_2",
+    "3": "LABEL_3",
+    "4": "LABEL_4",
+    "5": "LABEL_5"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1,
+    "LABEL_2": 2,
+    "LABEL_3": 3,
+    "LABEL_4": 4,
+    "LABEL_5": 5
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "problem_type": "multi_label_classification",
+  "transformers_version": "4.8.0",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 50265
+}

roberta/_roberta_model/pytorch_model.bin ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:56b176692120cdc3c43be9880d33b1e6fa138146784a91f6c473cc3c701c81ce
+size 498688117

roberta/_roberta_model/training_args.bin ADDED Viewed

Binary file (2.56 kB). View file

train.py ADDED Viewed

	@@ -0,0 +1,156 @@

+import pandas as pd
+import os
+from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer, RobertaTokenizer
+import torch
+from torch.utils.data import Dataset
+torch.cuda.empty_cache()
+class MultiLabelClassifierDataset(Dataset):
+    def __init__(self, encodings, labels):
+        self.encodings = encodings
+        self.labels = labels
+    def __getitem__(self, idx):
+        item = {key: torch.tensor(val[idx])
+                for key, val in self.encodings.items()}
+        item['labels'] = torch.tensor(self.labels[idx]).float()
+        return item
+    def __len__(self):
+        return len(self.labels)
+work_dir = os.path.dirname(os.path.realpath(__file__)) + '/'
+dataset_dir = work_dir + 'jigsaw-toxic-comment-classification-challenge/'
+classifiers = ['toxic', 'severe_toxic', 'obscene',
+               'threat', 'insult', 'identity_hate']
+df = pd.read_csv(dataset_dir + 'train.csv')
+df = df.sample(frac=1).reset_index(drop=True)  # Shuffle
+train_df = df[:int(len(df)*0.1)]
+train_labels = train_df[classifiers].to_numpy()
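+# fp16=True below requires a CUDA GPU; the Trainer moves the model and batches to it automatically.
+device = torch.device('cuda')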
+print("Using device: ", device)
+training_args = TrainingArguments(
+    output_dir='./results',
+    num_train_epochs=2,
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=64,
+    warmup_steps=500,
+    weight_decay=0.01,
+    logging_dir='./logs',
+    logging_steps=10,
+    fp16=True
+)
+print("BERT")
+bert_dir = work_dir + 'bert/'
+print("Model base: ", "vinai/bertweet-base")
+tokenizer = AutoTokenizer.from_pretrained(
+    "vinai/bertweet-base", model_max_length=128)
+train_encodings = tokenizer(
+    train_df['comment_text'].tolist(), truncation=True, padding=True)
+print("Training model to be stored in " + bert_dir)
+print("Creating dataset")
+train_dataset = MultiLabelClassifierDataset(train_encodings, train_labels)
+print("Loading model for training...")
+model = AutoModelForSequenceClassification.from_pretrained(
+    'vinai/bertweet-base', num_labels=6)
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset
+)
+trainer.train()
+trainer.save_model(bert_dir + '_bert_model')
+training_args = TrainingArguments(
+    output_dir='./results',
+    num_train_epochs=1,
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=16,
+    warmup_steps=500,
+    weight_decay=0.01,
+    logging_dir='./logs',
+    logging_steps=10,
+    fp16=True
+)
+print("RoBERTa")
+roberta_dir = work_dir + 'roberta/'
+tokenizer = RobertaTokenizer.from_pretrained(
+    'roberta-base', model_max_length=128)
+train_encodings = tokenizer(
+    train_df['comment_text'].tolist(), truncation=True, padding=True)
+train_dataset = MultiLabelClassifierDataset(train_encodings, train_labels)
+model = AutoModelForSequenceClassification.from_pretrained(
+    'roberta-base', num_labels=6)
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset
+)
+trainer.train()
+trainer.save_model(roberta_dir + '_roberta_model')
+training_args = TrainingArguments(
+    output_dir='./results',
+    num_train_epochs=1,
+    per_device_train_batch_size=32,
+    per_device_eval_batch_size=64,
+    warmup_steps=500,
+    weight_decay=0.01,
+    logging_dir='./logs',
+    logging_steps=10,
+    fp16=True
+)
+print("DISTILBERT")
+distilbert_dir = work_dir + 'distilbert/'
+tokenizer = AutoTokenizer.from_pretrained(
+    'distilbert-base-cased', model_max_length=128)
+train_encodings = tokenizer(
+    train_df['comment_text'].tolist(), truncation=True, padding=True)
+train_dataset = MultiLabelClassifierDataset(train_encodings, train_labels)
+model = AutoModelForSequenceClassification.from_pretrained(
+    'distilbert-base-cased', num_labels=6)
+trainer = Trainer(
+    model=model,
+    args=training_args,
+    train_dataset=train_dataset
+)
+trainer.train()
+trainer.save_model(distilbert_dir + '_distilbert_model')
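
`train.py` never sets `problem_type`, yet the saved configs record `multi_label_classification`. As far as I understand the sequence-classification heads in transformers, the type is inferred from `num_labels` and the label dtype, and the multi-label case is trained with a binary cross-entropy loss on the raw logits. The sketch below mirrors that rule and loss in plain Python; `infer_problem_type` and `bce_with_logits` are hypothetical helpers, not library functions.

```python
import math


def infer_problem_type(num_labels: int, labels_are_float: bool) -> str:
    """Approximate the rule applied when config.problem_type is unset."""
    if num_labels == 1:
        return "regression"
    if not labels_are_float:
        return "single_label_classification"
    return "multi_label_classification"


def bce_with_logits(logits, targets) -> float:
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable form of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


# The dataset class casts labels with .float(), and num_labels is 6.
print(infer_problem_type(num_labels=6, labels_are_float=True))

# An untrained head that outputs all-zero logits scores ln(2) per label.
print(bce_with_logits([0.0] * 6, [1, 0, 0, 0, 0, 0]))
```

This is why the `.float()` cast in `MultiLabelClassifierDataset.__getitem__` matters: with integer labels the same model would be treated as a single-label classifier.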