Spaces: irmchek/mynotebooksummary

irmchek committed on Apr 16

Commit 57d40ed (1 parent: 462fea8)

summarizer version 1: used a different model for creating a summary. The summary generated includes the title in the first sentence.


Files changed (3)

enhanced_notebook.ipynb +298 -0
notebook_enhancer.py +48 -48
test.ipynb +104 -0

enhanced_notebook.ipynb ADDED

	@@ -0,0 +1,298 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Data Science Analysis Notebook\n",
+    "\n",
+    "This notebook contains some example Python code for data analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#  Create a function to summarize the code.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "At first, we will start by importing the pandas and numpy modules.\n",
+    " Then we will use the seaborn library.\n",
+    " Next step is to set the style of the visualization.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import libraries\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "import seaborn as sns\n",
+    "\n",
+    "# Set visualization style\n",
+    "sns.set(style='whitegrid')\n",
+    "%matplotlib inline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a function summarize and load the dataset.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "To Load the dataset\n",
+    " To display the basic information, use the print statement in the function.\n",
+    " To print the dataset shape and head method.\n",
+    "\n",
+    " Create a new dataframe with the shape of the dataframe and the head method"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the dataset\n",
+    "df = pd.read_csv('housing_data.csv')\n",
+    "\n",
+    "# Display basic information\n",
+    "print(f\"Dataset shape: {df.shape}\")\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Create a function summarize to perform the data cleaning.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "In the for loop we iterate through the dataframe and fill missing values with median.\n",
+    " For each column in the dataframe, we check if the column is float64 or int64 type. If it is then we use the mode() function"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Perform data cleaning\n",
+    "# Fill missing values with median\n",
+    "for column in df.columns:\n",
+    "    if df[column].dtype in ['float64', 'int64']:\n",
+    "        df[column].fillna(df[column].median(), inplace=True)\n",
+    "    else:\n",
+    "        df[column].fillna(df[column].mode()[0], inplace=True)\n",
+    "\n",
+    "# Check for remaining missing values\n",
+    "print(\"Missing values after cleaning:\")\n",
+    "print(df.isnull().sum())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#  Create a function to summarize the data.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "For each column in the dataframe, create a list of numeric columns.\n",
+    " Then create a correlation matrix.\n",
+    " Next step is to create a function that takes in a dataframe and returns the correlation matrix as an argument."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Exploratory data analysis\n",
+    "# Create correlation matrix\n",
+    "numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns\n",
+    "correlation_matrix = df[numeric_columns].corr()\n",
+    "\n",
+    "# Plot heatmap\n",
+    "plt.figure(figsize=(12, 10))\n",
+    "sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)\n",
+    "plt.title('Correlation Matrix of Numeric Features', fontsize=18)\n",
+    "plt.xticks(rotation=45, ha='right')\n",
+    "plt.tight_layout()\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#  Create a variable called bedrooms_ratio and rooms_per_household.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "If 'bedrooms' in the column and total_rooms is the column then create a new feature and scale it.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Feature engineering\n",
+    "# Create new features\n",
+    "if 'bedrooms' in df.columns and 'total_rooms' in df.columns:\n",
+    "    df['bedrooms_ratio'] = df['bedrooms'] / df['total_rooms']\n",
+    "\n",
+    "if 'total_rooms' in df.columns and 'households' in df.columns:\n",
+    "    df['rooms_per_household'] = df['total_rooms'] / df['households']\n",
+    "\n",
+    "# Scale numeric features\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "scaler = StandardScaler()\n",
+    "df[numeric_columns] = scaler.fit_transform(df[numeric_columns])\n",
+    "\n",
+    "# Display transformed data\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 19,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#  Create a simple prediction model\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "This function will build a model that can be used to train and evaluate the model.\n",
+    " Next step is to split the dataframe into training and test data and predict the median_house_value column using the train_test_split function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Build a simple prediction model\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.linear_model import LinearRegression\n",
+    "from sklearn.metrics import mean_squared_error, r2_score\n",
+    "\n",
+    "# Assume we're predicting median_house_value\n",
+    "if 'median_house_value' in df.columns:\n",
+    "    # Prepare features and target\n",
+    "    X = df.drop('median_house_value', axis=1)\n",
+    "    y = df['median_house_value']\n",
+    "    \n",
+    "    # Split the data\n",
+    "    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n",
+    "    \n",
+    "    # Train the model\n",
+    "    model = LinearRegression()\n",
+    "    model.fit(X_train, y_train)\n",
+    "    \n",
+    "    # Make predictions\n",
+    "    y_pred = model.predict(X_test)\n",
+    "    \n",
+    "    # Evaluate the model\n",
+    "    mse = mean_squared_error(y_test, y_pred)\n",
+    "    r2 = r2_score(y_test, y_pred)\n",
+    "    \n",
+    "    print(f\"Mean Squared Error: {mse:.2f}\")\n",
+    "    print(f\"R² Score: {r2:.2f}\")\n",
+    "    \n",
+    "    # Plot actual vs predicted values\n",
+    "    plt.figure(figsize=(10, 6))\n",
+    "    plt.scatter(y_test, y_pred, alpha=0.5)\n",
+    "    plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--')\n",
+    "    plt.xlabel('Actual Values')\n",
+    "    plt.ylabel('Predicted Values')\n",
+    "    plt.title('Actual vs Predicted Values')\n",
+    "    plt.tight_layout()\n",
+    "    plt.show()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
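
A note on the generated notebook above: the cells carry integer `id` values, and the markdown cells include an `outputs` key. As far as I can tell from the nbformat 4.5 schema, cell ids are expected to be strings (1 to 64 characters drawn from letters, digits, `-` and `_`), and `outputs` belongs only on code cells, so strict validation may reject this file. Below is a minimal, stdlib-only sketch of a cleanup pass. `normalize_cells` is a hypothetical helper and not part of this commit.

```python
import json
import uuid


def normalize_cells(notebook: dict) -> dict:
    """Return a copy of a notebook dict with string cell ids and
    code-only keys removed from non-code cells.

    Hypothetical helper for illustration; not part of the commit.
    """
    # Cheap deep copy; notebook content is JSON-compatible by definition.
    fixed = json.loads(json.dumps(notebook))
    seen = set()
    for cell in fixed.get("cells", []):
        # Convert existing ids (here: integers) to strings.
        cell_id = str(cell.get("id", ""))
        if not cell_id or cell_id in seen:
            # Missing or duplicate id: fall back to a short random one.
            cell_id = uuid.uuid4().hex[:8]
        seen.add(cell_id)
        cell["id"] = cell_id
        if cell.get("cell_type") != "code":
            # Only code cells carry outputs and an execution count.
            cell.pop("outputs", None)
            cell.pop("execution_count", None)
    return fixed


example = {
    "cells": [
        {"cell_type": "markdown", "id": 9, "metadata": {},
         "outputs": [], "source": ["# Title\n"]},
        {"cell_type": "code", "execution_count": None, "id": 2,
         "metadata": {}, "outputs": [], "source": ["x = 1"]},
    ]
}
cleaned = normalize_cells(example)
print(json.dumps(cleaned, indent=1))
```

In `enhance_notebook`, the same effect could presumably be had by not setting `outputs` on the markdown cells and assigning `str(id)` instead of `id`.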

notebook_enhancer.py CHANGED

@@ -8,42 +8,52 @@ from transformers import (
     AutoTokenizer,
     AutoConfig,
     pipeline,
-    SummarizationPipeline,
 )
 import re
-MODEL_NAME = "sagard21/python-code-explainer"
 class NotebookEnhancer:
     def __init__(self):
-        self.config = AutoConfig.from_pretrained(MODEL_NAME)
-        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding=True)
-        self.model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
-        self.model.eval()
-        self.pipeline = pipeline(
             "summarization",
-            model=MODEL_NAME,
-            config=self.config,
-            tokenizer=self.tokenizer,
         )
         self.nlp = spacy.load("en_core_web_sm")
-    def generate_title(self, code):
         """Generate a concise title for a code cell"""
-        # Limit input length to match model constraints
-        max_length = len(code) // 2
-        print("Title Max length", max_length)
-        truncated_code = code[:max_length] if len(code) > max_length else code
-        max_length = len(truncated_code) // 2
-        title = self.pipeline(code, min_length=5, max_length=30)[0][
-            "summary_text"
-        ].strip()
-        print("Result title", title)
-        # Format as a markdown title
-        return f"# {title.capitalize()}"
     def _count_num_words(self, code):
         words = code.split(" ")
@@ -51,23 +61,16 @@ class NotebookEnhancer:
     def generate_summary(self, code):
         """Generate a detailed summary for a code cell"""
-        # result = self.pipeline([code], min_length=3, max_length=len(code // 2))
-        print("Code", code)
-        result = self.pipeline(code, min_length=5, max_length=30)
-        print(result)
         summary = result[0]["summary_text"].strip()
-        summary = self._postprocess_summary(summary)
-        print("Result summary", summary)
-        # print(self._is_valid_sentence_nlp(summary))
-        # summary = result[0]["summary_text"].strip()
-        return f"{summary}"
     def enhance_notebook(self, notebook: nbformat.notebooknode.NotebookNode):
         """Add title and summary markdown cells before each code cell"""
         # Create a new notebook
         enhanced_notebook = nbformat.v4.new_notebook()
         enhanced_notebook.metadata = notebook.metadata
-        print(len(notebook.cells))
         # Process each cell
         i = 0
         id = len(notebook.cells) + 1
@@ -76,14 +79,11 @@ class NotebookEnhancer:
             # For code cells, add title and summary markdown cells
             if cell.cell_type == "code" and cell.source.strip():
                 # Generate summary
-                summary = self.generate_summary(cell.source)
                 summary_cell = nbformat.v4.new_markdown_cell(summary)
                 summary_cell.outputs = []
                 summary_cell.id = id
                 id += 1
-                # Generate title based on the summary cell
-                title = self.generate_title(summary)
                 title_cell = nbformat.v4.new_markdown_cell(title)
                 title_cell.outputs = []
                 title_cell.id = id
@@ -91,7 +91,6 @@ class NotebookEnhancer:
                 enhanced_notebook.cells.append(title_cell)
                 enhanced_notebook.cells.append(summary_cell)
             # Add the original cell
             cell.outputs = []
             enhanced_notebook.cells.append(cell)
@@ -111,14 +110,16 @@ class NotebookEnhancer:
     def _postprocess_summary(self, summary: str):
         doc = self.nlp(summary)
         sentences = list(doc.sents)
-        # ignore the first sentence
-        sentences = sentences[1:]
         # remove the trailing list enumeration
         postprocessed_sentences = []
         for sentence in sentences:
             if self.is_valid(sentence):
-                postprocessed_sentences.append(sentence.text)
-        return " ".join(postprocessed_sentences)
 def process_notebook(file_path):
@@ -129,7 +130,6 @@ def process_notebook(file_path):
         nb = nbformat.read(f, as_version=4)
     # Process the notebook
     enhanced_notebook = enhancer.enhance_notebook(nb)
-    print(enhanced_notebook)
     enhanced_notebook_str = nbformat.writes(enhanced_notebook, version=4)
     # Save to temp file
     output_path = "enhanced_notebook.ipynb"
@@ -168,7 +168,7 @@ def build_gradio_interface():
 # This will be the entry point when running the script
 if __name__ == "__main__":
-    file_input = "my_notebook.json"
-    test = process_notebook(file_input)
-    # demo = build_gradio_interface()
-    # demo.launch()

     AutoTokenizer,
     AutoConfig,
     pipeline,
 )
 import re
+import nltk
+PYTHON_CODE_MODEL = "sagard21/python-code-explainer"
+TITLE_SUMMARIZE_MODEL = "fabiochiu/t5-small-medium-title-generation"
 class NotebookEnhancer:
     def __init__(self):
+        # models + tokenizer for generating titles from code summaries
+        self.title_tokenizer = AutoTokenizer.from_pretrained(TITLE_SUMMARIZE_MODEL)
+        self.title_summarization_model = AutoModelForSeq2SeqLM.from_pretrained(
+            TITLE_SUMMARIZE_MODEL
+        )
+        # models + tokenizer for generating summaries from Python code
+        self.python_model = AutoModelForSeq2SeqLM.from_pretrained(PYTHON_CODE_MODEL)
+        self.python_tokenizer = AutoTokenizer.from_pretrained(
+            PYTHON_CODE_MODEL, padding=True
+        )
+        self.python_pipeline = pipeline(
             "summarization",
+            model=PYTHON_CODE_MODEL,
+            config=AutoConfig.from_pretrained(PYTHON_CODE_MODEL),
+            tokenizer=self.python_tokenizer,
         )
+        # initiate the language model
         self.nlp = spacy.load("en_core_web_sm")
+    def generate_title(self, summary: str):
         """Generate a concise title for a code cell"""
+        inputs = self.title_tokenizer.batch_encode_plus(
+            ["summarize: " + summary],
+            max_length=1024,
+            return_tensors="pt",
+            padding=True,
+        )  # Batch size 1
+        output = self.title_summarization_model.generate(
+            **inputs, num_beams=8, do_sample=True, min_length=10, max_length=10
+        )
+        decoded_output = self.title_tokenizer.batch_decode(
+            output, skip_special_tokens=True
+        )[0]
+        predicted_title = nltk.sent_tokenize(decoded_output.strip())[0]
+        return f"# {predicted_title}"
     def _count_num_words(self, code):
         words = code.split(" ")
     def generate_summary(self, code):
         """Generate a detailed summary for a code cell"""
+        result = self.python_pipeline(code, min_length=5, max_length=64)
         summary = result[0]["summary_text"].strip()
+        title, summary = self._postprocess_summary(summary)
+        return f"# {title}", f"{summary}"
     def enhance_notebook(self, notebook: nbformat.notebooknode.NotebookNode):
         """Add title and summary markdown cells before each code cell"""
         # Create a new notebook
         enhanced_notebook = nbformat.v4.new_notebook()
         enhanced_notebook.metadata = notebook.metadata
         # Process each cell
         i = 0
         id = len(notebook.cells) + 1
             # For code cells, add title and summary markdown cells
             if cell.cell_type == "code" and cell.source.strip():
                 # Generate summary
+                title, summary = self.generate_summary(cell.source)
                 summary_cell = nbformat.v4.new_markdown_cell(summary)
                 summary_cell.outputs = []
                 summary_cell.id = id
                 id += 1
                 title_cell = nbformat.v4.new_markdown_cell(title)
                 title_cell.outputs = []
                 title_cell.id = id
                 enhanced_notebook.cells.append(title_cell)
                 enhanced_notebook.cells.append(summary_cell)
             # Add the original cell
             cell.outputs = []
             enhanced_notebook.cells.append(cell)
     def _postprocess_summary(self, summary: str):
         doc = self.nlp(summary)
         sentences = list(doc.sents)
         # remove the trailing list enumeration
         postprocessed_sentences = []
         for sentence in sentences:
             if self.is_valid(sentence):
+                sentence_text = sentence.text
+                sentence_text = re.sub(r"[0-9]+\.", "", sentence_text)
+                postprocessed_sentences.append(sentence_text)
+        title = postprocessed_sentences[0]
+        summary = postprocessed_sentences[1:]
+        return title, " ".join(summary)
 def process_notebook(file_path):
         nb = nbformat.read(f, as_version=4)
     # Process the notebook
     enhanced_notebook = enhancer.enhance_notebook(nb)
     enhanced_notebook_str = nbformat.writes(enhanced_notebook, version=4)
     # Save to temp file
     output_path = "enhanced_notebook.ipynb"
 # This will be the entry point when running the script
 if __name__ == "__main__":
+    # file_input = "my_notebook.json"
+    # test = process_notebook(file_input)
+    demo = build_gradio_interface()
+    demo.launch()
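
One thing to watch in the new `_postprocess_summary`: `postprocessed_sentences[0]` raises `IndexError` when no sentence passes `is_valid`, which seems possible for very short code cells. Below is a minimal sketch of the title/summary split with a fallback. It uses a plain regex sentence splitter instead of spaCy so that it runs standalone, and it strips list enumeration only at the start of a sentence so that decimals such as `0.5` survive. `split_title_and_summary` and the `"Code cell"` default are hypothetical, not part of this commit.

```python
import re

# Split after sentence-ending punctuation followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Enumeration markers such as "1." or "12. " at the start of a sentence.
LEADING_ENUMERATION = re.compile(r"^[0-9]+\.\s*")


def split_title_and_summary(text, default_title="Code cell"):
    """Use the first sentence as the title and the rest as the summary.

    Hypothetical helper for illustration; the commit itself relies on
    spaCy sentence segmentation and an `is_valid` check.
    """
    sentences = []
    for raw in SENTENCE_END.split(text.strip()):
        sentence = LEADING_ENUMERATION.sub("", raw).strip()
        if sentence:
            sentences.append(sentence)
    if not sentences:
        # Nothing usable came back from the model: avoid IndexError.
        return default_title, ""
    return sentences[0], " ".join(sentences[1:])


title, summary = split_title_and_summary(
    "1. Import pandas. 2. Load the data. 3. Print the shape."
)
print(f"# {title}")
print(summary)
```

Returning a default title keeps `enhance_notebook` producing a title cell for every code cell; skipping the title cell entirely would be an equally reasonable choice.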

test.ipynb CHANGED

@@ -124,6 +124,110 @@
     "        print(word, word.is_alpha, word.pos_)\n"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,

     "        print(word, word.is_alpha, word.pos_)\n"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": 50,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "['this function will build a model that can be used to train and']\n"
+     ]
+    }
+   ],
+   "source": [
+    "from transformers import T5Tokenizer, T5ForConditionalGeneration\n",
+    "example_text = \"This function will build a model that can be used to train and evaluate the model.\"\n",
+    "tokenizer = T5Tokenizer.from_pretrained('t5-small')\n",
+    "model = T5ForConditionalGeneration.from_pretrained('t5-small')\n",
+    "inputs = tokenizer.batch_encode_plus([\"summarize: \" + example_text], max_length=1024, return_tensors=\"pt\", pad_to_max_length=True)  # Batch size 1\n",
+    "outputs = model.generate(inputs['input_ids'], num_beams=2, max_length=15, early_stopping=True)\n",
+    "\n",
+    "print([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in outputs])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 59,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Device set to use mps:0\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[{'summary_text': 'An apple a day, keeps the'}]"
+      ]
+     },
+     "execution_count": 59,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from transformers import pipeline\n",
+    "summarizer = pipeline(\"summarization\", model=\"facebook/bart-large-cnn\", tokenizer=\"facebook/bart-large-cnn\")\n",
+    "summarizer(\"An apple a day, keeps the doctor away\", min_length=5, max_length=10)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 76,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[nltk_data] Downloading package punkt to /Users/irma/nltk_data...\n",
+      "[nltk_data]   Package punkt is already up-to-date!\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "This function will build a model that can be used to train and evaluate the model.\n",
+      "27\n"
+     ]
+    }
+   ],
+   "source": [
+    "from transformers import AutoTokenizer, AutoModelForSeq2SeqLM\n",
+    "import nltk\n",
+    "nltk.download('punkt')\n",
+    "\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"fabiochiu/t5-small-medium-title-generation\")\n",
+    "model = AutoModelForSeq2SeqLM.from_pretrained(\"fabiochiu/t5-small-medium-title-generation\")\n",
+    "\n",
+    "text = \"This function will build a model that can be used to train and evaluate the model.\"\n",
+    "\n",
+    "inputs = [\"summarize: \" + text]\n",
+    "\n",
+    "inputs = tokenizer(inputs, max_length=1024, truncation=True, return_tensors=\"pt\")\n",
+    "output = model.generate(**inputs, num_beams=4, do_sample=True, min_length=10, max_length=len(text) // 3)\n",
+    "decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]\n",
+    "predicted_title = nltk.sent_tokenize(decoded_output.strip())[0]\n",
+    "\n",
+    "print(predicted_title)\n",
+    "# Conversational AI: The Future of Customer Service\n",
+    "print(len(text) // 3)"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,