{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Text Similarity Prediction and Analysis\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Objective\n", "\n", "The aim of this project is to create a system that can analyze the similarity between records by using text analysis techniques. The system will employ natural language processing methods and similarity metrics to assess the similarity of textual content present in different documents. This analysis will enable applications such as document retrieval, clustering, and recommendation systems to provide more accurate and relevant results based on the similarity of document contents. The goal is to improve information management and information retrieval workflows by providing a robust and efficient method for measuring document similarity using text analysis.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem Statement\n", "\n", "Effectively organizing, retrieving, and using large volumes of textual documents is a vital challenge in many domains, including digital libraries, knowledge management systems, and content recommendation platforms. With the exponential growth of digital information, it's becoming increasingly difficult to manually categorize, cluster, and identify related records. Without efficient methods to measure the similarity between records based on their textual content, organizations struggle to manage their document repositories effectively, hampering productivity and decision-making processes.\n", "\n", "The traditional keyword-based search and retrieval methods often fall short of capturing the true semantic similarities between documents, leading to incomplete or irrelevant results. 
This creates a pressing need for advanced text analysis techniques that accurately assess the degree of similarity between documents, considering their contextual meaning, themes, and conceptual overlaps.\n", "\n", "By developing a robust text similarity analysis system, organizations can unlock multiple benefits, including improved information retrieval, enhanced content clustering and categorization, and more effective recommendation systems. Accurate similarity analysis allows users to identify related documents quickly, facilitating knowledge sharing, collaboration, and decision-making processes. Moreover, such a system can streamline document management workflows, reducing redundancy and enabling more efficient storage and organization of textual information.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Justification\n", "\n", "The exponential growth of digital information has resulted in an overwhelming amount of textual data in the form of documents, reports, articles, and other written materials. As this flood of information continues to expand, organizations across various sectors are struggling to manage, organize, and extract value from their document repositories effectively.\n", "\n", "Traditional methods of document management, such as manual categorization and keyword-based search, are increasingly inadequate in handling the scale and complexity of modern document collections. These approaches often fail to capture the true semantic similarities between documents, leading to incomplete or irrelevant search results, inefficient clustering, and missed opportunities for knowledge discovery.\n", "\n", "Developing a robust system for analyzing document similarity that leverages advanced text analysis techniques and natural language processing methods is essential for several reasons:\n", "\n", "1. 
Improved information retrieval: By accurately measuring the similarity between documents based on their textual content, users can quickly identify and retrieve related materials, enhancing research, decision-making, and knowledge-sharing processes.\n", "\n", "2. Efficient document clustering and categorization: Similarity analysis enables automated document clustering and categorization, reducing the need for manual effort and ensuring that related documents are organized together for easier access and navigation.\n", "\n", "3. Enhanced recommendation systems: By understanding the semantic relationships between documents, recommendation systems can provide more relevant and personalized suggestions, improving user experience and facilitating content discovery.\n", "\n", "4. Reduction of redundancy and duplication: Identifying highly similar or duplicate documents can help organizations streamline their document repositories, reducing storage requirements and improving overall efficiency.\n", "\n", "5. Knowledge extraction and insight generation: Analyzing similarities between documents can reveal patterns, trends, and connections that may not be immediately apparent, enabling organizations to uncover valuable insights and make data-driven decisions.\n", "\n", "Moreover, as the volume of digital information continues to grow, the importance of effective document similarity analysis will only increase. 
Failing to address this challenge can lead to inefficient information management, missed opportunities, and a competitive disadvantage for organizations that rely heavily on textual data.\n", "\n", "By investing in the development of a robust document similarity analysis system, organizations can future-proof their document management processes, gain a deeper understanding of their information assets, and unlock new opportunities for innovation and growth.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Research Data\n", "\n", "The [STS (Semantic Textual Similarity) Benchmark dataset](https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark) is a popular resource for evaluating the performance of systems designed to measure the semantic similarity between pairs of sentences. It is widely used in the natural language processing community for tasks such as text understanding, paraphrase detection, and sentence similarity analysis.\n", "\n", "The STS Benchmark dataset consists of a collection of sentence pairs, each accompanied by a human-annotated similarity score ranging from 0 (no semantic similarity) to 5 (semantic equivalence). These sentence pairs are drawn from various sources, including news articles, image captions, and online forums, covering a diverse range of topics and domains.\n", "\n", "The STS Benchmark dataset has been widely adopted due to its diversity, the availability of human-annotated similarity scores, and its usefulness in evaluating the performance of various semantic similarity models and algorithms. It provides a standardized and well-curated resource for researchers and developers working on natural language processing tasks involving semantic similarity analysis.\n", "\n", "## Description of Data\n", "\n", "The benchmark comprises 8628 sentence pairs. 
This is the breakdown according to genres and train-dev-test splits:\n", "\n", "| | train | dev | test | total |\n", "| ------- | ----- | ---- | ---- | ----- |\n", "| news | 3299 | 500 | 500 | 4299 |\n", "| caption | 2000 | 625 | 625 | 3250 |\n", "| forum | 450 | 375 | 254 | 1079 |\n", "| total | 5749 | 1500 | 1379 | 8628 |\n", "\n", "Breakdown according to the original names and task years of the datasets:\n", "\n", "| genre | file | years | train | dev | test |\n", "| -------- | -------------- | ------- | ----- | --- | ---- |\n", "| news | MSRpar | 2012 | 1000 | 250 | 250 |\n", "| news | headlines | 2013-16 | 1999 | 250 | 250 |\n", "| news | deft-news | 2014 | 300 | 0 | 0 |\n", "| captions | MSRvid | 2012 | 1000 | 250 | 250 |\n", "| captions | images | 2014-15 | 1000 | 250 | 250 |\n", "| captions | track5.en-en | 2017 | 0 | 125 | 125 |\n", "| forum | deft-forum | 2014 | 450 | 0 | 0 |\n", "| forum | answers-forums | 2015 | 0 | 375 | 0 |\n", "| forum | answer-answer | 2016 | 0 | 0 | 254 |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Research Questions\n", "\n", "1. What are the most effective natural language processing techniques and algorithms for measuring semantic similarity between documents?\n", "\n", "2. What are the computational and scalability challenges associated with performing similarity analysis on text collections, and how can these challenges be addressed?\n", "\n", "3. How can user interactions and feedback be effectively incorporated into the similarity analysis system to improve its accuracy and adaptability over time?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Assumptions:\n", "\n", "1. A sufficient quantity and variety of textual data is available for training and evaluating the document similarity analysis system.\n", "\n", "2. 
Computational resources, including processing power and memory, are adequate to support the implementation and deployment of the document similarity analysis system at scale.\n", "\n", "3. The natural language processing techniques and similarity metrics selected for the system are capable of effectively capturing semantic relationships and nuances within textual documents across different languages and domains.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scope:\n", "\n", "The scope of this project is to develop a minimum viable prototype of a document similarity analysis system using natural language processing techniques. The prototype will accept text input in the form of documents or text passages and analyze their semantic similarity.\n", "\n", "The core functionality will include:\n", "\n", "1. Text extraction and preprocessing\n", "2. Embedding documents into vector representations using pre-trained language models\n", "3. Calculating pairwise similarity scores between document embeddings using cosine similarity or other distance metrics\n", "4. Returning a ranked list of similar documents given an input document\n", "\n", "It will be developed as a simple web application using Gradio and deployed on Hugging Face Spaces for easy access and testing.\n", "\n", "The initial scope is limited to handling text input in English. Advanced features like multilingual support, domain adaptation, scalability optimizations, and user feedback incorporation are out of scope for this prototype. The primary goal is to demonstrate the core document similarity analysis capabilities using readily available NLP tools and models.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Hypothesis\n", "\n", "By leveraging NLP models and semantic text similarity techniques, it is hypothesized that the developed prototype system will be able to accurately measure and rank the similarities between documents based on their contextual content. 
Specifically, the prototype will demonstrate an improvement in identifying semantically related documents compared to traditional keyword-based approaches. This will be achieved by projecting documents into high-dimensional vector representations that capture their underlying meanings and concepts, allowing for a more robust similarity comparison.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Code\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installations\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# %pip install datasets transformers\n", "# %pip install sentence-transformers\n", "# %pip install accelerate -U\n", "# %pip install streamlit\n", "# %pip install textdistance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports\n" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset\n", "from matplotlib import pyplot as plt\n", "from sentence_transformers import CrossEncoder, SentenceTransformer, losses, models\n", "from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator\n", "from sentence_transformers.readers import InputExample\n", "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.metrics.pairwise import cosine_similarity\n", "from sklearn.preprocessing import StandardScaler\n", "from torch.utils.data import DataLoader\n", "import math\n", "import pandas as pd\n", "import textdistance\n", "import numpy as np\n", "import joblib\n", "from samples import get_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load English train Dataset:\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Found cached dataset json 
(/Users/charleskabue/.cache/huggingface/datasets/mteb___json/mteb--stsbenchmark-sts-998a21523b45a16a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "ee8e8730f3834a00ab4599949457d0d4", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", "" ], "text/plain": [ " epoch steps cosine_pearson cosine_spearman euclidean_pearson \\\n", "0 0 -1 0.831459 0.822286 0.799320 \n", "1 1 -1 0.840706 0.834337 0.815210 \n", "2 2 -1 0.843926 0.835904 0.826207 \n", "3 3 -1 0.846428 0.839487 0.828132 \n", "4 0 -1 0.792202 0.771397 0.760367 \n", "5 1 -1 0.835359 0.827130 0.807031 \n", "6 2 -1 0.838358 0.830313 0.819444 \n", "7 3 -1 0.842378 0.837650 0.822014 \n", "8 4 -1 0.843533 0.837812 0.830953 \n", "9 5 -1 0.842615 0.838848 0.827527 \n", "10 6 -1 0.843484 0.838789 0.827800 \n", "11 7 -1 0.844762 0.840291 0.828616 \n", "12 8 -1 0.843006 0.839770 0.827860 \n", "13 9 -1 0.846304 0.842857 0.828914 \n", "14 10 -1 0.845491 0.841096 0.831268 \n", "15 11 -1 0.845038 0.841520 0.830022 \n", "16 12 -1 0.845263 0.842207 0.831057 \n", "17 13 -1 0.844913 0.841911 0.830002 \n", "18 14 -1 0.844950 0.842389 0.830668 \n", "19 15 -1 0.845169 0.842546 0.830544 \n", "20 0 -1 0.789996 0.771507 0.751155 \n", "21 1 -1 0.834723 0.825629 0.805897 \n", "22 2 -1 0.843609 0.836768 0.824553 \n", "23 3 -1 0.844290 0.837315 0.823369 \n", "24 4 -1 0.848353 0.843894 0.829264 \n", "25 5 -1 0.846934 0.841312 0.828725 \n", "26 6 -1 0.845585 0.841405 0.829897 \n", "27 7 -1 0.845388 0.840861 0.831278 \n", "28 8 -1 0.846980 0.843605 0.831718 \n", "29 9 -1 0.846720 0.843634 0.829811 \n", "30 10 -1 0.847116 0.843249 0.832140 \n", "31 11 -1 0.847349 0.844004 0.832437 \n", "32 12 -1 0.847548 0.844449 0.832724 \n", "33 13 -1 0.847360 0.843996 0.832769 \n", "34 14 -1 0.848117 0.844973 0.832809 \n", "35 15 -1 0.848033 0.844908 0.832758 \n", "\n", " euclidean_spearman manhattan_pearson manhattan_spearman dot_pearson \\\n", "0 0.795001 0.798473 0.794257 0.698020 \n", "1 0.812220 0.814498 0.811689 0.732746 \n", "2 0.822151 0.825924 0.821681 0.748794 \n", "3 0.825238 0.827827 0.824739 0.755207 \n", "4 0.743911 0.760203 0.744113 0.681647 \n", "5 0.803834 0.806447 0.803400 0.724916 \n", "6 0.815761 0.819400 0.815993 0.742720 \n", "7 0.820538 0.821950 0.820751 
0.746923 \n", "8 0.827699 0.831241 0.828058 0.760023 \n", "9 0.827053 0.827468 0.827264 0.758369 \n", "10 0.826239 0.827930 0.826622 0.756345 \n", "11 0.827343 0.828418 0.827361 0.763227 \n", "12 0.826893 0.827871 0.827051 0.765544 \n", "13 0.828065 0.828630 0.828003 0.765577 \n", "14 0.830173 0.831042 0.829934 0.766516 \n", "15 0.829324 0.829868 0.829125 0.764627 \n", "16 0.830302 0.830940 0.830102 0.766376 \n", "17 0.828979 0.829962 0.829043 0.766809 \n", "18 0.829802 0.830586 0.829736 0.766818 \n", "19 0.829747 0.830460 0.829629 0.767047 \n", "20 0.735840 0.750595 0.735447 0.656377 \n", "21 0.801195 0.805206 0.800686 0.714148 \n", "22 0.821647 0.824005 0.821232 0.750288 \n", "23 0.820839 0.822907 0.820771 0.752619 \n", "24 0.828715 0.829097 0.828821 0.739662 \n", "25 0.827166 0.828384 0.827049 0.762402 \n", "26 0.828822 0.829770 0.828740 0.764267 \n", "27 0.829617 0.831128 0.829610 0.767783 \n", "28 0.830741 0.831503 0.830459 0.768833 \n", "29 0.829512 0.829674 0.829268 0.765590 \n", "30 0.830753 0.831911 0.830722 0.767915 \n", "31 0.831659 0.832199 0.831515 0.769308 \n", "32 0.832165 0.832418 0.831973 0.768521 \n", "33 0.831874 0.832490 0.831656 0.768880 \n", "34 0.832048 0.832587 0.831915 0.769845 \n", "35 0.831945 0.832527 0.831946 0.769407 \n", "\n", " dot_spearman \n", "0 0.691569 \n", "1 0.730361 \n", "2 0.737895 \n", "3 0.744858 \n", "4 0.667173 \n", "5 0.715752 \n", "6 0.731048 \n", "7 0.737437 \n", "8 0.746661 \n", "9 0.746365 \n", "10 0.745615 \n", "11 0.751573 \n", "12 0.753510 \n", "13 0.754536 \n", "14 0.754324 \n", "15 0.752546 \n", "16 0.754835 \n", "17 0.755277 \n", "18 0.755222 \n", "19 0.755781 \n", "20 0.647560 \n", "21 0.704298 \n", "22 0.740117 \n", "23 0.743209 \n", "24 0.729897 \n", "25 0.751935 \n", "26 0.753591 \n", "27 0.756129 \n", "28 0.757720 \n", "29 0.754997 \n", "30 0.754851 \n", "31 0.758294 \n", "32 0.757657 \n", "33 0.757353 \n", "34 0.758487 \n", "35 0.758096 " ] }, "execution_count": 19, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "# Evaluate the model performance\n", "eval_df = pd.read_csv(f\"{model_save_path}/eval/similarity_evaluation_sts-dev_results.csv\")\n", "eval_df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Plot the model performance evaluation" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# plot figure size\n", "plt.figure(figsize=(12, 6))\n", "# plot each column\n", "for column in eval_df.drop(columns=['epoch', 'steps']).columns:\n", " plt.plot(eval_df['epoch'], eval_df[column], label=column)\n", "# put ledgets outside plot\n", "plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')\n", "plt.xlabel('epoch')\n", "plt.ylabel('prediction accuracy')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test the model\n" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Found cached dataset json (/Users/charleskabue/.cache/huggingface/datasets/mteb___json/mteb--stsbenchmark-sts-998a21523b45a16a/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "cf477c12b90443f48828cd85cbf27e58", "version_major": 2, "version_minor": 0 }, "text/plain": [ " 0%| | 0/3 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "" ], "text/plain": [ " sentence1 \\\n", "0 A man with a hard hat is dancing. \n", "1 A man is fitting silencer on a pistol. \n", "2 Kittens are eating food. \n", "3 A woman is mixing ingrediants. \n", "4 A woman is cooking eggs. \n", "5 Someone is beating an egg. \n", "6 A small baby is playing a guitar. \n", "7 I think it is still feasible to store seeds un... \n", "8 A man is playing soccer. \n", "9 Two little girls are talking on the phone. \n", "10 The man is riding a horse. \n", "\n", " sentence2 score \n", "0 A man wearing a hard hat is dancing. 5.0 \n", "1 A man is adding a silencer to a gun. 4.5 \n", "2 Kittens are eating from dishes. 4.0 \n", "3 A woman is mixing food in a bowl. 3.5 \n", "4 A woman is cooking something. 3.0 \n", "5 A woman stirs eggs in a bowl. 2.5 \n", "6 A boy sits on a bed, sings and plays a guitar. 2.0 \n", "7 I haven't tried storing tomato seeds myself, b... 1.5 \n", "8 A man is playing flute. 1.0 \n", "9 A little girl is walking down the street. 0.5 \n", "10 A woman is using a hoe. 0.0 " ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_samples = get_samples()\n", "test_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-Encoder\n", "\n", "We pass both sentences simultaneously to the Transformer network. 
It then produces an output value between 0 and 1 indicating the similarity of the input sentence pair; see [cross-encoders-usage](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/cross-encoder#cross-encoders-usage).\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at trained_model_stsbenchmark_bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] } ], "source": [ "cross_encoder_model = CrossEncoder(model_save_path)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.60892063" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_encoder_model.predict([\n", "    'A man with a hard hat is dancing.',\n", "    'A man wearing a hard hat is dancing.'])" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5701721" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cross_encoder_model.predict([\n", "    'A dog and cat laying down together.',\n", "    'Two grey dogs are carrying a stick in the water.'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bi-Encoder\n", "\n", "Bi-Encoders produce sentence embeddings. 
These embeddings can then be compared using cosine similarity.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model = SentenceTransformer(model_save_path)" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.9799072]], dtype=float32)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cosine_similarity(\n", "    [model.encode('A man with a hard hat is dancing.')],\n", "    [model.encode('A man wearing a hard hat is dancing.')])" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.13418931]], dtype=float32)" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cosine_similarity(\n", "    [model.encode('A dog and cat laying down together.')],\n", "    [model.encode('Two grey dogs are carrying a stick in the water.')])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Comparison\n", "\n", "Cross-Encoders normally achieve higher performance than Bi-Encoders, but they do not scale well to large datasets ([Reimers, Nils and Gurevych, Iryna](https://github.com/UKPLab/sentence-transformers/tree/master/examples/applications/cross-encoder#combining-bi--and-cross-encoders)). On these samples, however, the Bi-Encoder scores track the human-annotated scores much more closely, while the Cross-Encoder scores stay between roughly 0.58 and 0.61 for every pair. A likely cause is that the Cross-Encoder's classification head was newly initialized when the model was loaded from the Bi-Encoder checkpoint (see the warning printed when `CrossEncoder(model_save_path)` was created) and was never trained; the small dataset used for training may also play a part.\n" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sentence1 \\\n", "0 A man with a hard hat is dancing. \n", "1 A man is fitting silencer on a pistol. \n", "2 Kittens are eating food. \n", "3 A woman is mixing ingrediants. \n", "4 A woman is cooking eggs. \n", "5 Someone is beating an egg. \n", "6 A small baby is playing a guitar. \n", "7 I think it is still feasible to store seeds un... \n", "8 A man is playing soccer. \n", "9 Two little girls are talking on the phone. \n", "10 The man is riding a horse. \n", "\n", " sentence2 score \\\n", "0 A man wearing a hard hat is dancing. 5.0 \n", "1 A man is adding a silencer to a gun. 4.5 \n", "2 Kittens are eating from dishes. 4.0 \n", "3 A woman is mixing food in a bowl. 3.5 \n", "4 A woman is cooking something. 3.0 \n", "5 A woman stirs eggs in a bowl. 2.5 \n", "6 A boy sits on a bed, sings and plays a guitar. 2.0 \n", "7 I haven't tried storing tomato seeds myself, b... 1.5 \n", "8 A man is playing flute. 1.0 \n", "9 A little girl is walking down the street. 0.5 \n", "10 A woman is using a hoe. 
0.0 \n", "\n", " normalized_score Cross-Encoder Bi-Encoder \n", "0 1.0 0.608921 0.979907 \n", "1 0.9 0.614019 0.874845 \n", "2 0.8 0.613255 0.872530 \n", "3 0.7 0.602167 0.440890 \n", "4 0.6 0.599842 0.619852 \n", "5 0.5 0.593724 0.435095 \n", "6 0.4 0.593979 0.505967 \n", "7 0.3 0.603493 0.331625 \n", "8 0.2 0.576015 0.094140 \n", "9 0.1 0.614359 0.390625 \n", "10 0.0 0.599499 -0.028931 " ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_samples['normalized_score'] = test_samples['score'] / 5.0\n", "test_samples['Cross-Encoder'] = test_samples.apply(\n", " lambda x: cross_encoder_model.predict([x['sentence1'], x['sentence2']]), axis=1)\n", "test_samples['Bi-Encoder'] = test_samples.apply(\n", " lambda x: cosine_similarity([model.encode(x['sentence1'])],[model.encode(x['sentence2'])])[0][0], axis=1)\n", "test_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Other Text Comparisons\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Levenshtein Distance\n" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>sentence1</th><th>sentence2</th><th>score</th><th>normalized_score</th><th>Cross-Encoder</th><th>Bi-Encoder</th><th>Levenshtein</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>A man with a hard hat is dancing.</td><td>A man wearing a hard hat is dancing.</td><td>5.0</td><td>1.0</td><td>0.608921</td><td>0.979907</td><td>0.861111</td></tr>\n",
 "    <tr><th>1</th><td>A man is fitting silencer on a pistol.</td><td>A man is adding a silencer to a gun.</td><td>4.5</td><td>0.9</td><td>0.614019</td><td>0.874845</td><td>0.631579</td></tr>\n",
 "    <tr><th>2</th><td>Kittens are eating food.</td><td>Kittens are eating from dishes.</td><td>4.0</td><td>0.8</td><td>0.613255</td><td>0.872530</td><td>0.741935</td></tr>\n",
 "    <tr><th>3</th><td>A woman is mixing ingrediants.</td><td>A woman is mixing food in a bowl.</td><td>3.5</td><td>0.7</td><td>0.602167</td><td>0.440890</td><td>0.606061</td></tr>\n",
 "    <tr><th>4</th><td>A woman is cooking eggs.</td><td>A woman is cooking something.</td><td>3.0</td><td>0.6</td><td>0.599842</td><td>0.619852</td><td>0.724138</td></tr>\n",
 "    <tr><th>5</th><td>Someone is beating an egg.</td><td>A woman stirs eggs in a bowl.</td><td>2.5</td><td>0.5</td><td>0.593724</td><td>0.435095</td><td>0.310345</td></tr>\n",
 "    <tr><th>6</th><td>A small baby is playing a guitar.</td><td>A boy sits on a bed, sings and plays a guitar.</td><td>2.0</td><td>0.4</td><td>0.593979</td><td>0.505967</td><td>0.456522</td></tr>\n",
 "    <tr><th>7</th><td>I think it is still feasible to store seeds un...</td><td>I haven't tried storing tomato seeds myself, b...</td><td>1.5</td><td>0.3</td><td>0.603493</td><td>0.331625</td><td>0.347826</td></tr>\n",
 "    <tr><th>8</th><td>A man is playing soccer.</td><td>A man is playing flute.</td><td>1.0</td><td>0.2</td><td>0.576015</td><td>0.094140</td><td>0.791667</td></tr>\n",
 "    <tr><th>9</th><td>Two little girls are talking on the phone.</td><td>A little girl is walking down the street.</td><td>0.5</td><td>0.1</td><td>0.614359</td><td>0.390625</td><td>0.642857</td></tr>\n",
 "    <tr><th>10</th><td>The man is riding a horse.</td><td>A woman is using a hoe.</td><td>0.0</td><td>0.0</td><td>0.599499</td><td>-0.028931</td><td>0.653846</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " sentence1 \\\n", "0 A man with a hard hat is dancing. \n", "1 A man is fitting silencer on a pistol. \n", "2 Kittens are eating food. \n", "3 A woman is mixing ingrediants. \n", "4 A woman is cooking eggs. \n", "5 Someone is beating an egg. \n", "6 A small baby is playing a guitar. \n", "7 I think it is still feasible to store seeds un... \n", "8 A man is playing soccer. \n", "9 Two little girls are talking on the phone. \n", "10 The man is riding a horse. \n", "\n", " sentence2 score \\\n", "0 A man wearing a hard hat is dancing. 5.0 \n", "1 A man is adding a silencer to a gun. 4.5 \n", "2 Kittens are eating from dishes. 4.0 \n", "3 A woman is mixing food in a bowl. 3.5 \n", "4 A woman is cooking something. 3.0 \n", "5 A woman stirs eggs in a bowl. 2.5 \n", "6 A boy sits on a bed, sings and plays a guitar. 2.0 \n", "7 I haven't tried storing tomato seeds myself, b... 1.5 \n", "8 A man is playing flute. 1.0 \n", "9 A little girl is walking down the street. 0.5 \n", "10 A woman is using a hoe. 0.0 \n", "\n", " normalized_score Cross-Encoder Bi-Encoder Levenshtein \n", "0 1.0 0.608921 0.979907 0.861111 \n", "1 0.9 0.614019 0.874845 0.631579 \n", "2 0.8 0.613255 0.872530 0.741935 \n", "3 0.7 0.602167 0.440890 0.606061 \n", "4 0.6 0.599842 0.619852 0.724138 \n", "5 0.5 0.593724 0.435095 0.310345 \n", "6 0.4 0.593979 0.505967 0.456522 \n", "7 0.3 0.603493 0.331625 0.347826 \n", "8 0.2 0.576015 0.094140 0.791667 \n", "9 0.1 0.614359 0.390625 0.642857 \n", "10 0.0 0.599499 -0.028931 0.653846 " ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "test_samples['Levenshtein'] = test_samples.apply(\n", " lambda x: textdistance.levenshtein.normalized_similarity(x['sentence1'], x['sentence2']), axis=1)\n", "test_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### TF-IDF\n" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>sentence1</th><th>sentence2</th><th>score</th><th>normalized_score</th><th>Cross-Encoder</th><th>Bi-Encoder</th><th>Levenshtein</th><th>TF-IDF</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>A man with a hard hat is dancing.</td><td>A man wearing a hard hat is dancing.</td><td>5.0</td><td>1.0</td><td>0.608921</td><td>0.979907</td><td>0.861111</td><td>0.716812</td></tr>\n",
 "    <tr><th>1</th><td>A man is fitting silencer on a pistol.</td><td>A man is adding a silencer to a gun.</td><td>4.5</td><td>0.9</td><td>0.614019</td><td>0.874845</td><td>0.631579</td><td>0.336097</td></tr>\n",
 "    <tr><th>2</th><td>Kittens are eating food.</td><td>Kittens are eating from dishes.</td><td>4.0</td><td>0.8</td><td>0.613255</td><td>0.872530</td><td>0.741935</td><td>0.510149</td></tr>\n",
 "    <tr><th>3</th><td>A woman is mixing ingrediants.</td><td>A woman is mixing food in a bowl.</td><td>3.5</td><td>0.7</td><td>0.602167</td><td>0.440890</td><td>0.606061</td><td>0.450176</td></tr>\n",
 "    <tr><th>4</th><td>A woman is cooking eggs.</td><td>A woman is cooking something.</td><td>3.0</td><td>0.6</td><td>0.599842</td><td>0.619852</td><td>0.724138</td><td>0.602975</td></tr>\n",
 "    <tr><th>5</th><td>Someone is beating an egg.</td><td>A woman stirs eggs in a bowl.</td><td>2.5</td><td>0.5</td><td>0.593724</td><td>0.435095</td><td>0.310345</td><td>0.000000</td></tr>\n",
 "    <tr><th>6</th><td>A small baby is playing a guitar.</td><td>A boy sits on a bed, sings and plays a guitar.</td><td>2.0</td><td>0.4</td><td>0.593979</td><td>0.505967</td><td>0.456522</td><td>0.087044</td></tr>\n",
 "    <tr><th>7</th><td>I think it is still feasible to store seeds un...</td><td>I haven't tried storing tomato seeds myself, b...</td><td>1.5</td><td>0.3</td><td>0.603493</td><td>0.331625</td><td>0.347826</td><td>0.143098</td></tr>\n",
 "    <tr><th>8</th><td>A man is playing soccer.</td><td>A man is playing flute.</td><td>1.0</td><td>0.2</td><td>0.576015</td><td>0.094140</td><td>0.791667</td><td>0.602975</td></tr>\n",
 "    <tr><th>9</th><td>Two little girls are talking on the phone.</td><td>A little girl is walking down the street.</td><td>0.5</td><td>0.1</td><td>0.614359</td><td>0.390625</td><td>0.642857</td><td>0.155929</td></tr>\n",
 "    <tr><th>10</th><td>The man is riding a horse.</td><td>A woman is using a hoe.</td><td>0.0</td><td>0.0</td><td>0.599499</td><td>-0.028931</td><td>0.653846</td><td>0.127360</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " sentence1 \\\n", "0 A man with a hard hat is dancing. \n", "1 A man is fitting silencer on a pistol. \n", "2 Kittens are eating food. \n", "3 A woman is mixing ingrediants. \n", "4 A woman is cooking eggs. \n", "5 Someone is beating an egg. \n", "6 A small baby is playing a guitar. \n", "7 I think it is still feasible to store seeds un... \n", "8 A man is playing soccer. \n", "9 Two little girls are talking on the phone. \n", "10 The man is riding a horse. \n", "\n", " sentence2 score \\\n", "0 A man wearing a hard hat is dancing. 5.0 \n", "1 A man is adding a silencer to a gun. 4.5 \n", "2 Kittens are eating from dishes. 4.0 \n", "3 A woman is mixing food in a bowl. 3.5 \n", "4 A woman is cooking something. 3.0 \n", "5 A woman stirs eggs in a bowl. 2.5 \n", "6 A boy sits on a bed, sings and plays a guitar. 2.0 \n", "7 I haven't tried storing tomato seeds myself, b... 1.5 \n", "8 A man is playing flute. 1.0 \n", "9 A little girl is walking down the street. 0.5 \n", "10 A woman is using a hoe. 
0.0 \n", "\n", " normalized_score Cross-Encoder Bi-Encoder Levenshtein TF-IDF \n", "0 1.0 0.608921 0.979907 0.861111 0.716812 \n", "1 0.9 0.614019 0.874845 0.631579 0.336097 \n", "2 0.8 0.613255 0.872530 0.741935 0.510149 \n", "3 0.7 0.602167 0.440890 0.606061 0.450176 \n", "4 0.6 0.599842 0.619852 0.724138 0.602975 \n", "5 0.5 0.593724 0.435095 0.310345 0.000000 \n", "6 0.4 0.593979 0.505967 0.456522 0.087044 \n", "7 0.3 0.603493 0.331625 0.347826 0.143098 \n", "8 0.2 0.576015 0.094140 0.791667 0.602975 \n", "9 0.1 0.614359 0.390625 0.642857 0.155929 \n", "10 0.0 0.599499 -0.028931 0.653846 0.127360 " ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tfidf_vectorizer = TfidfVectorizer()\n", "\n", "test_samples['TF-IDF'] = test_samples.apply(\n", " lambda x: cosine_similarity(tfidf_vectorizer.fit_transform([x['sentence1'], x['sentence2']]))[0][1], axis=1)\n", "test_samples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random Forest\n" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "# Load the fine-tuned sentence transformer (an embedding model, despite the variable name)\n", "bert_based_tokenizer = SentenceTransformer(model_save_path)\n", "# Standardizes each feature to zero mean and unit variance\n", "scaler = StandardScaler()\n", "\n", "def get_features(dataset):\n", "    # Convert sentences to embeddings\n", "    embeddings1 = bert_based_tokenizer.encode(dataset['sentence1'], convert_to_tensor=True).cpu()\n", "    embeddings2 = bert_based_tokenizer.encode(dataset['sentence2'], convert_to_tensor=True).cpu()\n", "    # Use the element-wise absolute difference of the embeddings as features\n", "    features = abs(embeddings1 - embeddings2).numpy()\n", "    # Labels: gold scores rescaled from [0, 5] to [0, 1]\n", "    labels = np.array(dataset['score']) / 5.0\n", "    # NOTE: fit_transform refits the scaler on every call, so the test set is\n", "    # standardized with its own statistics rather than the training set's\n", "    return scaler.fit_transform(features), labels\n", "\n", "X_train, y_train = get_features(train_dataset)\n", "X_test, y_test = get_features(test_dataset)" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Mean Squared Error: 0.03355835914744937\n" ] } ], "source": [ "# Initialize the Random Forest regressor\n", "random_forest = RandomForestRegressor(n_estimators=100, random_state=42)\n", "\n", "# Train the model\n", "random_forest.fit(X_train, y_train)\n", "\n", "# Predict on the test set\n", "y_pred = random_forest.predict(X_test)\n", "\n", "# Calculate the mean squared error\n", "mse = mean_squared_error(y_test, y_pred)\n", "print(f\"Mean Squared Error: {mse}\")" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['trained_model_random_forest.joblib']" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Save the model\n", "joblib.dump(random_forest, 'trained_model_random_forest.joblib')" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
 "<table border=\"1\" class=\"dataframe\">\n",
 "  <thead>\n",
 "    <tr style=\"text-align: right;\"><th></th><th>sentence1</th><th>sentence2</th><th>score</th><th>normalized_score</th><th>Cross-Encoder</th><th>Bi-Encoder</th><th>Levenshtein</th><th>TF-IDF</th><th>RandomForest</th></tr>\n",
 "  </thead>\n",
 "  <tbody>\n",
 "    <tr><th>0</th><td>A man with a hard hat is dancing.</td><td>A man wearing a hard hat is dancing.</td><td>5.0</td><td>1.0</td><td>0.608921</td><td>0.979907</td><td>0.861111</td><td>0.716812</td><td>0.969502</td></tr>\n",
 "    <tr><th>1</th><td>A man is fitting silencer on a pistol.</td><td>A man is adding a silencer to a gun.</td><td>4.5</td><td>0.9</td><td>0.614019</td><td>0.874845</td><td>0.631579</td><td>0.336097</td><td>0.748200</td></tr>\n",
 "    <tr><th>2</th><td>Kittens are eating food.</td><td>Kittens are eating from dishes.</td><td>4.0</td><td>0.8</td><td>0.613255</td><td>0.872530</td><td>0.741935</td><td>0.510149</td><td>0.723038</td></tr>\n",
 "    <tr><th>3</th><td>A woman is mixing ingrediants.</td><td>A woman is mixing food in a bowl.</td><td>3.5</td><td>0.7</td><td>0.602167</td><td>0.440890</td><td>0.606061</td><td>0.450176</td><td>0.372034</td></tr>\n",
 "    <tr><th>4</th><td>A woman is cooking eggs.</td><td>A woman is cooking something.</td><td>3.0</td><td>0.6</td><td>0.599842</td><td>0.619852</td><td>0.724138</td><td>0.602975</td><td>0.453100</td></tr>\n",
 "    <tr><th>5</th><td>Someone is beating an egg.</td><td>A woman stirs eggs in a bowl.</td><td>2.5</td><td>0.5</td><td>0.593724</td><td>0.435095</td><td>0.310345</td><td>0.000000</td><td>0.312862</td></tr>\n",
 "    <tr><th>6</th><td>A small baby is playing a guitar.</td><td>A boy sits on a bed, sings and plays a guitar.</td><td>2.0</td><td>0.4</td><td>0.593979</td><td>0.505967</td><td>0.456522</td><td>0.087044</td><td>0.345736</td></tr>\n",
 "    <tr><th>7</th><td>I think it is still feasible to store seeds un...</td><td>I haven't tried storing tomato seeds myself, b...</td><td>1.5</td><td>0.3</td><td>0.603493</td><td>0.331625</td><td>0.347826</td><td>0.143098</td><td>0.440634</td></tr>\n",
 "    <tr><th>8</th><td>A man is playing soccer.</td><td>A man is playing flute.</td><td>1.0</td><td>0.2</td><td>0.576015</td><td>0.094140</td><td>0.791667</td><td>0.602975</td><td>0.217502</td></tr>\n",
 "    <tr><th>9</th><td>Two little girls are talking on the phone.</td><td>A little girl is walking down the street.</td><td>0.5</td><td>0.1</td><td>0.614359</td><td>0.390625</td><td>0.642857</td><td>0.155929</td><td>0.411490</td></tr>\n",
 "    <tr><th>10</th><td>The man is riding a horse.</td><td>A woman is using a hoe.</td><td>0.0</td><td>0.0</td><td>0.599499</td><td>-0.028931</td><td>0.653846</td><td>0.127360</td><td>0.235032</td></tr>\n",
 "  </tbody>\n",
 "</table>\n",
 "</div>
" ], "text/plain": [ " sentence1 \\\n", "0 A man with a hard hat is dancing. \n", "1 A man is fitting silencer on a pistol. \n", "2 Kittens are eating food. \n", "3 A woman is mixing ingrediants. \n", "4 A woman is cooking eggs. \n", "5 Someone is beating an egg. \n", "6 A small baby is playing a guitar. \n", "7 I think it is still feasible to store seeds un... \n", "8 A man is playing soccer. \n", "9 Two little girls are talking on the phone. \n", "10 The man is riding a horse. \n", "\n", " sentence2 score \\\n", "0 A man wearing a hard hat is dancing. 5.0 \n", "1 A man is adding a silencer to a gun. 4.5 \n", "2 Kittens are eating from dishes. 4.0 \n", "3 A woman is mixing food in a bowl. 3.5 \n", "4 A woman is cooking something. 3.0 \n", "5 A woman stirs eggs in a bowl. 2.5 \n", "6 A boy sits on a bed, sings and plays a guitar. 2.0 \n", "7 I haven't tried storing tomato seeds myself, b... 1.5 \n", "8 A man is playing flute. 1.0 \n", "9 A little girl is walking down the street. 0.5 \n", "10 A woman is using a hoe. 
0.0 \n", "\n", " normalized_score Cross-Encoder Bi-Encoder Levenshtein TF-IDF \\\n", "0 1.0 0.608921 0.979907 0.861111 0.716812 \n", "1 0.9 0.614019 0.874845 0.631579 0.336097 \n", "2 0.8 0.613255 0.872530 0.741935 0.510149 \n", "3 0.7 0.602167 0.440890 0.606061 0.450176 \n", "4 0.6 0.599842 0.619852 0.724138 0.602975 \n", "5 0.5 0.593724 0.435095 0.310345 0.000000 \n", "6 0.4 0.593979 0.505967 0.456522 0.087044 \n", "7 0.3 0.603493 0.331625 0.347826 0.143098 \n", "8 0.2 0.576015 0.094140 0.791667 0.602975 \n", "9 0.1 0.614359 0.390625 0.642857 0.155929 \n", "10 0.0 0.599499 -0.028931 0.653846 0.127360 \n", "\n", " RandomForest \n", "0 0.969502 \n", "1 0.748200 \n", "2 0.723038 \n", "3 0.372034 \n", "4 0.453100 \n", "5 0.312862 \n", "6 0.345736 \n", "7 0.440634 \n", "8 0.217502 \n", "9 0.411490 \n", "10 0.235032 " ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from encode_sentences import encode_sentences\n", "\n", "test_samples['RandomForest'] = test_samples.apply(\n", " lambda x: random_forest.predict(encode_sentences(model, x['sentence1'], x['sentence2']))[0], axis=1)\n", "test_samples" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [], "source": [ "%load_ext autoreload\n", "%autoreload 2" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.637266])" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from encode_sentences import encode_sentences\n", "\n", "joblib.load('trained_model_random_forest.joblib').predict(encode_sentences(model, 'sentence1', 'sentence2'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Deployment\n" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. 
Disabling parallelism to avoid deadlocks...\n", "To disable this warning, you can either:\n", "\t- Avoid using `tokenizers` before the fork if possible\n", "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0m\n", "\u001b[34m\u001b[1m You can now view your Streamlit app in your browser.\u001b[0m\n", "\u001b[0m\n", "\u001b[34m Local URL: \u001b[0m\u001b[1mhttp://localhost:8501\u001b[0m\n", "\u001b[34m Network URL: \u001b[0m\u001b[1mhttp://192.168.1.107:8501\u001b[0m\n", "\u001b[0m\n", "\u001b[34m\u001b[1m For better performance, install the Watchdog module:\u001b[0m\n", "\n", " $ xcode-select --install\n", " $ pip install watchdog\n", " \u001b[0m\n", "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at trained_model_stsbenchmark_bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n", "Some weights of BertForSequenceClassification were not initialized from the model checkpoint at trained_model_stsbenchmark_bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n", "^C\n", "\u001b[34m Stopping...\u001b[0m\n" ] } ], "source": [ "!streamlit run app.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code and deployment are at: https://huggingface.co/spaces/mckabue/text-similarity-prediction-and-analysis\n" ] } ], "metadata": { "kernelspec": { "display_name": "dss-env", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", 
"version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 2 }