Spaces:

Devika
/

Briefly

Runtime error

File size: 9,647 Bytes

cd24a6a

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Briefly\n",
    "\n",
    "\n",
    "### __ Problem Statement __\n",
    "- Obtain news from google news articles\n",
    "- Sammarize the articles within 60 words\n",
    "- Obtain keywords from the articles\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "##### Importing all the necessary libraries required to run the following code "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gnewsclient import gnewsclient   # for fetching google news\n",
    "from newspaper import Article         # to obtain text from news articles\n",
    "from transformers import pipeline     # to summarize text\n",
    "import spacy                          # for named entity recognition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Load sshleifer/distilbart-cnn-12-6 model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_model():                           \n",
    "    model = pipeline('summarization')\n",
    "    return model\n",
    "data = gnewsclient.NewsClient(max_results=0)\n",
    "nlp = spacy.load(\"en_core_web_lg\") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Obtain urls and it's content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getNews(topic,location):                \n",
    "    count=0\n",
    "    contents=[]\n",
    "    titles=[]\n",
    "    authors=[]\n",
    "    urls=[]\n",
    "    data = gnewsclient.NewsClient(language='english',location=location,topic=topic,max_results=10) \n",
    "    news = data.get_news()  \n",
    "    for item in news:\n",
    "        url=item['link']\n",
    "        article = Article(url)\n",
    "        try:\n",
    "            article.download()\n",
    "            article.parse()\n",
    "            temp=item['title'][::-1]\n",
    "            index=temp.find(\"-\")\n",
    "            temp=temp[:index-1][::-1]\n",
    "            urls.append(url)\n",
    "            contents.append(article.text)\n",
    "            titles.append(item['title'][:-index-1])    \n",
    "            authors.append(temp)\n",
    "            count+=1\n",
    "            if(count==5):\n",
    "                break\n",
    "        except:\n",
    "            continue \n",
    "    return contents,titles,authors,urls         "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Summarizes the content- minimum word limit 30 and maximum 60"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getNewsSummary(contents,summarizer):   \n",
    "    summaries=[]     \n",
    "    for content in contents:\n",
    "        minimum=len(content.split())\n",
    "        summaries.append(summarizer(content,max_length=60,min_length=min(30,minimum),do_sample=False,truncation=True)[0]['summary_text'])   \n",
    "    return summaries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Named Entity Recognition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Obtain 4 keywords from content (person,organisation or geopolitical entity) \n",
    "def generateKeyword(contents):            \n",
    "    keywords=[]\n",
    "    words=[]      \n",
    "    labels=[\"PERSON\",\"ORG\",\"GPE\"]\n",
    "    for content in contents:\n",
    "        doc=nlp(content)\n",
    "        keys=[]\n",
    "        limit=0\n",
    "        for ent in doc.ents:\n",
    "            key=ent.text.upper()\n",
    "            label=ent.label_\n",
    "            if(key not in words and key not in keywords and label in labels): \n",
    "                keys.append(key)\n",
    "                limit+=1\n",
    "                for element in key.split():\n",
    "                    words.append(element)\n",
    "            if(limit==4):\n",
    "                keywords.append(keys)\n",
    "                break                           \n",
    "    return keywords\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Displaying keywords "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def printKeywords(keywords):\n",
    "    for keyword in keywords:\n",
    "        print(keyword)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Displaying the Summary with keywords in it highlighted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def printSummary(summaries,titles):\n",
    "    for summary,title in zip(summaries,titles):\n",
    "        print(title.upper(),'\\n')\n",
    "        print(summary)\n",
    "        print(\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)\n"
     ]
    }
   ],
   "source": [
    "summarizer=load_model() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "contents,titles,authors,urls=getNews(\"Sports\",\"India\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "summaries=getNewsSummary(contents,summarizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "keywords=generateKeyword(contents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['INDIA', 'SCOTLAND', 'SUPER 12', 'DUBAI']\n",
      "[\"VIRAT KOHLI'S\", 'TEAM INDIA', 'DHONI', 'UAE']\n",
      "['AUSTRALIA', 'AFGHANISTAN', 'CRICKET AUSTRALIA', 'CRICBUZZ STAFF •']\n",
      "['GARY STEAD', 'TRENT BOULT', 'COLIN DE GRANDHOMME', 'BLACKCAPS']\n",
      "['DWAYNE BRAVO', 'SRI LANKA', 'ICC', 'THE WEST INDIES']\n"
     ]
    }
   ],
   "source": [
    "printKeywords(keywords)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "T20 WORLD CUP 2021, IND VS SCO PREVIEW: INDIA FACE SCOTLAND, EYE ANOTHER BIG WIN  \n",
      "\n",
      " India take on Scotland in a Super 12 clash of the 2021 T20 World Cup in Dubai on Friday . Virat Kohli-led side beat Afghanistan by 66 runs in Abu Dhabi on Wednesday . India must win their remaining two games while maintaining high run rates and hope for New Zealand to\n",
      "\n",
      "\n",
      "\n",
      "‘THERE ARE MANY CANDIDATES BUT HE’S THE BEST': SEHWAG PICKS NEXT INDIA CAPTAIN AFTER KOHLI STEPS DOWN AT END OF T20 WC  \n",
      "\n",
      " Virat Kohli set to step down as T20I captain after this World Cup in UAE and Oman . Many experts are anticipating his deputy Rohit Sharma to fill up the position . Former India opener Virender Sehwag backed Rohit as the ideal candidate .\n",
      "\n",
      "\n",
      "\n",
      "ONE-OFF TEST VS AFGHANISTAN POSTPONED, CONFIRMS CRICKET AUSTRALIA | CRICBUZZ.COM - CRICBUZZ  \n",
      "\n",
      " Cricket Australia's one-off Test against Afghanistan has officially been postponed . The historic Test has been hanging in the balance since the CA revealed that they wouldn't support the Taliban government's stance against the inclusion of women in sports . Instead of cancelling the Test match, CA has vowed to\n",
      "\n",
      "\n",
      "\n",
      "NEW ZEALAND INCLUDE FIVE SPINNERS FOR INDIA TOUR, TRENT BOULT OPTS OUT CITING BUBBLE FATIGUE  \n",
      "\n",
      " New Zealand name five spinners in 15-man squad for two-Test series against India . Senior pacer Trent Boult and fast-bowling all-rounder Colin de Grandhomme will miss tour due to bio-bubble fatigue . Ajaz Patel, Will Somerville and\n",
      "\n",
      "\n",
      "\n",
      "T20 WORLD CUP 2021: WEST INDIES AND CHENNAI SUPER KINGS ALL-ROUNDER DWAYNE BRAVO TO RETIRE AFTER SHOWPIECE...  \n",
      "\n",
      " West Indies all-rounder Dwayne Bravo will hang his boots at the end of the ICC T20 World Cup 2021 . Bravo told ICC on the post-match Facebook Live show that he will be drawing the curtains on his international career . West Indies lost to Sri Lanka by 20 runs in\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "printSummary(summaries,titles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}