File size: 9,647 Bytes
cd24a6a
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Briefly\n",
    "\n",
    "\n",
    "### __ Problem Statement __\n",
    "- Obtain news from google news articles\n",
    "- Sammarize the articles within 60 words\n",
    "- Obtain keywords from the articles\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "##### Importing all the necessary libraries required to run the following code "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gnewsclient import gnewsclient   # for fetching google news\n",
    "from newspaper import Article         # to obtain text from news articles\n",
    "from transformers import pipeline     # to summarize text\n",
    "import spacy                          # for named entity recognition"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Load sshleifer/distilbart-cnn-12-6 model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_model():                           \n",
    "    model = pipeline('summarization')\n",
    "    return model\n",
    "data = gnewsclient.NewsClient(max_results=0)\n",
    "nlp = spacy.load(\"en_core_web_lg\") "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Obtain urls and it's content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getNews(topic,location):                \n",
    "    count=0\n",
    "    contents=[]\n",
    "    titles=[]\n",
    "    authors=[]\n",
    "    urls=[]\n",
    "    data = gnewsclient.NewsClient(language='english',location=location,topic=topic,max_results=10) \n",
    "    news = data.get_news()  \n",
    "    for item in news:\n",
    "        url=item['link']\n",
    "        article = Article(url)\n",
    "        try:\n",
    "            article.download()\n",
    "            article.parse()\n",
    "            temp=item['title'][::-1]\n",
    "            index=temp.find(\"-\")\n",
    "            temp=temp[:index-1][::-1]\n",
    "            urls.append(url)\n",
    "            contents.append(article.text)\n",
    "            titles.append(item['title'][:-index-1])    \n",
    "            authors.append(temp)\n",
    "            count+=1\n",
    "            if(count==5):\n",
    "                break\n",
    "        except:\n",
    "            continue \n",
    "    return contents,titles,authors,urls         "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Summarizes the content- minimum word limit 30 and maximum 60"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def getNewsSummary(contents,summarizer):   \n",
    "    summaries=[]     \n",
    "    for content in contents:\n",
    "        minimum=len(content.split())\n",
    "        summaries.append(summarizer(content,max_length=60,min_length=min(30,minimum),do_sample=False,truncation=True)[0]['summary_text'])   \n",
    "    return summaries"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Named Entity Recognition"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Obtain 4 keywords from content (person,organisation or geopolitical entity) \n",
    "def generateKeyword(contents):            \n",
    "    keywords=[]\n",
    "    words=[]      \n",
    "    labels=[\"PERSON\",\"ORG\",\"GPE\"]\n",
    "    for content in contents:\n",
    "        doc=nlp(content)\n",
    "        keys=[]\n",
    "        limit=0\n",
    "        for ent in doc.ents:\n",
    "            key=ent.text.upper()\n",
    "            label=ent.label_\n",
    "            if(key not in words and key not in keywords and label in labels): \n",
    "                keys.append(key)\n",
    "                limit+=1\n",
    "                for element in key.split():\n",
    "                    words.append(element)\n",
    "            if(limit==4):\n",
    "                keywords.append(keys)\n",
    "                break                           \n",
    "    return keywords\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Displaying keywords "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def printKeywords(keywords):\n",
    "    for keyword in keywords:\n",
    "        print(keyword)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Displaying the Summary with keywords in it highlighted"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def printSummary(summaries,titles):\n",
    "    for summary,title in zip(summaries,titles):\n",
    "        print(title.upper(),'\\n')\n",
    "        print(summary)\n",
    "        print(\"\\n\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)\n"
     ]
    }
   ],
   "source": [
    "summarizer=load_model() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "contents,titles,authors,urls=getNews(\"Sports\",\"India\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "summaries=getNewsSummary(contents,summarizer)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "keywords=generateKeyword(contents)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['INDIA', 'SCOTLAND', 'SUPER 12', 'DUBAI']\n",
      "[\"VIRAT KOHLI'S\", 'TEAM INDIA', 'DHONI', 'UAE']\n",
      "['AUSTRALIA', 'AFGHANISTAN', 'CRICKET AUSTRALIA', 'CRICBUZZ STAFF •']\n",
      "['GARY STEAD', 'TRENT BOULT', 'COLIN DE GRANDHOMME', 'BLACKCAPS']\n",
      "['DWAYNE BRAVO', 'SRI LANKA', 'ICC', 'THE WEST INDIES']\n"
     ]
    }
   ],
   "source": [
    "printKeywords(keywords)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "T20 WORLD CUP 2021, IND VS SCO PREVIEW: INDIA FACE SCOTLAND, EYE ANOTHER BIG WIN  \n",
      "\n",
      " India take on Scotland in a Super 12 clash of the 2021 T20 World Cup in Dubai on Friday . Virat Kohli-led side beat Afghanistan by 66 runs in Abu Dhabi on Wednesday . India must win their remaining two games while maintaining high run rates and hope for New Zealand to\n",
      "\n",
      "\n",
      "\n",
      "‘THERE ARE MANY CANDIDATES BUT HE’S THE BEST': SEHWAG PICKS NEXT INDIA CAPTAIN AFTER KOHLI STEPS DOWN AT END OF T20 WC  \n",
      "\n",
      " Virat Kohli set to step down as T20I captain after this World Cup in UAE and Oman . Many experts are anticipating his deputy Rohit Sharma to fill up the position . Former India opener Virender Sehwag backed Rohit as the ideal candidate .\n",
      "\n",
      "\n",
      "\n",
      "ONE-OFF TEST VS AFGHANISTAN POSTPONED, CONFIRMS CRICKET AUSTRALIA | CRICBUZZ.COM - CRICBUZZ  \n",
      "\n",
      " Cricket Australia's one-off Test against Afghanistan has officially been postponed . The historic Test has been hanging in the balance since the CA revealed that they wouldn't support the Taliban government's stance against the inclusion of women in sports . Instead of cancelling the Test match, CA has vowed to\n",
      "\n",
      "\n",
      "\n",
      "NEW ZEALAND INCLUDE FIVE SPINNERS FOR INDIA TOUR, TRENT BOULT OPTS OUT CITING BUBBLE FATIGUE  \n",
      "\n",
      " New Zealand name five spinners in 15-man squad for two-Test series against India . Senior pacer Trent Boult and fast-bowling all-rounder Colin de Grandhomme will miss tour due to bio-bubble fatigue . Ajaz Patel, Will Somerville and\n",
      "\n",
      "\n",
      "\n",
      "T20 WORLD CUP 2021: WEST INDIES AND CHENNAI SUPER KINGS ALL-ROUNDER DWAYNE BRAVO TO RETIRE AFTER SHOWPIECE...  \n",
      "\n",
      " West Indies all-rounder Dwayne Bravo will hang his boots at the end of the ICC T20 World Cup 2021 . Bravo told ICC on the post-match Facebook Live show that he will be drawing the curtains on his international career . West Indies lost to Sri Lanka by 20 runs in\n",
      "\n",
      "\n",
      "\n"
     ]
    }
   ],
   "source": [
    "printSummary(summaries,titles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}