{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Viral Tweets: User exploration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will explore the users who have tweeted viral tweets. Namely, we will focus our analysis on the viral tweets from the user point of view. For example, we'll examine the popularity of the user vs the popularity of his tweets, the history of his tweets and analyze any flagrant changes in their features when they became viral, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0 - Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "from tqdm import tqdm\n", "\n", "#pd.set_option('display.max_rows', None)\n", "pd.set_option('display.max_columns', None)\n", "\n", "DATA_PATH = \"../../data\"\n", "VIRAL_TWEETS_PATH = f\"{DATA_PATH}/viral_users\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from helper.twitter_client_wrapper import TwitterClientWrapper, EXPANSIONS, MEDIA_FIELDS, TWEET_FIELDS, USER_FIELDS\n", "\n", "twitter_client_wrapper = TwitterClientWrapper(\"../../api_key.yaml\", wait_on_rate_limit=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 - Retrieve the data from disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Retrieve the viral tweets data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: You may notice that all tweets have been retrieved, since some may have been deleted since scraping them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note 2**: Also keep in mind that when retrieving users, the number of users may be less because users may have two or more viral tweets in the sample of viral tweets we have. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# dtypes={\"id\": str, \"author_id\": str, \"has_media\": bool, \"possibly_sensitive\": bool}\n", "dtypes={\"id\": str, \"author_id\": str}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\steph\\AppData\\Local\\Temp\\ipykernel_18728\\1524257405.py:2: DtypeWarning: Columns (3,8,14,17,18,19,20,21,22,23,24) have mixed types. Specify dtype option on import or set low_memory=False.\n", " viral_tweets_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/all_tweets.csv\", dtype=dtypes, escapechar='\\\\', encoding='utf-8')\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
created_atauthor_idtextpossibly_sensitiveedit_history_tweet_idslangidmentionsretweet_countreply_countlike_countquote_countcontext_annotationsurlshas_mediaannotationshashtagsattachments.poll_idswithheld.copyrightwithheld.country_codeswithheld.scopecashtagsgeo.place_idgeo.coordinates.typegeo.coordinates.coordinates
02022-10-31T03:21:11.000Z1047733077898739712@manjirosx you too jiro🫶🏽False['1586921195059834880']en1586921195059834880[{'start': 0, 'end': 10, 'username': 'manjiros...0.00.01.00.0NaNNaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
12022-10-31T03:13:57.000Z1047733077898739712@ilyicey u omdFalse['1586919376086704129']nl1586919376086704129[{'start': 0, 'end': 8, 'username': 'ilyicey',...0.00.00.00.0NaNNaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
22022-10-31T03:13:24.000Z1047733077898739712@ilyicey i’m fineFalse['1586919239243296768']en1586919239243296768[{'start': 0, 'end': 8, 'username': 'ilyicey',...1.01.02.00.0NaNNaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
32022-10-30T22:49:53.000Z1047733077898739712@imVolo_ I’ll unfollow rnFalse['1586852923706732544']en1586852923706732544[{'start': 0, 'end': 8, 'username': 'imVolo_',...0.00.03.00.0NaNNaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
42022-10-30T22:45:33.000Z1047733077898739712“what do you want to be for halloween?” his li...False['1586851830767591424']en1586851830767591424NaN611.019.04132.055.0[{'domain': {'id': '29', 'name': 'Events [Enti...NaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " created_at author_id \\\n", "0 2022-10-31T03:21:11.000Z 1047733077898739712 \n", "1 2022-10-31T03:13:57.000Z 1047733077898739712 \n", "2 2022-10-31T03:13:24.000Z 1047733077898739712 \n", "3 2022-10-30T22:49:53.000Z 1047733077898739712 \n", "4 2022-10-30T22:45:33.000Z 1047733077898739712 \n", "\n", " text possibly_sensitive \\\n", "0 @manjirosx you too jiro🫶🏽 False \n", "1 @ilyicey u omd False \n", "2 @ilyicey i’m fine False \n", "3 @imVolo_ I’ll unfollow rn False \n", "4 “what do you want to be for halloween?” his li... False \n", "\n", " edit_history_tweet_ids lang id \\\n", "0 ['1586921195059834880'] en 1586921195059834880 \n", "1 ['1586919376086704129'] nl 1586919376086704129 \n", "2 ['1586919239243296768'] en 1586919239243296768 \n", "3 ['1586852923706732544'] en 1586852923706732544 \n", "4 ['1586851830767591424'] en 1586851830767591424 \n", "\n", " mentions retweet_count \\\n", "0 [{'start': 0, 'end': 10, 'username': 'manjiros... 0.0 \n", "1 [{'start': 0, 'end': 8, 'username': 'ilyicey',... 0.0 \n", "2 [{'start': 0, 'end': 8, 'username': 'ilyicey',... 1.0 \n", "3 [{'start': 0, 'end': 8, 'username': 'imVolo_',... 0.0 \n", "4 NaN 611.0 \n", "\n", " reply_count like_count quote_count \\\n", "0 0.0 1.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 1.0 2.0 0.0 \n", "3 0.0 3.0 0.0 \n", "4 19.0 4132.0 55.0 \n", "\n", " context_annotations urls has_media \\\n", "0 NaN NaN False \n", "1 NaN NaN False \n", "2 NaN NaN False \n", "3 NaN NaN False \n", "4 [{'domain': {'id': '29', 'name': 'Events [Enti... NaN False \n", "\n", " annotations hashtags attachments.poll_ids withheld.copyright \\\n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN \n", "\n", " withheld.country_codes withheld.scope cashtags geo.place_id \\\n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN \n", "\n", " geo.coordinates.type geo.coordinates.coordinates \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import tweets first\n", "viral_tweets_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/all_tweets.csv\", dtype=dtypes, escapechar='\\\\', encoding='utf-8')\n", "# viral_tweets_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/all_tweets.csv\", dtype=dtypes)\n", "viral_tweets_df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'RT @strbrkrr: apple be like \"high volume may damage your ears...\" ok… i don’t care'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "viral_tweets_df[~viral_tweets_df.annotations.isna()].text.iloc[10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - Retrieve viral tweets users" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by retrieving the viral tweets users. Users are **included as expansions** when retrieving the tweets, conveniently so. For each user, we retrieve this user's history and information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve the user id. 
The user data is included in the 'includes' field, which we get back whenever we request any expansions\n", "users_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/users.csv\", dtype={\"id\": str, \"pinned_tweet_id\": str}, escapechar=\"\\\\\")\n", "users_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'''\n", "id object\n", "edit_history_tweet_ids object\n", "author_id object\n", "created_at object\n", "possibly_sensitive bool\n", "text object\n", "retweet_count int64\n", "reply_count int64\n", "like_count int64\n", "quote_count int64\n", "has_media bool\n", "urls object\n", "context_annotations object\n", "annotations object\n", "hashtags object\n", "geo.place_id object\n", "mentions object\n", "dtype: object\n", "'''\n", "viral_tweets_df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 - Analysis of a single user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's observe the tweets of a single user who has tweeted viral tweets. We'll analyze their features to see what changed in the user's tweets over time, and how those changes reflect changes in the user's behaviour." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Take the first user\n", "user_id = users_df.iloc[0].id" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copy to avoid SettingWithCopyWarning when adding columns below\n", "user_tweets = viral_tweets_df[viral_tweets_df.author_id == user_id].copy()\n", "user_tweets['created_at'] = pd.to_datetime(user_tweets.created_at)\n", "user_tweets.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(10,5))\n", "\n", "ax[0].set_title(\"Retweet Count vs Tweet Date\")\n", "sns.lineplot(user_tweets, x='created_at', y='retweet_count', ax=ax[0])\n", "\n", "ax[1].set_title(\"Like Count vs Tweet Date\")\n", "sns.lineplot(user_tweets, x='created_at', y='like_count', ax=ax[1])\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(10,5))\n", "\n", "user_tweets['tweet_length'] = user_tweets['text'].apply(len)\n", "\n", "ax[0].set_title(\"Retweet Count vs Tweet Length\")\n", "sns.lineplot(user_tweets, x='tweet_length', y='retweet_count', ax=ax[0])\n", "\n", "ax[1].set_title(\"Like Count vs Tweet Length\")\n", "sns.lineplot(user_tweets, x='tweet_length', y='like_count', ax=ax[1])\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Has media\n", "sns.jointplot(user_tweets, x='has_media', y='retweet_count')\n", "\n", "plt.suptitle(\"# Retweets vs Tweet has media\")\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(user_tweets[['tweet_length', 'has_media', 'retweet_count', 'like_count']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(2, 2, figsize=(10,5))\n", "\n", "user_tweets['tweet_length'] = user_tweets['text'].apply(len)\n", "\n", "ax[0][0].set_title(\"Retweet Count vs Date\")\n", "sns.lineplot(user_tweets, x='created_at', y='retweet_count', ax=ax[0][0])\n", "\n", "ax[0][1].set_title(\"Like Count vs Date\")\n", "sns.lineplot(user_tweets, x='created_at', y='like_count', ax=ax[0][1])\n", "\n", "ax[1][0].set_title(\"Has Media vs Date\")\n", 
"sns.scatterplot(user_tweets, x='created_at', y='has_media', ax=ax[1][0])\n", "\n", "ax[1][1].set_title(\"Tweet Length vs Date\")\n", "sns.scatterplot(user_tweets, x='created_at', y='tweet_length', ax=ax[1][1])\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### TODO: Analyze the change in tweet features depending on date (one row depending on date, other depending on retweet count to reflect the evolution)\n", "### TODO: Concentration on topics [group by topics for a sample user]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 - Aggregate Analysis of all viral users tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.0 - How many tweets per user retrieved" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets_per_user = viral_tweets_df.groupby(by='author_id').size().reset_index(name='count')\n", "tweets_per_user.sort_values(by='count')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets_per_user.hist(column='count', bins=10)\n", "plt.title(\"Histogram of distribution of number of tweets retrieved per user\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 - Retweet count vs like count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to come up with a metric for the **virality** of the tweet, we need to know which features we will use to determine this metric. *retweet_count* and *like_count* will surely be among those features selected. Let's how the two correlate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE**: \"The retweet will not show the likes and replies, only retweet count. You need to get the counts from the original tweet, which would be referenced in referenced_tweets and included in includes.tweets part of the response.\" - Twitter Community" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Remove all tweets that might be retweets of others\n", "retweeted = viral_tweets_df.retweet_count !=0\n", "liked = viral_tweets_df.like_count !=0\n", "original_tweets_df = viral_tweets_df[retweeted & liked]\n", "\n", "# Remove NA in retweet and like count\n", "original_tweets_df = original_tweets_df.dropna(axis=0, subset=['retweet_count', 'like_count'])\n", "\n", "sns.scatterplot(data=original_tweets_df, x='retweet_count', y='like_count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding**: We can see more or less a linear correlation. Especially for lower numbers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 - (# Retweets / # followers ) ratio \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here a viable metric for a viral tweet can be the ratio between the retweets (or like) count over the followers count of the user. The idea here is that a user who doesn't have many followers, but has tweeted tweets that have garnered a lot of retweets or likes, can most definitely be considered \"viral\". On the other hand, a user who has many followers can have a standard high # retweets and those cannot be considered viral all the time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: Also note that historical data for the evolution of the # of followers of a user are not easily available and are not provided by the Twitter API. 
So the calculated ratios do not reflect the actual ratio at the time a tweet was posted, since the user may have gained many followers since then." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_tweets_df_subset = original_tweets_df[['id', 'author_id', 'retweet_count', 'like_count']]\n", "\n", "# Remove NA in follower count\n", "users_df_subset = users_df.dropna(axis=0, subset=['followers_count'])\n", "\n", "# Merge both on author id\n", "tweets_users_merged_df = viral_tweets_df_subset.merge(\n", "    right=users_df_subset[['id', 'followers_count']].set_index('id'), left_on='author_id', right_on='id')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets_users_merged_df['retweets_followers_ratio'] = tweets_users_merged_df['retweet_count'] / tweets_users_merged_df['followers_count']\n", "tweets_users_merged_df.sort_values(by='retweets_followers_ratio')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.express as px\n", "\n", "df_ratios_bigger_than_1 = tweets_users_merged_df[tweets_users_merged_df.retweets_followers_ratio > 1.0]\n", "fig = px.histogram(\n", "    df_ratios_bigger_than_1,\n", "    x=\"retweets_followers_ratio\",\n", "    nbins=10,\n", "    log_y=True)\n", "\n", "fig.update_layout(\n", "    title={\n", "        'text': \"Histogram of the distribution of the retweets/followers ratio > 1\",\n", "        'y':0.9,\n", "        'x':0.5,\n", "        'xanchor': 'center',\n", "        'yanchor': 'top'})\n", "\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The histogram is not very clear, since we have rare events where a tweet garnered enormous popularity relative to the popularity of its author. Those we can definitely consider viral. Maybe we can try K-means to better identify these outliers, as sketched below." 
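] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Caveat before clustering**: `retweet_count` and `followers_count` differ by orders of magnitude, so a distance-based method like K-means on the raw counts will be dominated by whichever feature is larger. A log transform followed by standardization is a reasonable precaution; the cell below is a minimal sketch under that assumption, reusing `df_ratios_bigger_than_1` from above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Log-scale both counts so neither dominates the Euclidean distance,\n", "# then standardize to zero mean / unit variance before clustering\n", "X_log = np.log1p(df_ratios_bigger_than_1[['retweet_count', 'followers_count']].to_numpy())\n", "X_scaled = StandardScaler().fit_transform(X_log)\n", "\n", "scaled_kmeans = KMeans(n_clusters=3, random_state=123).fit(X_scaled)\n", "np.bincount(scaled_kmeans.labels_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With scaled features, cluster boundaries reflect relative rather than absolute differences in popularity, which matches the ratio intuition above. For comparison, the next cells cluster the raw counts."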
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "n_clusters = 3\n", "X = np.array(df_ratios_bigger_than_1[['retweet_count', 'followers_count']])\n", "#X = np.vstack((df_ratios_bigger_than_1.retweet_count.to_numpy(), df_ratios_bigger_than_1.followers_count.to_numpy()))\n", "#X = df_ratios_bigger_than_1.retweets_followers_ratio.to_numpy().reshape(-1, 1)\n", "ratio_kmeans = KMeans(n_clusters=n_clusters, random_state=123).fit(X)\n", "\n", "#np.vstack((X[:, 0], X[:, 1], ratio_kmeans.labels_)).reshape(-1, 3)\n", "#px.scatter(ratio_kmeans, x=)\n", "'''\n", "plt.title(f'K-Means clustering of #retweets/#followers ratio with k={n_clusters}')\n", "plt.xlabel('Retweets')\n", "plt.ylabel('Followers')\n", "plt.scatter(X[:, 0], X[:, 1], c=ratio_kmeans.labels_)\n", "'''" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans_results_df = pd.DataFrame(X, columns=['retweet_count', 'follower_count']) \n", "kmeans_results_df['label'] = ratio_kmeans.labels_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px.scatter(kmeans_results_df, x='follower_count', y='retweet_count', color='label')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3 - Metric (# Retweets / avg #retweets of a user)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# avg_nb_retweets_per_user = viral_tweets_df_subset.groupby(by='author_id').agg({'retweet_count': ['min', 'mean', 'max'], 'like_count': ['min', 'mean', 'max']})\n", "avg_nb_retweets_per_user = viral_tweets_df_subset.groupby(by='author_id').retweet_count.agg(['min', 'mean', 'max'])\n", "avg_nb_retweets_per_user" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ratio_retweet_avg_retweets_df = viral_tweets_df_subset.merge(avg_nb_retweets_per_user, on='author_id')\n", "ratio_retweet_avg_retweets_df['per_user_performance'] = ratio_retweet_avg_retweets_df['retweet_count'] / ratio_retweet_avg_retweets_df['mean']\n", "ratio_retweet_avg_retweets_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bigger_than_mean = ratio_retweet_avg_retweets_df[ratio_retweet_avg_retweets_df.per_user_performance > 1]\n", "hist = px.histogram(bigger_than_mean, x='per_user_performance', log_y=True)\n", "\n", "hist.update_layout(title_text=\"Distribution of tweet performance wrt avg #retweets per user\", xaxis_title=\"Tweet performance\", yaxis_title=\"log count\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding**: We established another metric by which we can judge the virality of a tweet, namely the number of retweets vs the average number of retweets per user. We can set a threshold (e.g. > 2) to decide whether a tweet is viral or not. We can also conduct further analysis over those tweets to determine what sets them apart from the others." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.4 - Tweet Topic (context annotations)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What topics are available? Context annotations are Twitter's version of analyzing the topic of a tweet. They are defined as a context **domain** and **entity**. The domain is like a general topic and entity is like a subtopic or a specific topic within the general domain." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json \n", "\n", "tweets_with_topics = original_tweets_df.dropna(axis=0, subset='context_annotations')\n", "\n", "def topic_to_json(x):\n", " try:\n", " return json.loads(x.replace('\\'', '\"'))\n", " except json.JSONDecodeError:\n", " print(\"Nope\")\n", " return []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO tomorrow:\n", "- Try sample and make it work with context annotations.\n", "- Check if has media is not null\n", "- hashtags extract tags\n", "- Extract context annotations\n", "- Use Celia Bearer Token" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tweepy import Paginator, TooManyRequests\n", "client = twitter_client_wrapper.client\n", "#tweet_data = twitter_client_wrapper.client.get_users_tweets(id='1584975692126900225', expansions=EXPANSIONS, user_fields=USER_FIELDS, tweet_fields=TWEET_FIELDS, media_fields=MEDIA_FIELDS, exclude='retweets')\n", "\n", "viral_users_tweets = []\n", "# Number of users processed so far\n", "try:\n", " for tweet in Paginator(client.get_users_tweets, id='1482846121517096961', tweet_fields=TWEET_FIELDS, exclude=\"retweets\").flatten(limit=20):\n", " viral_users_tweets.append(tweet.data)\n", "except TooManyRequests:\n", " print(\"Hit Rate Limit\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domains = {}\n", "entities = {}\n", "for tweet in viral_users_tweets:\n", " context_annotations = tweet.get('context_annotations', [])\n", " tweet_topic_domains = dict([(topic['domain']['id'], topic['domain']) for topic in context_annotations])\n", " domains.update(tweet_topic_domains)\n", " tweet_topic_entities = dict([(topic['entity']['id'], topic['entity']) for topic in context_annotations])\n", " entities.update(tweet_topic_entities)\n", " tweet['topic_domain'] = list(tweet_topic_domains.keys())\n", " tweet['topic_entity'] = list(tweet_topic_entities.keys())\n", " tweet.pop('context_annotations', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "with open('topic_domains.pickle', 'wb') as handle:\n", " pickle.dump(entities, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", "\n", "with open('topic_domains.pickle', 'rb') as handle:\n", " b = pickle.load(handle)\n", "\n", "b" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " with open('topic_domains.pickle', 'rb') as handle:\n", " topic_domains = pickle.load(handle)\n", "except FileNotFoundError:\n", " topic_domains = {}\n", "\n", "topic_domains" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp = pd.json_normalize(viral_users_tweets)\n", "#temp[temp.context_annotations.notna()]\n", "temp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domains" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = pd.Series([b[item]['name'] for items in temp.topic_domain.values for item in items])\n", "s.groupby(s).count().sort_values()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_users_tweets_2 = []\n", "# Number of users processed so far\n", "try:\n", " for tweet in Paginator(client.get_users_tweets, id='848263392943058944', tweet_fields=TWEET_FIELDS, 
exclude=\"retweets\").flatten(limit=100):\n", " viral_users_tweets_2.append(tweet.data)\n", "except TooManyRequests:\n", " print(\"Hit Rate Limit\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domains = {}\n", "entities = {}\n", "for tweet in viral_users_tweets_2:\n", " context_annotations = tweet.get('context_annotations', [])\n", " tweet_topic_domains = dict([(topic['domain']['id'], topic['domain']) for topic in context_annotations])\n", " domains.update(tweet_topic_domains)\n", " tweet_topic_entities = dict([(topic['entity']['id'], topic['entity']) for topic in context_annotations])\n", " entities.update(tweet_topic_entities)\n", " tweet['topic_domain'] = list(tweet_topic_domains.keys()) if len(tweet_topic_domains.keys()) > 0 else pd.NA\n", " tweet['topic_entity'] = list(tweet_topic_entities.keys()) if len(tweet_topic_entities.keys()) > 0 else pd.NA\n", " #tweet.pop('context_annotations', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp2_df = pd.json_normalize(viral_users_tweets_2)\n", "first_context = temp2_df[~temp2_df.topic_domain.isna()].topic_domain.iloc[2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp2_df[~temp2_df['entities.hashtags'].isna()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp2_df.to_csv(\"temp.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ast\n", "\n", "temp2_read = pd.read_csv('temp.csv', converters={'context_annotations': lambda x: eval(x) if (x and len(x) > 0) else np.nan})\n", "first_context = temp2_read[~temp2_read.context_annotations.isna()].context_annotations.iloc[2]\n", "first_context" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eval(first_context)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def format_context_annotations(context_annotations):\n", " if (pd.isna(context_annotations)):\n", " return []\n", " else:\n", " return json.loads(context_annotations)\n", "\n", "temp2_df.context_annotations.apply(format_context_annotations)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(viral_users_tweets_2, columns=TWEET_FIELDS).to_csv('temp_2.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#tweet_data = twitter_client_wrapper.client.get_tweet(id='1584975692126900225', expansions=EXPANSIONS, user_fields=USER_FIELDS, tweet_fields=TWEET_FIELDS, media_fields=MEDIA_FIELDS)\n", "bytes(tweets_with_topics.iloc[1000].context_annotations, encoding='utf-8').decode('unicode_escape')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'46'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtypes={\"id\": str, \"author_id\": str, \"has_media\": bool, \"possibly_sensitive\": bool, \"has_hashtags\": bool}\n", "temp3 = pd.read_csv(\"145371604-to-146944733.csv\", dtype=dtypes)\n", "d = temp3[~temp3.topic_domains.isna()].topic_domains.iloc[0]\n", "eval(d)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.5 - Tweet Sentiment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "#### 3.6 - Possibly sensitive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.7 - Hashtags" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: has hashtags (using entities.hashtags)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.8 - Text preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO:\n", "- Sort by tweet date (check popularity)\n", "- Use Twitter lists to try and find\n", "- Check if reply or retweet" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.11 ('ada')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "71d2f77bccee14ca7852d7b7a1fa8ea4708b81087104d93973081337557f0ee6" } } }, "nbformat": 4, "nbformat_minor": 2 }