{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Viral Tweets: User exploration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will explore the users who have tweeted viral tweets. Namely, we will focus our analysis on the viral tweets from the user point of view. For example, we'll examine the popularity of the user vs the popularity of his tweets, the history of his tweets and analyze any flagrant changes in their features when they became viral, etc." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0 - Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "from tqdm import tqdm\n", "\n", "#pd.set_option('display.max_rows', None)\n", "pd.set_option('display.max_columns', None)\n", "\n", "DATA_PATH = \"../../data\"\n", "VIRAL_TWEETS_PATH = f\"{DATA_PATH}/viral_users\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from helper.twitter_client_wrapper import TwitterClientWrapper, EXPANSIONS, MEDIA_FIELDS, TWEET_FIELDS, USER_FIELDS\n", "\n", "twitter_client_wrapper = TwitterClientWrapper(\"../../api_key.yaml\", wait_on_rate_limit=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 - Retrieve the data from disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Retrieve the viral tweets data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: You may notice that all tweets have been retrieved, since some may have been deleted since scraping them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note 2**: Also keep in mind that when retrieving users, the number of users may be less because users may have two or more viral tweets in the sample of viral tweets we have. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# dtypes={\"id\": str, \"author_id\": str, \"has_media\": bool, \"possibly_sensitive\": bool}\n", "dtypes={\"id\": str, \"author_id\": str}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\steph\\AppData\\Local\\Temp\\ipykernel_18728\\1524257405.py:2: DtypeWarning: Columns (3,8,14,17,18,19,20,21,22,23,24) have mixed types. Specify dtype option on import or set low_memory=False.\n", " viral_tweets_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/all_tweets.csv\", dtype=dtypes, escapechar='\\\\', encoding='utf-8')\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
created_atauthor_idtextpossibly_sensitiveedit_history_tweet_idslangidmentionsretweet_countreply_countlike_countquote_countcontext_annotationsurlshas_mediaannotationshashtagsattachments.poll_idswithheld.copyrightwithheld.country_codeswithheld.scopecashtagsgeo.place_idgeo.coordinates.typegeo.coordinates.coordinates
02022-10-31T03:21:11.000Z1047733077898739712@manjirosx you too jiro🫶🏽False['1586921195059834880']en1586921195059834880[{'start': 0, 'end': 10, 'username': 'manjiros...0.00.01.00.0NaNNaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
12022-10-31T03:13:57.000Z1047733077898739712@ilyicey u omdFalse['1586919376086704129']nl1586919376086704129[{'start': 0, 'end': 8, 'username': 'ilyicey',...0.00.00.00.0NaNNaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
22022-10-31T03:13:24.000Z1047733077898739712@ilyicey i’m fineFalse['1586919239243296768']en1586919239243296768[{'start': 0, 'end': 8, 'username': 'ilyicey',...1.01.02.00.0NaNNaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
32022-10-30T22:49:53.000Z1047733077898739712@imVolo_ I’ll unfollow rnFalse['1586852923706732544']en1586852923706732544[{'start': 0, 'end': 8, 'username': 'imVolo_',...0.00.03.00.0NaNNaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
42022-10-30T22:45:33.000Z1047733077898739712“what do you want to be for halloween?” his li...False['1586851830767591424']en1586851830767591424NaN611.019.04132.055.0[{'domain': {'id': '29', 'name': 'Events [Enti...NaNFalseNaNNaNNaNNaNNaNNaNNaNNaNNaNNaN
\n", "
" ], "text/plain": [ " created_at author_id \\\n", "0 2022-10-31T03:21:11.000Z 1047733077898739712 \n", "1 2022-10-31T03:13:57.000Z 1047733077898739712 \n", "2 2022-10-31T03:13:24.000Z 1047733077898739712 \n", "3 2022-10-30T22:49:53.000Z 1047733077898739712 \n", "4 2022-10-30T22:45:33.000Z 1047733077898739712 \n", "\n", " text possibly_sensitive \\\n", "0 @manjirosx you too jiro🫶🏽 False \n", "1 @ilyicey u omd False \n", "2 @ilyicey i’m fine False \n", "3 @imVolo_ I’ll unfollow rn False \n", "4 “what do you want to be for halloween?” his li... False \n", "\n", " edit_history_tweet_ids lang id \\\n", "0 ['1586921195059834880'] en 1586921195059834880 \n", "1 ['1586919376086704129'] nl 1586919376086704129 \n", "2 ['1586919239243296768'] en 1586919239243296768 \n", "3 ['1586852923706732544'] en 1586852923706732544 \n", "4 ['1586851830767591424'] en 1586851830767591424 \n", "\n", " mentions retweet_count \\\n", "0 [{'start': 0, 'end': 10, 'username': 'manjiros... 0.0 \n", "1 [{'start': 0, 'end': 8, 'username': 'ilyicey',... 0.0 \n", "2 [{'start': 0, 'end': 8, 'username': 'ilyicey',... 1.0 \n", "3 [{'start': 0, 'end': 8, 'username': 'imVolo_',... 0.0 \n", "4 NaN 611.0 \n", "\n", " reply_count like_count quote_count \\\n", "0 0.0 1.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 1.0 2.0 0.0 \n", "3 0.0 3.0 0.0 \n", "4 19.0 4132.0 55.0 \n", "\n", " context_annotations urls has_media \\\n", "0 NaN NaN False \n", "1 NaN NaN False \n", "2 NaN NaN False \n", "3 NaN NaN False \n", "4 [{'domain': {'id': '29', 'name': 'Events [Enti... NaN False \n", "\n", " annotations hashtags attachments.poll_ids withheld.copyright \\\n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN \n", "\n", " withheld.country_codes withheld.scope cashtags geo.place_id \\\n", "0 NaN NaN NaN NaN \n", "1 NaN NaN NaN NaN \n", "2 NaN NaN NaN NaN \n", "3 NaN NaN NaN NaN \n", "4 NaN NaN NaN NaN \n", "\n", " geo.coordinates.type geo.coordinates.coordinates \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import tweets first\n", "viral_tweets_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/all_tweets.csv\", dtype=dtypes, escapechar='\\\\', encoding='utf-8')\n", "# viral_tweets_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/all_tweets.csv\", dtype=dtypes)\n", "viral_tweets_df.head()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'RT @strbrkrr: apple be like \"high volume may damage your ears...\" ok… i don’t care'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "viral_tweets_df[~viral_tweets_df.annotations.isna()].text.iloc[10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 - Retrieve viral tweets users" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We start by retrieving the viral tweets users. Users are **included as expansions** when retrieving the tweets, conveniently so. For each user, we retrieve this user's history and information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Retrieve the user id. 
The user data is included in the 'includes' field, which we get back whenever we request any expansions\n", "users_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/users.csv\", dtype={\"id\": str, \"pinned_tweet_id\": str}, escapechar=\"\\\\\")\n", "users_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "'''\n", "id object\n", "edit_history_tweet_ids object\n", "author_id object\n", "created_at object\n", "possibly_sensitive bool\n", "text object\n", "retweet_count int64\n", "reply_count int64\n", "like_count int64\n", "quote_count int64\n", "has_media bool\n", "urls object\n", "context_annotations object\n", "annotations object\n", "hashtags object\n", "geo.place_id object\n", "mentions object\n", "dtype: object\n", "'''\n", "viral_tweets_df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2 - Analysis of a single user" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's observe the tweets of a single user who has tweeted viral tweets. We'll analyze their features to see what changed in the user's tweets over time, and how those changes reflect changes in the user's behaviour." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Take the first user\n", "user_id = users_df.iloc[0].id" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Copy to avoid SettingWithCopyWarning when adding columns below\n", "user_tweets = viral_tweets_df[viral_tweets_df.author_id == user_id].copy()\n", "user_tweets['created_at'] = pd.to_datetime(user_tweets.created_at)\n", "user_tweets.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(10,5))\n", "\n", "ax[0].set_title(\"Retweet Count vs Tweet Date\")\n", "sns.lineplot(user_tweets, x='created_at', y='retweet_count', ax=ax[0])\n", "\n", "ax[1].set_title(\"Like Count vs Tweet Date\")\n", "sns.lineplot(user_tweets, x='created_at', y='like_count', ax=ax[1])\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(10,5))\n", "\n", "user_tweets['tweet_length'] = user_tweets['text'].apply(len)\n", "\n", "ax[0].set_title(\"Retweet Count vs Tweet Length\")\n", "sns.lineplot(user_tweets, x='tweet_length', y='retweet_count', ax=ax[0])\n", "\n", "ax[1].set_title(\"Like Count vs Tweet Length\")\n", "sns.lineplot(user_tweets, x='tweet_length', y='like_count', ax=ax[1])\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Has media\n", "sns.jointplot(user_tweets, x='has_media', y='retweet_count')\n", "\n", "plt.suptitle(\"# Retweets vs Tweet has media\")\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(user_tweets[['tweet_length', 'has_media', 'retweet_count', 'like_count']])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(2, 2, figsize=(10,5))\n", "\n", "user_tweets['tweet_length'] = user_tweets['text'].apply(len)\n", "\n", "ax[0][0].set_title(\"Retweet Count vs Date\")\n", "sns.lineplot(user_tweets, x='created_at', y='retweet_count', ax=ax[0][0])\n", "\n", "ax[0][1].set_title(\"Like Count vs Date\")\n", "sns.lineplot(user_tweets, x='created_at', y='like_count', ax=ax[0][1])\n", "\n", "ax[1][0].set_title(\"Has Media vs Date\")\n", 
"sns.scatterplot(user_tweets, x='created_at', y='has_media', ax=ax[1][0])\n", "\n", "ax[1][1].set_title(\"Tweet Length vs Date\")\n", "sns.scatterplot(user_tweets, x='created_at', y='tweet_length', ax=ax[1][1])\n", "\n", "plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### TODO: Analyze the change in tweet features depending on date (one row depending on date, other depending on retweet count to reflect the evolution)\n", "### TODO: Concentration on topics [group by topics for a sample user]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3 - Aggregate Analysis of all viral users tweets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.0 - How many tweets per user retrieved" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets_per_user = viral_tweets_df.groupby(by='author_id').size().reset_index(name='count')\n", "tweets_per_user.sort_values(by='count')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets_per_user.hist(column='count', bins=10)\n", "plt.title(\"Histogram of distribution of number of tweets retrieved per user\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.1 - Retweet count vs like count" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to come up with a metric for the **virality** of the tweet, we need to know which features we will use to determine this metric. *retweet_count* and *like_count* will surely be among those features selected. Let's how the two correlate." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE**: \"The retweet will not show the likes and replies, only retweet count. You need to get the counts from the original tweet, which would be referenced in referenced_tweets and included in includes.tweets part of the response.\" - Twitter Community" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Remove all tweets that might be retweets of others\n", "retweeted = viral_tweets_df.retweet_count !=0\n", "liked = viral_tweets_df.like_count !=0\n", "original_tweets_df = viral_tweets_df[retweeted & liked]\n", "\n", "# Remove NA in retweet and like count\n", "original_tweets_df = original_tweets_df.dropna(axis=0, subset=['retweet_count', 'like_count'])\n", "\n", "sns.scatterplot(data=original_tweets_df, x='retweet_count', y='like_count')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding**: We can see more or less a linear correlation. Especially for lower numbers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.2 - (# Retweets / # followers ) ratio \n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here a viable metric for a viral tweet can be the ratio between the retweets (or like) count over the followers count of the user. The idea here is that a user who doesn't have many followers, but has tweeted tweets that have garnered a lot of retweets or likes, can most definitely be considered \"viral\". On the other hand, a user who has many followers can have a standard high # retweets and those cannot be considered viral all the time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: Also note that historical data for the evolution of the # of followers of a user are not easily available and are not provided by the Twitter API. 
So the calculated ratios do not reflect the actual ratio at the time a tweet was posted, since the user may have gained many followers since then." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_tweets_df_subset = original_tweets_df[['id', 'author_id', 'retweet_count', 'like_count']]\n", "\n", "# Remove NA in follower count\n", "users_df_subset = users_df.dropna(axis=0, subset=['followers_count'])\n", "\n", "# Merge both on author id\n", "tweets_users_merged_df = viral_tweets_df_subset.merge(\n", "    right=users_df_subset[['id', 'followers_count']].set_index('id'), left_on='author_id', right_on='id')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tweets_users_merged_df['retweets_followers_ratio'] = tweets_users_merged_df['retweet_count'] / tweets_users_merged_df['followers_count']\n", "tweets_users_merged_df.sort_values(by='retweets_followers_ratio')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import plotly.express as px\n", "\n", "df_ratios_bigger_than_1 = tweets_users_merged_df[tweets_users_merged_df.retweets_followers_ratio > 1.0]\n", "fig = px.histogram(\n", "    df_ratios_bigger_than_1,\n", "    x=\"retweets_followers_ratio\",\n", "    nbins=10,\n", "    log_y=True)\n", "\n", "fig.update_layout(\n", "    title={\n", "        'text': \"Histogram of the distribution of the retweets/followers ratio > 1\",\n", "        'y':0.9,\n", "        'x':0.5,\n", "        'xanchor': 'center',\n", "        'yanchor': 'top'})\n", "\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The histogram is not very clear, since we have rare events where a tweet garnered enormous popularity relative to the popularity of its author. Those we can definitely consider viral. Maybe we can try K-means to better identify these outliers, as sketched below." 
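] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Caveat before clustering**: `retweet_count` and `followers_count` differ by orders of magnitude, so a distance-based method like K-means on the raw counts will be dominated by whichever feature is larger. A log transform followed by standardization is a reasonable precaution; the cell below is a minimal sketch under that assumption, reusing `df_ratios_bigger_than_1` from above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "# Log-scale both counts so neither dominates the Euclidean distance,\n", "# then standardize to zero mean / unit variance before clustering\n", "X_log = np.log1p(df_ratios_bigger_than_1[['retweet_count', 'followers_count']].to_numpy())\n", "X_scaled = StandardScaler().fit_transform(X_log)\n", "\n", "scaled_kmeans = KMeans(n_clusters=3, random_state=123).fit(X_scaled)\n", "np.bincount(scaled_kmeans.labels_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With scaled features, cluster boundaries reflect relative rather than absolute differences in popularity, which matches the ratio intuition above. For comparison, the next cells cluster the raw counts."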
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "\n", "n_clusters = 3\n", "X = np.array(df_ratios_bigger_than_1[['retweet_count', 'followers_count']])\n", "#X = np.vstack((df_ratios_bigger_than_1.retweet_count.to_numpy(), df_ratios_bigger_than_1.followers_count.to_numpy()))\n", "#X = df_ratios_bigger_than_1.retweets_followers_ratio.to_numpy().reshape(-1, 1)\n", "ratio_kmeans = KMeans(n_clusters=n_clusters, random_state=123).fit(X)\n", "\n", "#np.vstack((X[:, 0], X[:, 1], ratio_kmeans.labels_)).reshape(-1, 3)\n", "#px.scatter(ratio_kmeans, x=)\n", "'''\n", "plt.title(f'K-Means clustering of #retweets/#followers ratio with k={n_clusters}')\n", "plt.xlabel('Retweets')\n", "plt.ylabel('Followers')\n", "plt.scatter(X[:, 0], X[:, 1], c=ratio_kmeans.labels_)\n", "'''" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans_results_df = pd.DataFrame(X, columns=['retweet_count', 'follower_count']) \n", "kmeans_results_df['label'] = ratio_kmeans.labels_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "px.scatter(kmeans_results_df, x='follower_count', y='retweet_count', color='label')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.3 - Metric (# Retweets / avg #retweets of a user)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# avg_nb_retweets_per_user = viral_tweets_df_subset.groupby(by='author_id').agg({'retweet_count': ['min', 'mean', 'max'], 'like_count': ['min', 'mean', 'max']})\n", "avg_nb_retweets_per_user = viral_tweets_df_subset.groupby(by='author_id').retweet_count.agg(['min', 'mean', 'max'])\n", "avg_nb_retweets_per_user" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ratio_retweet_avg_retweets_df = viral_tweets_df_subset.merge(avg_nb_retweets_per_user, on='author_id')\n", "ratio_retweet_avg_retweets_df['per_user_performance'] = ratio_retweet_avg_retweets_df['retweet_count'] / ratio_retweet_avg_retweets_df['mean']\n", "ratio_retweet_avg_retweets_df" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bigger_than_mean = ratio_retweet_avg_retweets_df[ratio_retweet_avg_retweets_df.per_user_performance > 1]\n", "hist = px.histogram(bigger_than_mean, x='per_user_performance', log_y=True)\n", "\n", "hist.update_layout(title_text=\"Distribution of tweet performance wrt avg #retweets per user\", xaxis_title=\"Tweet performance\", yaxis_title=\"log count\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding**: We established another metric by which we can judge the virality of a tweet, namely the number of retweets vs the average number of retweets per user. We can set a threshold (e.g. > 2) to decide whether a tweet is viral or not. We can also conduct further analysis over those tweets to determine what sets them apart from the others." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.4 - Tweet Topic (context annotations)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What topics are available? Context annotations are Twitter's version of analyzing the topic of a tweet. They are defined as a context **domain** and **entity**. The domain is like a general topic and entity is like a subtopic or a specific topic within the general domain." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json \n", "\n", "tweets_with_topics = original_tweets_df.dropna(axis=0, subset='context_annotations')\n", "\n", "def topic_to_json(x):\n", " try:\n", " return json.loads(x.replace('\\'', '\"'))\n", " except json.JSONDecodeError:\n", " print(\"Nope\")\n", " return []" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO tomorrow:\n", "- Try sample and make it work with context annotations.\n", "- Check if has media is not null\n", "- hashtags extract tags\n", "- Extract context annotations\n", "- Use Celia Bearer Token" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from tweepy import Paginator, TooManyRequests\n", "client = twitter_client_wrapper.client\n", "#tweet_data = twitter_client_wrapper.client.get_users_tweets(id='1584975692126900225', expansions=EXPANSIONS, user_fields=USER_FIELDS, tweet_fields=TWEET_FIELDS, media_fields=MEDIA_FIELDS, exclude='retweets')\n", "\n", "viral_users_tweets = []\n", "# Number of users processed so far\n", "try:\n", " for tweet in Paginator(client.get_users_tweets, id='1482846121517096961', tweet_fields=TWEET_FIELDS, exclude=\"retweets\").flatten(limit=20):\n", " viral_users_tweets.append(tweet.data)\n", "except TooManyRequests:\n", " print(\"Hit Rate Limit\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domains = {}\n", "entities = {}\n", "for tweet in viral_users_tweets:\n", " context_annotations = tweet.get('context_annotations', [])\n", " tweet_topic_domains = dict([(topic['domain']['id'], topic['domain']) for topic in context_annotations])\n", " domains.update(tweet_topic_domains)\n", " tweet_topic_entities = dict([(topic['entity']['id'], topic['entity']) for topic in context_annotations])\n", " entities.update(tweet_topic_entities)\n", " tweet['topic_domain'] = list(tweet_topic_domains.keys())\n", " tweet['topic_entity'] = list(tweet_topic_entities.keys())\n", " tweet.pop('context_annotations', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "with open('topic_domains.pickle', 'wb') as handle:\n", " pickle.dump(entities, handle, protocol=pickle.HIGHEST_PROTOCOL)\n", "\n", "with open('topic_domains.pickle', 'rb') as handle:\n", " b = pickle.load(handle)\n", "\n", "b" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "try:\n", " with open('topic_domains.pickle', 'rb') as handle:\n", " topic_domains = pickle.load(handle)\n", "except FileNotFoundError:\n", " topic_domains = {}\n", "\n", "topic_domains" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp = pd.json_normalize(viral_users_tweets)\n", "#temp[temp.context_annotations.notna()]\n", "temp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domains" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "s = pd.Series([b[item]['name'] for items in temp.topic_domain.values for item in items])\n", "s.groupby(s).count().sort_values()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "viral_users_tweets_2 = []\n", "# Number of users processed so far\n", "try:\n", " for tweet in Paginator(client.get_users_tweets, id='848263392943058944', tweet_fields=TWEET_FIELDS, 
exclude=\"retweets\").flatten(limit=100):\n", " viral_users_tweets_2.append(tweet.data)\n", "except TooManyRequests:\n", " print(\"Hit Rate Limit\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "domains = {}\n", "entities = {}\n", "for tweet in viral_users_tweets_2:\n", " context_annotations = tweet.get('context_annotations', [])\n", " tweet_topic_domains = dict([(topic['domain']['id'], topic['domain']) for topic in context_annotations])\n", " domains.update(tweet_topic_domains)\n", " tweet_topic_entities = dict([(topic['entity']['id'], topic['entity']) for topic in context_annotations])\n", " entities.update(tweet_topic_entities)\n", " tweet['topic_domain'] = list(tweet_topic_domains.keys()) if len(tweet_topic_domains.keys()) > 0 else pd.NA\n", " tweet['topic_entity'] = list(tweet_topic_entities.keys()) if len(tweet_topic_entities.keys()) > 0 else pd.NA\n", " #tweet.pop('context_annotations', None)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp2_df = pd.json_normalize(viral_users_tweets_2)\n", "first_context = temp2_df[~temp2_df.topic_domain.isna()].topic_domain.iloc[2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp2_df[~temp2_df['entities.hashtags'].isna()]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "temp2_df.to_csv(\"temp.csv\", index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import ast\n", "\n", "temp2_read = pd.read_csv('temp.csv', converters={'context_annotations': lambda x: eval(x) if (x and len(x) > 0) else np.nan})\n", "first_context = temp2_read[~temp2_read.context_annotations.isna()].context_annotations.iloc[2]\n", "first_context" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "eval(first_context)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def format_context_annotations(context_annotations):\n", " if (pd.isna(context_annotations)):\n", " return []\n", " else:\n", " return json.loads(context_annotations)\n", "\n", "temp2_df.context_annotations.apply(format_context_annotations)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame(viral_users_tweets_2, columns=TWEET_FIELDS).to_csv('temp_2.csv', index=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#tweet_data = twitter_client_wrapper.client.get_tweet(id='1584975692126900225', expansions=EXPANSIONS, user_fields=USER_FIELDS, tweet_fields=TWEET_FIELDS, media_fields=MEDIA_FIELDS)\n", "bytes(tweets_with_topics.iloc[1000].context_annotations, encoding='utf-8').decode('unicode_escape')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'46'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dtypes={\"id\": str, \"author_id\": str, \"has_media\": bool, \"possibly_sensitive\": bool, \"has_hashtags\": bool}\n", "temp3 = pd.read_csv(\"145371604-to-146944733.csv\", dtype=dtypes)\n", "d = temp3[~temp3.topic_domains.isna()].topic_domains.iloc[0]\n", "eval(d)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.5 - Tweet Sentiment" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "#### 3.6 - Possibly sensitive" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.7 - Hashtags" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# TODO: has hashtags (using entities.hashtags)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 3.8 - Text preprocessing" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "TODO:\n", "- Sort by tweet date (check popularity)\n", "- Use Twitter lists to try and find\n", "- Check if reply or retweet" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.11 ('ada')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "71d2f77bccee14ca7852d7b7a1fa8ea4708b81087104d93973081337557f0ee6" } } }, "nbformat": 4, "nbformat_minor": 2 }