{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Viral Tweets: User exploration" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we explore the users who posted viral tweets, analyzing the viral tweets from the user's point of view. For example, we compare a user's popularity with the popularity of their tweets, and we look through each user's tweet history for marked changes in tweet features around the time a tweet went viral." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 0 - Setup" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import numpy as np\n", "\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "\n", "from tqdm import tqdm\n", "\n", "#pd.set_option('display.max_rows', None)\n", "pd.set_option('display.max_columns', None)\n", "\n", "DATA_PATH = \"../../data\"\n", "VIRAL_TWEETS_PATH = f\"{DATA_PATH}/viral_users\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from helper.twitter_client_wrapper import TwitterClientWrapper, EXPANSIONS, MEDIA_FIELDS, TWEET_FIELDS, USER_FIELDS\n", "\n", "twitter_client_wrapper = TwitterClientWrapper(\"../../api_key.yaml\", wait_on_rate_limit=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1 - Retrieve the data from disk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Retrieve the viral tweets data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note**: You may notice that not all tweets have been retrieved, since some may have been deleted after they were scraped." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note 2**: Also keep in mind that the number of users retrieved may be smaller than the number of viral tweets, because a user may have two or more viral tweets in our sample. 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# dtypes={\"id\": str, \"author_id\": str, \"has_media\": bool, \"possibly_sensitive\": bool}\n", "dtypes={\"id\": str, \"author_id\": str}" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\steph\\AppData\\Local\\Temp\\ipykernel_18728\\1524257405.py:2: DtypeWarning: Columns (3,8,14,17,18,19,20,21,22,23,24) have mixed types. Specify dtype option on import or set low_memory=False.\n", " viral_tweets_df = pd.read_csv(f\"{VIRAL_TWEETS_PATH}/all_tweets.csv\", dtype=dtypes, escapechar='\\\\', encoding='utf-8')\n" ] }, { "data": { "text/html": [ "
| | created_at | author_id | text | possibly_sensitive | edit_history_tweet_ids | lang | id | mentions | retweet_count | reply_count | like_count | quote_count | context_annotations | urls | has_media | annotations | hashtags | attachments.poll_ids | withheld.copyright | withheld.country_codes | withheld.scope | cashtags | geo.place_id | geo.coordinates.type | geo.coordinates.coordinates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-10-31T03:21:11.000Z | 1047733077898739712 | @manjirosx you too jiro🫶🏽 | False | ['1586921195059834880'] | en | 1586921195059834880 | [{'start': 0, 'end': 10, 'username': 'manjiros... | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2022-10-31T03:13:57.000Z | 1047733077898739712 | @ilyicey u omd | False | ['1586919376086704129'] | nl | 1586919376086704129 | [{'start': 0, 'end': 8, 'username': 'ilyicey',... | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2022-10-31T03:13:24.000Z | 1047733077898739712 | @ilyicey i’m fine | False | ['1586919239243296768'] | en | 1586919239243296768 | [{'start': 0, 'end': 8, 'username': 'ilyicey',... | 1.0 | 1.0 | 2.0 | 0.0 | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2022-10-30T22:49:53.000Z | 1047733077898739712 | @imVolo_ I’ll unfollow rn | False | ['1586852923706732544'] | en | 1586852923706732544 | [{'start': 0, 'end': 8, 'username': 'imVolo_',... | 0.0 | 0.0 | 3.0 | 0.0 | NaN | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2022-10-30T22:45:33.000Z | 1047733077898739712 | “what do you want to be for halloween?” his li... | False | ['1586851830767591424'] | en | 1586851830767591424 | NaN | 611.0 | 19.0 | 4132.0 | 55.0 | [{'domain': {'id': '29', 'name': 'Events [Enti... | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |