How to Scrape Tweets of Twitter Users (Handles) and Get Around Twitter API Limits

Objective: Get the entire tweet history of particular Twitter users from their handles, without running into the limits imposed by the Twitter API.

In the previous step I compiled a list of the Twitter handles for all colleges in our sample. The next step is to obtain every tweet posted by each handle. This is tricky because the Twitter API caps how many tweets you can pull per account (the standard user-timeline endpoint returns only about the 3,200 most recent). Thankfully, the developers of the twint Python package have figured out how to get around these limits by scraping Twitter directly instead of using the API at all!

All tweets are written to .csv files, one for each college's main handle and one for its admissions handle.

Set Up

In [1]:
import pandas as pd
import numpy as np
import os, requests, re, time
In [2]:
import twint

File Locations

In [3]:
base_path = r"C:\Users\laure\Dropbox\!research\20181026_ihe_diversity"
tw_path = os.path.join(base_path,'data','twitter_handles')

Import handles

In [ ]:
ihe_handles = pd.read_pickle(os.path.join(tw_path, "tw_df_final"))
In [9]:
ihe_handles[ihe_handles.adm_handle == ''].head(2)
Out[9]:
                                  instnm main_handle adm_handle
unitid
154022                Ashford University    AshfordU
133951  Florida International University         fiu

Scrape All Tweets from Username with Twint

In [5]:
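def scrape_tweets(username, csv_name):
    """Scrape the tweets of one Twitter handle with twint and save them as CSV.

    csv_name is joined onto tw_path; an absolute path is used unchanged.
    """
    # Configure the scrape
    c = twint.Config()
    c.Username = username
    # Columns to keep in the output file
    c.Custom = ['id', 'date', 'time', 'timezone', 'user_id', 'username', 'tweet', 'replies',
                'retweets', 'likes', 'hashtags', 'link', 'retweet', 'user_rt', 'mentions']
    c.Store_csv = True
    c.Output = os.path.join(tw_path, csv_name)

    # Start the scrape of the user's profile timeline
    twint.run.Profile(c)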
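
Once a scrape finishes, a quick sanity check on the CSV helps confirm that the file holds what was expected. The sketch below uses only the standard library. `summarize_tweet_csv` is a hypothetical helper, not part of twint, and it assumes the file is comma-delimited with a header row and a `date` column in `YYYY-MM-DD` format, as the `c.Custom` list above suggests; adjust it if your output differs.

```python
import csv


def summarize_tweet_csv(path, date_col="date"):
    """Return (row_count, earliest_date, latest_date) for a tweet CSV.

    Hypothetical helper: assumes a header row and ISO-formatted dates
    (YYYY-MM-DD), so plain string comparison orders them correctly.
    """
    n_rows, earliest, latest = 0, None, None
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            n_rows += 1
            d = row.get(date_col)
            if not d:
                continue  # skip rows without a date value
            if earliest is None or d < earliest:
                earliest = d
            if latest is None or d > latest:
                latest = d
    return n_rows, earliest, latest
```

If the earliest date is much later than the date the account was created, the scrape probably did not reach the full history and is worth rerunning or investigating.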
In [ ]:
for index, row in ihe_handles.iterrows():
    
    # CSV file names (scrape_tweets joins these onto tw_path)
    main_f_name = str(index) + '_main.csv'
    adm_f_name = str(index) + '_adm.csv'
    
    # Handles
    main_handle = row['main_handle']
    adm_handle = row['adm_handle']
    
    if main_handle != '':
        scrape_tweets(main_handle, main_f_name)
    
    if adm_handle != '':
        scrape_tweets(adm_handle, adm_f_name)
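
Scraping hundreds of handles takes a long time, and the loop above starts over from the first college if it is interrupted. One way to make it resumable is to skip any handle whose output file already exists. The sketch below is a minimal version of that idea, not part of the original notebook: `pending_scrapes` is a hypothetical helper that takes plain `(unitid, main_handle, adm_handle)` tuples and an output directory, and uses the same `_main.csv` / `_adm.csv` naming as the loop.

```python
import os


def pending_scrapes(handle_rows, out_dir):
    """Yield (handle, csv_path) pairs that still need to be scraped.

    handle_rows: iterable of (unitid, main_handle, adm_handle) tuples.
    A pair is skipped if the handle is empty or its CSV already exists.
    """
    for unitid, main_handle, adm_handle in handle_rows:
        for handle, suffix in ((main_handle, "_main.csv"), (adm_handle, "_adm.csv")):
            csv_path = os.path.join(out_dir, str(unitid) + suffix)
            if handle and not os.path.exists(csv_path):
                yield handle, csv_path
```

With the DataFrame used here, the tuples can come from `ihe_handles[['main_handle', 'adm_handle']].itertuples(name=None)`, and each yielded pair can be passed straight to `scrape_tweets(handle, csv_path)`. Note that this only checks whether a file exists: a scrape that was cut off midway leaves a partial file that will be skipped, so delete the last file written before resuming.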