Yesterday, The New York Times published a very nice analysis of Twitter's bot market. Part of that analysis involves plotting an account's followers in order of arrival (the most recent last) on the x axis against their account creation dates on the y axis. If abnormal patterns emerge, that can be an indication of foul play.

Let's take a look at this example, @marthalanefox's account:

[Plot: @marthalanefox's followers versus their join dates]

Those horizontal streaks mark massive following events carried out by bots. They stand out very clearly against the boring pattern of organic growth.

So I started wondering: how hard would it be to replicate that kind of plot with as little code as possible? It turns out to be fairly easy.

First of all, here's all the code. It consists of two files:

  • get_data.py: obtains data through Twitter's API, saves it to a local file and, at the same time, builds a cache so we don't have to look up the same follower twice.
  • plot_data.R: what the name says.

Here's the code for get_data.py:

import twitter
import sys
import os
import csv
import gzip
import pickle
from config import *

def get_follower_list(username):
    # Return a list of user_ids that follow this username. Makes as many calls
    # to the Twitter API as necessary (returns the full list of followers, not
    # just the first page of results).

    api = twitter.Api(consumer_key = consumer_key,
                      consumer_secret = consumer_secret,
                      access_token_key = access_token_key,
                      access_token_secret = access_token_secret,
                      sleep_on_rate_limit = True)

    return api.GetFollowerIDs(screen_name = username)

def get_follower_info(user_id_list):
    # Return the account creation time and tweet count for every user_id in
    # user_id_list. The return format is a list of lists:
    # [[user_id1, date1, n_tweets1], [user_id2, date2, n_tweets2], ...].

    api = twitter.Api(consumer_key = consumer_key,
                      consumer_secret = consumer_secret,
                      access_token_key = access_token_key,
                      access_token_secret = access_token_secret,
                      sleep_on_rate_limit = True)

    # Need to do this manually in chunks of 100 users; the API wrapper doesn't
    # split the list. See this issue: https://github.com/bear/python-twitter/issues/523
    csize = 100
    res = []
    user_id_list_chunks = [user_id_list[i:i + csize]
                           for i in range(0, len(user_id_list), csize)]

    for chunk in user_id_list_chunks:
        partial = api.UsersLookup(chunk)
        partial = [[user.id, user.created_at, user.statuses_count]
                   for user in partial]
        res += partial

    return res


if __name__ == "__main__":

    username = sys.argv[1]
    print("Will get followers for {}".format(username))

    flist = get_follower_list(username)
    print("Obtained {} followers".format(len(flist)))

    # Use a cache for this. Different accounts will have many followers in
    # common, so there's no need to keep fetching information for those: keep
    # a local dictionary (and save it every time).
    if os.path.isfile("users_cache.pickle"):
        with open("users_cache.pickle", "rb") as f_in:
            cache = pickle.load(f_in)
    else:
        cache = dict()

    need_info_list = [str(user_id) for user_id in flist if user_id not in cache]
    if len(need_info_list) > 10:  # put a lower limit on this
        print("Need to get information for {} followers, rest in cache"
              .format(len(need_info_list)))

        new_users = get_follower_info(need_info_list)

        for new_user in new_users:
            cache[new_user[0]] = (new_user[1], new_user[2])
    else:
        print("Have all the info in cache already")

    # Save the dictionary after it's been updated
    with open("users_cache.pickle", "wb") as f_out:
        pickle.dump(cache, f_out)

    # Now simply save the CSV file for the requested username. Remember that
    # the original `flist` array is ordered (the most recent follower comes
    # first), so we can use that to add a column with the proper ordering.
    # Also, apparently some user_ids are not returned; remove them from the
    # original list.
    flist = [user_id for user_id in flist if user_id in cache]
    n_order = len(flist)
    with gzip.open("{}.csv.gz".format(username), "wt", newline="") as f_out:
        writer = csv.writer(f_out)
        writer.writerow(['id', 'created_at', 'statuses_count', 'order'])
        for user_id in flist:
            dd = cache[user_id]
            writer.writerow([user_id, dd[0], dd[1], n_order])
            n_order -= 1

You'll see it's fairly simple. There are only two main functions: get_follower_list is just a single call to the API wrapper that returns a list with all the user_ids following a given account. get_follower_info receives that list as input and, for every user_id, also returns the account creation date and the number of tweets (which might be useful for something else, but can be ignored here).
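The only slightly fiddly part of get_follower_info is splitting the id list into chunks of 100, the per-call limit of the users/lookup endpoint. That's just plain list slicing; here's the same pattern in isolation, with made-up ids:

```python
# Split a list of ids into chunks of at most 100; the last chunk simply
# holds whatever is left over.
ids = list(range(250))  # made-up ids for illustration
csize = 100
chunks = [ids[i:i + csize] for i in range(0, len(ids), csize)]
print([len(c) for c in chunks])  # [100, 100, 50]
```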

Data gathering is a very slow process due to Twitter's API rate limits. Since I knew I would end up inspecting several different accounts, I built a very simple caching system into the main part of the code: every time we obtain the list of followers for a given account, we store the relevant information in a local dictionary. If there is overlap between the followers of two accounts (and there will be), we only have to fetch the creation dates of the ones we don't already have.
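The cache is just the usual load-update-save cycle around a pickled dictionary. A minimal sketch of that pattern (the function names here are mine, not part of the script above):

```python
import os
import pickle

CACHE_PATH = "users_cache.pickle"  # same file name the script uses

def load_cache(path=CACHE_PATH):
    # Return the {user_id: (created_at, statuses_count)} dictionary,
    # or an empty one if no cache file exists yet.
    if os.path.isfile(path):
        with open(path, "rb") as f_in:
            return pickle.load(f_in)
    return {}

def save_cache(cache, path=CACHE_PATH):
    # Persist the whole dictionary; pickle handles the serialization.
    with open(path, "wb") as f_out:
        pickle.dump(cache, f_out)
```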

The plotting code is even simpler:

library(tidyverse)
theme_set(theme_bw(12))
library(lubridate)

username <- 'marthalanefox'

dd <- read_csv(sprintf("%s.csv.gz", username))

Sys.setlocale("LC_TIME", "C")
dd$created_at <- parse_date_time(dd$created_at,
                                 orders = c("%a %b %d %H:%M:%S %z %Y"))

plt1 <- ggplot(dd) +
    geom_point(aes(x = order, y = created_at), 
               color = "blue", alpha = 0.1, size = 0.1) +
    xlab(sprintf("@%s's followers", username)) +
    ylab("Join date") +
    scale_y_datetime(date_breaks = "1 year", date_labels = "%Y")
plot(plt1)

That's it! We basically load the data, parse the dates (I could have done that in the Python part, but I forgot) and call ggplot's wonderful magic. Here's how it looks:

[Plot: the R and Python version]

Funnily enough, it looks like there is another bot wave joining the ranks right after the date when the NYT's analysis stopped gathering data.
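As an aside, the date parsing I forgot to do on the Python side would have been a one-liner with the standard library. Twitter's created_at strings come in a fixed format, so strptime handles them directly (the timestamp below is made up for illustration):

```python
from datetime import datetime

# Twitter's created_at strings look like "Wed Aug 27 13:08:45 +0000 2008";
# the %z directive parses the UTC offset directly in Python 3.
fmt = "%a %b %d %H:%M:%S %z %Y"
ts = datetime.strptime("Wed Aug 27 13:08:45 +0000 2008", fmt)
print(ts.isoformat())  # 2008-08-27T13:08:45+00:00
```

Doing this before writing the CSV would have made the R side's parse_date_time call unnecessary.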