birdspotter: A toolkit for analyzing and labelling Twitter users

Mar 9, 2021 Software, Post

Social media platforms, although relatively new, host millions of users and billions of interactions daily. As tied as we are to these platforms, they profoundly impact our social institutions through phenomena such as disinformation, political polarization, and social bots.

Researchers are increasingly interested in understanding these phenomena and their implications. Social scientists, political scientists, and data practitioners alike curate expansive datasets to study and combat these potentially adverse effects on our society; however, they lack the appropriate tooling.

birdspotter is an easy-to-use tool that models Twitter users’ attributes and labels them. It comes prepackaged with a state-of-the-art bot detector and an influence quantification system based on tweet dynamics. birdspotter features a generalized user labeler, which can be retrained easily with the engineered features to address a variety of use cases. Also, birdspotter.ml is a web application that can be utilized to explore datasets and derive a narrative around a dataset.

In this post, I'll showcase the basic usage of birdspotter and birdspotter.ml.

Installation

The package can be installed in the canonical Python way:

pip install birdspotter

Getting a dataset

The Twitter T&Cs restrict the sharing of tweet data directly online; however, they do allow the sharing of tweet-ids, which can be converted to full tweet data through a process called hydration. Tools like twarc can be used to hydrate a Tweet ID dataset. The resulting dataset will be in jsonl (line delimited json) format, which birdspotter accepts directly.
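To make the jsonl format concrete, here is a minimal sketch of reading such a file by hand: each line is one standalone JSON object describing a tweet. The helper names `iter_tweets` and `count_users` are hypothetical, and the `user` / `id_str` fields assume the classic (v1.1-style) tweet layout; birdspotter does its own parsing, so this is only for inspecting a file.

```python
import json
from typing import Dict, Iterator


def iter_tweets(path: str) -> Iterator[Dict]:
    """Yield one tweet (a dict) per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)


def count_users(path: str) -> int:
    """Count the distinct authors in a hydrated jsonl file.

    Assumes each tweet carries its author under tweet["user"]["id_str"].
    """
    return len({tweet["user"]["id_str"] for tweet in iter_tweets(path)})
```

Because the file is read line by line, this approach works even when the hydrated dataset is too large to load into memory at once.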

In the examples below, we use two datasets: a collection of COVID-19 related tweets from January 31st, 2020 [1], and a collection of tweets about politicians on Twitter [2].

The politicians’ dataset was acquired through the following process (and a similar process was taken for the COVID-19 dataset):

pip install twarc
wget http://twitterpoliticians.org./downloads/base/all_tweet_ids.csv
twarc hydrate all_tweet_ids.csv > tweets.jsonl

Basic Usage

The code below imports the main class BirdSpotter, extracts the tweets from their standard format, labels the users with the default bot detector and influence quantification system, and reformats the retweet cascades into a tidier format.

## Import birdspotter
from birdspotter import BirdSpotter 

## Extracts the tweets from the raw jsonl [https://github.com/echen102/COVID-19-TweetIDs]
bs = BirdSpotter('covid19.jsonl') 

## Uses the default bot labeller and influence quantification systems
bs.getLabeledUsers() 

## Formats the retweet cascades, such that expected retweet structures can be extracted
bs.getCascadesDataFrame() 

## Access the botness labels and influence scores
bs.featureDataframe[['botness', 'influence']]

From here, the dataset is readily profile-able:

import seaborn as sns
botness_dist = sns.histplot(data=bs.featureDataframe, x="botness")
influence_eccdf = sns.ecdfplot(data=bs.featureDataframe, x="influence", complementary=True).set(xscale="log", yscale="log")

COVID Dataset Profile: (Left) The distribution of bot scores of users; (Right) The ECCDF of influence scores of users, showing a long-tailed (rich-gets-richer) paradigm
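For readers unfamiliar with the ECCDF (empirical complementary cumulative distribution function), it gives, for each value x, the share of users whose score is strictly greater than x. The sketch below computes it by hand on made-up numbers; the function `eccdf` and the `toy_influence` values are illustrative assumptions, not part of birdspotter or its output.

```python
from typing import List, Sequence, Tuple


def eccdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (x, share of values strictly greater than x) for each distinct x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for index, x in enumerate(ordered):
        # Emit a point only at the last occurrence of each distinct value
        if index + 1 == n or ordered[index + 1] != x:
            points.append((x, (n - index - 1) / n))
    return points


# Made-up influence scores: most users are small, one is very large
toy_influence = [1, 1, 1, 2, 2, 3, 5, 40]
for x, share in eccdf(toy_influence):
    print(f"share of users with influence > {x}: {share:.3f}")
```

On log-log axes, a long-tailed distribution like this one decays slowly and roughly linearly, which is the pattern described in the caption above.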

The visualizer

An alternative way to profile a dataset is to use birdspotter.ml, which facilitates dataset exploration and narrative construction.

birdspotter.ml visualizer: The various components shown include the scatterplot panel (Left), the user information panel (Top Right), and the retweet cascades panel (Bottom Right)

The visualizer features a scatterplot (on the left) of influence and botness for a sample of users and the population density. The colors represent the hashtags (a proxy for the topic) that the users most tweet about in the dataset. Users within the scatterplot are hoverable and selectable, and their information populates in the components on the right.

The top right component shows information and metrics about the selected user and links to the user's profile.

The bottom right component shows the retweet cascades where a user has participated and highlights their participation. The points represent the follower counts (social capital) of users and their retweets/tweets’ timing. The points are also hoverable and selectable.
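As a rough illustration of the data behind such a panel, the sketch below groups tweets into cascades and records, for each event, its delay from the start of the cascade and the author's follower count. It is not birdspotter's implementation: the function `build_cascades` is hypothetical, and the field names (`retweeted_status`, `created_at`, `followers_count`) assume the classic v1.1-style tweet layout.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, Tuple

# Timestamp layout assumed for "created_at", e.g. "Fri Jan 31 10:00:00 +0000 2020"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def build_cascades(tweets: List[dict]) -> Dict[str, List[Tuple[float, int]]]:
    """Group tweets by original tweet id.

    Each cascade is a time-ordered list of
    (seconds since the first event, follower count of the author).
    """
    events = defaultdict(list)
    for tweet in tweets:
        # A retweet embeds the original tweet; otherwise the tweet is its own root
        root_id = tweet.get("retweeted_status", {}).get("id_str", tweet["id_str"])
        posted = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
        events[root_id].append((posted, tweet["user"]["followers_count"]))

    cascades = {}
    for root_id, items in events.items():
        items.sort(key=lambda item: item[0])  # order events by time
        start = items[0][0]
        cascades[root_id] = [
            ((posted - start).total_seconds(), followers)
            for posted, followers in items
        ]
    return cascades


# Made-up tweets: one original and two retweets of it
toy_tweets = [
    {"id_str": "1", "created_at": "Fri Jan 31 10:00:00 +0000 2020",
     "user": {"followers_count": 1200}},
    {"id_str": "2", "created_at": "Fri Jan 31 10:05:00 +0000 2020",
     "user": {"followers_count": 80}, "retweeted_status": {"id_str": "1"}},
    {"id_str": "3", "created_at": "Fri Jan 31 10:01:00 +0000 2020",
     "user": {"followers_count": 15}, "retweeted_status": {"id_str": "1"}},
]
toy_cascades = build_cascades(toy_tweets)
```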

Customising the labeller

By default, the labeler is trained as a bot detection system, comparable to the state-of-the-art Botometer. Notably, birdspotter is provided as an offline package and can be applied at scale, while Botometer is accessible only via an online API, which is often prohibitively rate-limited.

birdspotter is a versatile tool and can be utilized by practitioners for a variety of use cases. For example, we could train the labeler to identify political leaning. This process is a bit involved, so we summarise it below:

We hydrate some tweets from the Twitter Parliamentarian Database.
We filter the tweets to include only Australian politicians.
We label politicians from right-wing parties positively, and others negatively (with bs_pol.getBotAnnotationTemplate, for example).
We retrain birdspotter with these new labels and label all users (i.e., including users the politicians retweeted) using the new model.

bs_pol = BirdSpotter('aus_tweets.jsonl')
bs_pol.trainClassifierModel('pol_training_data.pickle')
bs_pol.getLabeledUsers()

On this limited dataset of Australian politicians, a 10-fold cross-validation of birdspotter achieves an average AUC (area under the ROC curve) of 0.986.
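For intuition about the metric, AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition on made-up labels and scores; `roc_auc` is an illustrative helper, not the evaluation code used to obtain the figure above.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    correct = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                correct += 1.0
            elif pos_score == neg_score:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Made-up example: 1 = right-leaning, 0 = other; scores are model outputs
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
```

In a 10-fold cross-validation, a value like this is computed on each held-out fold and the ten values are averaged. The pairwise loop is quadratic, so it suits small illustrations rather than large datasets.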

Conclusion

birdspotter aims to democratize social analyses that were once the domain of machine learning experts, generating insights and understanding of online phenomena and mitigating their potentially adverse effects on our society. This post shows how birdspotter can be used in both a simple and an advanced way to recover such insights.

References

[1] Chen, E. et al. 2020. Tracking social media discourse about the covid-19 pandemic: Development of a public coronavirus twitter data set. JMIR Public Health and Surveillance. 6, 2 (2020), e19273.

[2] Vliet, L. van et al. 2020. The twitter parliamentarian database: Analyzing twitter politics across 26 countries. PloS one. 15, 9 (2020), e0237073.
