MIDS Capstone Project Spring 2019

Rumor Detection On Twitter

Visit our marketing site for further explanation: https://rumordetect.home.blog

Our platform at www.rumordetect.com lets people sign in with Twitter, search topics from other users, or submit their own detection tests. You can also tag @RumorDetect on any tweet, and our platform will analyze it, output a confidence score, and display the tweet's information.

Data Pipeline

https://rumordetect.home.blog/data-pipeline/

Our web app is built with Ruby on Rails, a framework that combines a PostgreSQL database, HTML, CSS, jQuery, and delayed jobs in a classic model-view-controller architecture. The web app writes each searchable tweet to an S3 bucket as raw JSON. Our prediction pipeline then picks up the tweet, computes the propagation features described below using Twitter's API, and stores the results in an S3 bucket. Finally, our pretrained model formats that data and makes a prediction, storing the result in an S3 bucket that the web app polls so it can show the result for that tweet to the end user.
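The handoff described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: an in-memory dictionary stands in for the S3 buckets, and the key prefixes (`raw/`, `features/`, `predictions/`), the function names, and the single `retweet_count` feature are all hypothetical.

```python
import json
import time


class InMemoryStore:
    """Stand-in for an S3 bucket (assumption): maps object keys to JSON strings."""

    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        self._objects[key] = body

    def get(self, key):
        # Returns None when the object has not been written yet.
        return self._objects.get(key)


def submit_tweet(store, tweet):
    """Web app step: store the raw tweet JSON under a hypothetical 'raw/' prefix."""
    key = f"raw/{tweet['id']}.json"
    store.put(key, json.dumps(tweet))
    return key


def extract_features(store, tweet_id):
    """Pipeline step: read the raw tweet and write a (toy) feature record."""
    tweet = json.loads(store.get(f"raw/{tweet_id}.json"))
    features = {
        "tweet_id": tweet_id,
        "retweet_count": len(tweet.get("retweets", [])),
    }
    store.put(f"features/{tweet_id}.json", json.dumps(features))


def predict(store, tweet_id, model):
    """Model step: score the features and write the prediction record."""
    features = json.loads(store.get(f"features/{tweet_id}.json"))
    result = {"tweet_id": tweet_id, "confidence": model(features)}
    store.put(f"predictions/{tweet_id}.json", json.dumps(result))


def poll_for_result(store, tweet_id, attempts=5, delay=0.0):
    """Web app step: poll until a prediction appears or attempts run out."""
    for _ in range(attempts):
        body = store.get(f"predictions/{tweet_id}.json")
        if body is not None:
            return json.loads(body)
        time.sleep(delay)
    return None
```

Each stage only reads from and writes to the store, so the web app and the prediction pipeline never call each other directly. That decoupling is the reason the web app has to poll for the result.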

Feature Engineering

https://rumordetect.home.blog/feature-engineering/

Our method relies on network propagation dynamics determined by retweet attribution.

Our tweet propagation time series data set is created in two steps:

  1. Build an individual tweet data set describing the original tweet, which carries the claim being evaluated, and each of its retweets.
  2. Build a time series by aggregating all individual tweets up to the hour being sampled and adding propagation features.

At each sampling point (t) we build our propagation time series with:

  1. aggregates of tweet level features across the tweets thus far,
  2. values of these features for the most recent tweet,
  3. differences between current and past values (e.g. number of new tweets),
  4. diffusion tree shape summaries (e.g. total number of parents)
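The two steps and the four feature groups above can be illustrated with a small Python sketch. It is written under stated assumptions rather than taken from the project: the field names (`hour`, `followers`, `parent_id`), the choice of follower count as the tweet-level feature, and the function name `build_time_series` are all hypothetical.

```python
def build_time_series(tweets, sample_hours):
    """Build one row of propagation features per sampling point.

    `tweets` is a list of dicts with hypothetical fields:
      id, parent_id (None for the original tweet),
      hour (hours since the original tweet), followers.
    `sample_hours` must be in ascending order.
    """
    ordered = sorted(tweets, key=lambda tw: tw["hour"])
    rows = []
    prev_count = 0
    for t in sample_hours:
        # All tweets posted up to and including sampling point t.
        seen = [tw for tw in ordered if tw["hour"] <= t]
        if not seen:
            continue
        # Distinct tweets that have been retweeted so far.
        parents = {tw["parent_id"] for tw in seen if tw["parent_id"] is not None}
        rows.append({
            "hour": t,
            # 1. aggregates across the tweets thus far
            "tweet_count": len(seen),
            "mean_followers": sum(tw["followers"] for tw in seen) / len(seen),
            # 2. value for the most recent tweet
            "latest_followers": seen[-1]["followers"],
            # 3. difference between current and past values
            "new_tweets": len(seen) - prev_count,
            # 4. diffusion tree shape summary
            "parent_count": len(parents),
        })
        prev_count = len(seen)
    return rows


# Toy cascade: one original tweet and three retweets.
example_tweets = [
    {"id": 1, "parent_id": None, "hour": 0.0, "followers": 100},
    {"id": 2, "parent_id": 1, "hour": 0.5, "followers": 50},
    {"id": 3, "parent_id": 1, "hour": 1.5, "followers": 30},
    {"id": 4, "parent_id": 2, "hour": 1.8, "followers": 20},
]
series = build_time_series(example_tweets, sample_hours=[1, 2])
```

In this toy cascade, the row for hour 2 counts four tweets, two of them new since hour 1, and two distinct parents (tweets 1 and 2).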

Model

https://rumordetect.home.blog/models/

Current attempts at real-time rumor detection have leveraged both spatial and temporal features. The approaches that are simple to implement have turned out to have low predictive power, and the ones that work are typically too complicated to implement for practical use.

Models

Model             Accuracy  Precision  Recall  F1 Score
HMM               0.46      -          -       -
LSTM              0.50      -          -       -
Rand. For         0.75      0.71       0.86    0.77
Linear Reg.       0.80      0.79       0.80    0.79
Grad. Boost Reg.  0.82      0.80       0.86    0.83

Last updated: October 1, 2019