Looking for a Needle in a Haystack: Predicting Wikipedia Edits
This project aims to predict new edits on Wikipedia, a widely used public encyclopedia. If it were possible to predict which articles are likely to be edited soon, Wikipedia could notify readers that new information may be coming. Separately, well-predicted near-future edits could serve as a feature for other models (those predicting, for example, vandalism or movie box office success), allowing them to identify trends slightly sooner.
Using size, view, and edit features, our model predicted edit likelihood reasonably well on a down-sampled, balanced dataset. A Gradient Boosting model proved the most effective of the seven classifiers we trained. With about 50 features drawn from both the main and talk namespaces, and based on the text's size, view count, edit count, and minor edit count, it predicted whether an article would be edited with roughly 76% accuracy on a sampled dataset in which half of the articles were edited. However, more work is needed before it is reliable "in the wild," where fewer than 0.1% of the namespace articles are edited on any given day.
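To make the gap between the balanced dataset and the wild concrete, the minimal sketch below applies Bayes' rule to estimate how often a positive prediction would be correct at different edit rates. It rests on assumptions that are not results from this project: the 76% accuracy is assumed to split evenly into a 76% true-positive rate and a 76% true-negative rate (only overall accuracy is reported), and the daily edit rate is set to 0.1%, which the text gives as an upper bound. The function and constant names are hypothetical.

```python
def precision_at_base_rate(tpr: float, tnr: float, base_rate: float) -> float:
    """Fraction of 'will be edited' predictions that are correct (Bayes' rule).

    tpr: true-positive rate (edited articles correctly flagged)
    tnr: true-negative rate (unedited articles correctly left unflagged)
    base_rate: fraction of articles that are actually edited
    """
    true_pos = tpr * base_rate
    false_pos = (1.0 - tnr) * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)


# Assumption for illustration only: the reported 76% accuracy is split
# evenly between the two classes. The source does not report per-class rates.
ASSUMED_TPR = 0.76
ASSUMED_TNR = 0.76

# Balanced, down-sampled setting: half of the articles were edited.
balanced = precision_at_base_rate(ASSUMED_TPR, ASSUMED_TNR, base_rate=0.5)

# "In the wild" setting: 0.1% of articles edited per day (upper bound in the text).
in_the_wild = precision_at_base_rate(ASSUMED_TPR, ASSUMED_TNR, base_rate=0.001)

print(f"precision on balanced data: {balanced:.4f}")
print(f"precision in the wild:      {in_the_wild:.4f}")
```

Under these assumptions, precision falls from 0.76 on the balanced set to roughly 0.003 in the wild, meaning that only about 3 in every 1,000 flagged articles would actually be edited. This is one way to see why accuracy on a balanced sample does not by itself indicate real-world reliability.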