MIDS Capstone Project Spring 2023

BART Density Estimator

Our mission is to increase BART ridership through accurate ridership predictions for riders and administrators.

BART ridership was decimated by COVID lockdowns and has failed to recover to pre-lockdown levels, even as other mass transit systems are tracking toward them. Transit patterns are changing rapidly in the post-pandemic economic landscape, and it is crucial that BART adapt to meet the needs of the population and encourage more ridership.

Our team created an application to predict how busy a BART train will be on a given route at a given hour. This solution addresses one aspect of what riders consider: comfort, experienced as an available seat and space apart from other riders. Built as a web app, our solution could be converted to a mobile application or integrated into Google or Apple Maps with some work.

Data

We combined data from six sources to train and evaluate our model.

  • BART trips. 110 million rows representing 1.1 billion trips between all stations, 2011-2022.
  • Stations. IDs and locations of 50 stations.
  • Weather. 12 years of daily weather from 4 weather stations covering the Bay Area.
  • Events & holidays. Calendar items impacting all or geo-specific stations.
  • Google Trends. Search data for "BART schedule."

Data Science Approach

We used four regression models from MLlib in PySpark: Linear Regression, Decision Tree Regression, Random Forest Regression, and Gradient-Boosted Tree Regression. We trained these models with hyperparameter tuning and time-series split cross-validation to get the best results. The best-performing model was the gradient-boosted trees model.
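To illustrate the idea behind time-series split cross-validation, here is a minimal sketch in plain Python. It is not the team's PySpark pipeline and does not use MLlib. The function name `time_series_splits` and the expanding-window scheme are assumptions made for illustration. The point it shows is that each validation fold comes strictly after its training rows in time, so the model is never evaluated on data older than what it was trained on.

```python
def time_series_splits(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for expanding-window CV.

    Rows are assumed to be sorted by time already (oldest first).
    Hypothetical helper for illustration, not an MLlib function.
    """
    if n_folds < 1 or n_samples < n_folds + 1:
        raise ValueError("need at least n_folds + 1 samples")

    # Split the timeline into n_folds + 1 blocks of roughly equal size.
    fold_size = n_samples // (n_folds + 1)

    for k in range(1, n_folds + 1):
        train_end = k * fold_size
        # The last fold absorbs any leftover rows at the end of the timeline.
        val_end = train_end + fold_size if k < n_folds else n_samples
        yield list(range(train_end)), list(range(train_end, val_end))


if __name__ == "__main__":
    # Ten time-ordered rows, three folds: training data always precedes
    # the validation data.
    for train_idx, val_idx in time_series_splits(10, 3):
        print("train:", train_idx, "validate:", val_idx)
```

In a real workflow, each candidate hyperparameter setting would be fit on every training window and scored on the matching validation window, and the setting with the best average score would be kept.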

Evaluation

We evaluated our models with the coefficient of determination (R²) and the root-mean-squared error (RMSE). Our baseline, a linear regression model, yielded an R² of 0.28 and an RMSE of 0.37. Our best model yielded an R² of 0.73 and an RMSE of 0.26.
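For readers unfamiliar with the two metrics, the following sketch computes them from scratch in plain Python. The function names and the sample values are hypothetical and are not taken from the project's data. The project itself used PySpark, so this only illustrates the standard definitions.

```python
import math


def rmse(actual, predicted):
    """Root-mean-squared error: typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: share of variance explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Made-up values, for illustration only.
    actual = [1.0, 2.0, 3.0, 4.0]
    predicted = [2.0, 2.0, 4.0, 4.0]
    print("RMSE:", rmse(actual, predicted))
    print("R^2:", r_squared(actual, predicted))
```

A lower RMSE and a higher R² both indicate a better fit, which is the direction of change between the baseline and the best model reported above.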

Key Learnings

  • Ridership motivations and obstacles are varied and complex. Rider interviews revealed intersecting concerns of gender and safety, class and autonomy, and macroeconomic environmental variables. 
  • Technical considerations should be weighed as a whole. We learned that Spark, which we had committed to, isn't natively integrated with AWS SageMaker. This complicated and slowed our end-to-end development.
Last updated: April 24, 2023