BART Density Estimator
Our mission is to increase BART ridership through accurate ridership predictions for riders and administrators.
BART ridership was decimated by COVID lockdowns and fails to recover pre-lockdown levels as other mass transit systems are tracking toward. Transit patterns are changing rapidly in the post-pandemic economic landscape, and it is crucial that BART adapt to meet the needs of the population and encourage more ridership.
Our team created an application to predict the busy-ness of a BART train on a given route at a given hour. This solution addresses one aspect of rider consideration: comfort as experienced through an available seat and space apart from other riders. Built as a web app, our solution can be easily converted to a mobile application or integrated to Google/Apple Maps with some work.
Data
We combined data from six sources to train and evaluate our model.
- BART trips. 110 million rows representing 1.1 billion trips between all stations, 2011-2022.
- Stations. IDs and locations of 50 stations.
- Weather. 12 years of daily weather from 4 stations covering the Bay Area.
- Events & holidays. Calendar items impacting all or geo-specific stations.Google trends. Search data for "BART schedule."
Data Science Approach
We used four regression models from MLlib in pySpark: Linear Regression, Decision Trees Regression, Random Forest Regression, and Gradient-boosted Trees Regression. We trained these models with hyperparameter tuning and TimeSeries-Split Cross-validation to get the best results. The best performing model was the Gradient-boosted Trees model.
Evaluation
We chose the coefficient of determination (R2) as well as the Root-Mean-Squared-Error (RMSE) to evaluate the success of our models. Our baseline model which was a Linear regression model yielded a R2 score of 0.28 and RMSE of 0.37. Our best model yielded a goodness of fit (R2) score of 0.73 and RMSE of 0.26.
Key Learnings
- Ridership motivations and obstacles are varied and complex. Rider interviews revealed intersecting concerns of gender and safety, class and autonomy, and macroeconomic environmental variables.
- Technical considerations should be made in totality. We learned that our commitment to Spark isn't natively integrated into AWS's Sagemaker platform. This complicated and slowed our end-to-end development.