Large-Scale Machine Learning and Statistical Analysis of Dark Matter Halos Using Apache Spark
As part of the Scaling Up! Really Big Data class, we examined a number of machine learning algorithms, in a variety of languages, for exploring large-scale cosmological data in ways that have not previously been attempted. Starting with 2 terabytes of halo catalog data, we built a pipeline in the SoftLayer cloud to preprocess the data for statistical and machine learning analysis. The pipeline is scalable and should be capable of streaming data from a new simulation run into the preprocessor as it is generated, capturing the data time step by time step. In addition, the data can be augmented with observational data to support further statistical correlation and analysis. In this paper, we describe how the pipeline was built and the tools used, present preliminary results, and provide links to our code so that others may reproduce our work and carry out more in-depth analysis in the future.