Research Seminar in Exploratory Data Analysis
Info 290
2 units
Course Description
Students in this course will expand on their knowledge of techniques for exploratory data analysis (EDA) and collaborate on and contribute to a research project whose goal is to create a new framework for the EDA process.
Topics and goals overview:
Exploratory data analysis is an approach to examining data that emphasizes visually describing and interactively and iteratively inspecting data. EDA is the first step in data analysis, prior to performing confirmatory statistical analysis (such as conducting statistical tests or fitting statistical models; this topic is taught in Info 271B Quantitative Research Methods, which is a terrific complement to this course). This distinction between exploratory and confirmatory statistics was originally championed by mathematical pioneer John Tukey, who said of EDA, “1. It is an attitude, and 2. a flexibility, and 3. some graph paper.”
Exploratory data analysis should be conducted before other types of analysis, in order to:
- evaluate data quality and identify additional data to collect, if necessary,
- suggest questions and hypotheses to pursue, or
- assess assumptions on which later analysis will be based.
Exploratory data analysis techniques include:
- visualization techniques (histograms, scatter plots, parallel coordinates, etc.)
- projection methods (principal component analysis, multidimensional scaling, projection pursuit, t-SNE, etc.)
- unsupervised machine learning (clustering, pattern mining, anomaly detection, etc.)
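As a rough illustration of two of the categories above, the following minimal Python sketch bins values into a text histogram and flags possible anomalies with a simple z-score rule. It is not part of the course materials: the function names, the bin width, the threshold of 3 standard deviations, and the synthetic data are all assumptions chosen for the example, and the z-score rule stands in for the more capable anomaly detection methods the list refers to.

```python
import random
import statistics
from collections import Counter


def histogram(values, bin_width):
    """Count how many values fall into each fixed-width bin.

    Keys are the left edges of the bins (hypothetical helper for illustration).
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def zscore_outliers(values, threshold=3.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing can be flagged.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


if __name__ == "__main__":
    # Synthetic data (an assumption): 200 roughly normal values plus one extreme value.
    rng = random.Random(0)
    data = [rng.gauss(50, 5) for _ in range(200)] + [120.0]

    # Text histogram: one '#' per two observations in each bin.
    for left_edge, count in histogram(data, bin_width=10).items():
        print(f"{left_edge:6.0f} | {'#' * (count // 2)}")

    print("possible outliers:", zscore_outliers(data))
```

In practice, an analyst would look at the histogram first and treat flagged values as prompts for further inspection rather than as errors to delete.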
One challenge is that although there is a multitude of tools for data exploration, there is no established systematic understanding of, or set of rules for, guiding such exploration. Instead, data analysts learn this work through slow trial and error or by apprenticeship to other analysts. Guidelines are therefore needed to measure how much progress has been made in exploring a data set, to ensure complete coverage in exploration, to allow different groups of people to collaborate by exploring a data set individually and later combining their results, and to develop automated and intelligent assistance algorithms for data analysis interfaces.
Students in this course will expand their knowledge of and practice with exploratory data analysis techniques and, at the same time, will develop a repository of EDA case studies to be used to further our understanding of the EDA process. The first part of the course will consist of developing data sets and scenarios of use that can serve as examples of best practices for EDA, both for instruction and for research. The last few weeks of the course will be devoted to converting those examples into a systematic framework or theoretical model that characterizes the EDA process or processes, in order to guide future practice as well as to inform the design of new interactive data analysis tools.
Students will be expected to work together in teams and with the instructor to reach these goals. Because this is a research seminar, students must be comfortable with open-ended problems, with self-directed work, and with setting their own goals.
The primary EDA tool used will be Tableau, but other programming abilities will be needed, e.g., for parsing and analyzing the Tableau log files, for wrangling data sets to get them into the right format, and so on. Students who are interested in the more analytic side of EDA (projection methods, clustering, etc.) and who already have a background in this area will be allowed to work on these problems, but they must come to the course with strengths in those methods, as these will not be the focus of classroom work.
Requirements:
The course is open to graduate students from all fields, at the discretion of the instructor. Students should have taken either:
- Info 247 (Information Visualization and Presentation), or
- CS 294-10 (Information Visualization)
Students will be expected to have:
- Enjoyed the EDA aspects of their infoviz course
- Familiarity with Tableau
- Proficiency in programming and in the use of software engineering tools such as the Unix command line, databases, version control, and a scripting language (Python, MATLAB, or R will be useful)
- Ability to comfortably pick up new programming languages and software tools with minimal guidance