PhD student in statistics
Sarah will present two projects, one completed and one just starting, on the topics of interpretability and causal inference.
Risk-scoring models are common in high-stakes domains such as credit, criminal justice, and healthcare, but are frequently proprietary or opaque. She will describe a proposed approach, transparent model distillation, for auditing such black-box models. Model distillation was first introduced in machine learning to mimic deep neural networks with simpler models. The proposed approach adds the requirement that the mimic model be transparent or interpretable in some sense, and uses the connection between assigned risk (the output of the risk score) and actual risk (whether the outcome actually occurred) to detect regions where the two diverge. She will present results on four public data sets (COMPAS recidivism, Lending Club loans, Stop-and-Frisk, and Chicago Police crime risk scores) and recent work on UCSF data with FRAX fracture risk scores. This is joint work with Rich Caruana, Giles Hooker, and Yin Lou. The paper is available at https://arxiv.org/abs/1710.06169
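The audit idea above can be sketched in a few lines. This is a toy illustration, not the paper's method: it uses synthetic data, a single feature, and a deliberately simple "transparent" model class (one-dimensional binned means) in place of the interpretable additive models the paper employs. The clipped score standing in for the black box is likewise an invented example.

```python
# Toy sketch of a transparent-distillation audit: fit the SAME transparent
# model class to (a) the risk assigned by a black box and (b) the actual
# outcomes, then flag regions where the two fitted models disagree.
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(-3, 3, size=n)                      # a single audit feature
actual = rng.binomial(1, 1 / (1 + np.exp(-x)))      # actual outcomes (0/1)

# Stand-in "black box": a score that under-rates high-x individuals
assigned = 1 / (1 + np.exp(-np.clip(x, None, 1.0)))

def binned_means(x, y, edges):
    """Transparent one-feature model: the mean of y within each bin of x."""
    idx = np.digitize(x, edges)
    return np.array([y[idx == i].mean() for i in range(len(edges) + 1)])

edges = np.linspace(-3, 3, 13)[1:-1]                # 12 bins over x
mimic_curve = binned_means(x, assigned, edges)      # mimics assigned risk
outcome_curve = binned_means(x, actual, edges)      # models actual risk

# Bins where the two transparent models diverge flag regions to audit;
# here the divergence concentrates where the score was clipped (high x).
gap = mimic_curve - outcome_curve
print(np.argmax(np.abs(gap)))                       # most discrepant bin
```

Because both curves come from the same transparent model class, any gap between them is directly attributable to the score's behavior rather than to a difference in model flexibility, which is the point of the distillation-based comparison.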
In the second part of the talk, Sarah will describe a new project to determine the impact of later school start times on health and academic outcomes in NYC public schools. The methods of interest include longitudinal causal inference models, group-based trajectory methods, and individualized treatment rules, among others, and the researchers are interested in meeting UCSF experts in these areas. This is joint work with the NYC Office of School Health.
Sarah Tan is a Statistics PhD student at Cornell University and a visiting student at UCSF, hosted by Professor Charles McCulloch. She works on causal inference and the interpretability of machine learning methods, particularly tree-based methods. Before graduate school she worked at the NYC Department of Health and in the city's public hospital system. She was a 2014 Data Science for Social Good Fellow and has spent summers at Microsoft Research.