Individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive biological processes presents a substantial statistical challenge. Sequentially testing multiplicative interaction terms in a regression model is intractable for high-order interactions in genome-scale data. Building on fundamental principles of data science – predictability, computability, and stability – we developed iterative random forests (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order, rule-based interactions at the same order of computational cost as a random forest (RF). We demonstrate the utility of iRF in two prediction problems: enhancer activity in the Drosophila embryo and red hair in the UK Biobank cohort. In the UK Biobank cohort, we identify both previously reported and novel interactions associated with hair color that represent forms of non-linearity not captured by logistic regression models. By decoupling the order of interactions from the computational cost of identifying them, iRF opens additional avenues of inquiry into the molecular mechanisms underlying genome biology.
Speaker: Karl Kumbier, PhD, Postdoctoral Researcher, UCSF
Register: http://eepurl.com/g1X35P