Speaker: Weijie Su, PhD, Associate Professor of Statistics & Data Science, University of Pennsylvania
In this talk, we advocate for the development of rigorous statistical foundations for large language models (LLMs), motivated by the probabilistic, autoregressive nature of next-token prediction and the complexity and black-box structure of Transformer architectures. We illustrate how statistical insights can directly benefit LLM development through two concrete examples. First, we demonstrate statistical inconsistencies and biases induced by current approaches to aligning LLMs with human preferences, and propose a regularization term that is both necessary and sufficient to ensure consistent alignment. Second, we introduce a novel statistical framework for analyzing the efficiency of watermarking schemes, focusing on a watermarking method developed by OpenAI, for which we derive optimal detection rules that outperform existing approaches. Collectively, these results show how statistical principles can address pressing challenges in LLMs while opening new research directions in responsible generative AI.
The Department of Epidemiology & Biostatistics welcomes all participants to our events. If you need a reasonable accommodation to participate in this event because of a disability, please contact Liz Buggs ([email protected]).