Distinguished Lecture: Data Balancing and Multimodal AI
Zaid Harchaoui: Professor, University of Washington in Seattle, Department of Statistics, Allen School of Computer Science and Engineering
Event Details
LIVE STREAM: https://uwmadison.zoom.us/j/99279325658?pwd=THwBVQXamRc8FrOzo2ETbFsayFiNuC.1
Abstract: Data balancing across multiple modalities and sources occurs in various forms in many foundation models, including OpenAI’s CLIP and Meta’s DINO. These models yield versatile feature representations across numerous domains. We show that data balancing enjoys an unsuspected benefit: reducing the variance of estimators defined as functionals of the empirical distribution over these sources. After describing data balancing as alternating information projections, we will present non-asymptotic statistical bounds quantifying this variance reduction effect. We will discuss how the amount of variance reduction achieved by data balancing can be characterized by the eigenvalue decay of appropriately defined Markov operators, and compare it to natural baselines. We will end with illustrations in contrastive multimodal learning and self-supervised clustering.
This is based on joint work with Lang Liu, Ronak Mehta, and Soumik Pal. Preprint: https://arxiv.org/abs/2408.15065
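For readers unfamiliar with the phrase "alternating information projections" in the abstract, the following is a minimal sketch, not the speaker's implementation. It assumes two finite-valued sources, an empirical joint distribution stored as a table, and known target marginals; under those assumptions, alternately rescaling rows and columns (classical iterative proportional fitting) performs alternating KL projections onto the two marginal constraints. The function name `balance`, the uniform targets, and the random counts are all hypothetical choices made for illustration.

```python
import numpy as np

def balance(joint, row_marginal, col_marginal, n_iters=200):
    """Alternately rescale rows and columns of a joint probability table.

    Each rescaling is a KL (information) projection onto the set of tables
    with the prescribed row (resp. column) marginal. Assumes all entries of
    `joint` are strictly positive so no division by zero occurs.
    """
    p = np.asarray(joint, dtype=float)
    p = p / p.sum()  # normalize counts into a probability table
    for _ in range(n_iters):
        # Projection 1: match the target row marginal.
        p = p * (row_marginal / p.sum(axis=1))[:, None]
        # Projection 2: match the target column marginal.
        p = p * (col_marginal / p.sum(axis=0))[None, :]
    return p

# Hypothetical example: a 3 x 4 table of positive co-occurrence counts,
# balanced toward uniform marginals on both sources.
rng = np.random.default_rng(0)
counts = rng.integers(1, 20, size=(3, 4))
target_rows = np.full(3, 1 / 3)
target_cols = np.full(4, 1 / 4)
balanced = balance(counts, target_rows, target_cols)
```

After enough iterations the balanced table matches both target marginals simultaneously. The talk concerns the statistical effect of applying such a correction before computing estimators, which this sketch does not attempt to reproduce.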
Biography: Zaid Harchaoui is a Professor at the University of Washington in Seattle, in the Department of Statistics and in the Allen School of Computer Science and Engineering, and a Senior Data Science Fellow in the eScience Institute. He is an action editor at the Journal of Machine Learning Research and an associate editor at the Journal of the Royal Statistical Society - Series B. He is a principal investigator and a cofounder of IFML, the NSF-AI Institute on Foundations of Machine Learning, and of IFDS, the NSF-TRIPODS Institute on Foundations of Data Science. He obtained his doctoral degree from Telecom Paris - Institut Polytechnique de Paris, for research performed at CNRS - the French National Institute for Fundamental Research. He previously held appointments at the Courant Institute of Mathematical Sciences at New York University, and at INRIA - the French National Institute for Research in Digital Science and Technology.