
Learning from Biased and Costly Data Sources: Minimax-optimal Data Collection under a Budget

Professor Kirthi Kandasamy (Computer Sciences) at Machine Learning Lunch Meetings

Event Details

Date
Tuesday, March 24, 2026
Time
12:15-1:15 p.m.
Location
7th Floor Seminar Room, Morgridge Hall
Description

Data collection is a critical component of modern statistical and machine learning pipelines, particularly when data must be gathered from multiple heterogeneous sources to study a target population of interest. In many use cases, such as medical studies or political polling, different sources incur different sampling costs. Observations often have associated group identities—for example, health markers, demographics, or political affiliations—and the relative composition of these groups may differ substantially, both among the source populations and between sources and target population.

In this work, we study multi-source data collection under a fixed budget, focusing on the estimation of population means and group-conditional means. We show that naive data collection strategies (e.g., attempting to “match” the target distribution) and standard estimators (e.g., the sample mean) can be highly suboptimal. Instead, we develop a sampling plan that maximizes the effective sample size, defined as the total sample size divided by D(q || p) + 1, where q is the target distribution, p is the aggregated source distribution, and D is the chi-squared divergence. We pair this sampling plan with a classical post-stratification estimator and upper bound its risk. We provide matching lower bounds, establishing that our approach achieves the budgeted minimax optimal risk. Our techniques also extend to prediction problems in which the goal is to minimize the excess risk, providing a principled approach to multi-source learning with costly and heterogeneous data sources.
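As a rough illustration of the quantities named in the abstract, the sketch below computes the chi-squared divergence, the resulting effective sample size, and a post-stratified mean over a finite set of groups. It is not the speaker's implementation. The function names and the toy numbers are hypothetical. It also assumes that p is the vector of group proportions in the pooled sample, and it uses the standard definition D(q || p) = sum over groups of (q_g - p_g)^2 / p_g.

```python
def chi_squared_divergence(q, p):
    """Standard chi-squared divergence D(q || p) over a finite set of groups."""
    return sum((qg - pg) ** 2 / pg for qg, pg in zip(q, p))


def effective_sample_size(n, q, p):
    """Total sample size n divided by D(q || p) + 1, as stated in the abstract."""
    return n / (chi_squared_divergence(q, p) + 1.0)


def post_stratified_mean(q, samples_by_group):
    """Weight each group's sample mean by its target proportion q_g.

    Assumes every group has at least one observation.
    """
    return sum(qg * sum(ys) / len(ys) for qg, ys in zip(q, samples_by_group))


# Hypothetical toy setup: two groups, equally common in the target population
# but sampled 80/20 in the pooled data.
q = [0.5, 0.5]
p = [0.8, 0.2]

# D(q || p) = 0.09 / 0.8 + 0.09 / 0.2 = 0.5625, so 1000 samples behave
# like about 640 samples drawn directly from the target.
print(effective_sample_size(1000, q, p))

# The plain sample mean of these five values is 1.4; reweighting by q gives 2.0.
print(post_stratified_mean(q, [[1, 1, 1, 1], [3]]))
```

When p equals q the divergence is zero and the effective sample size equals the total sample size. The further the pooled group mix drifts from the target, the more samples are effectively lost.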

(This talk is part of the weekly Machine Learning Lunch Meetings (MLLM), held every Tuesday from 12:15 to 1:15 p.m. Professors from Computer Sciences, Statistics, ECE, the iSchool, and other departments will discuss their latest research in machine learning, covering both theory and applications. This is a great opportunity to network with faculty and fellow researchers, learn about cutting-edge research at our university, and foster new collaborations. For the talk schedule, please visit https://sites.google.com/view/wiscmllm/home. To receive future weekly talk announcements, please subscribe to our UW Google Group at https://groups.google.com/u/1/a/g-groups.wisc.edu/g/mllm.)

Cost
Free
Accessibility

We value inclusion and access for all participants and are pleased to provide reasonable accommodations for this event. Please call 608-334-7269 or email jerryzhu@cs.wisc.edu to make a disability-related accommodation request. Reasonable effort will be made to support your request.
