Skip to main content

Statistics Seminar

Data without Borders: Game-theoretic Challenges in Democratizing Data presented by Kirthevasan Kandasamy

Event Details

Monday, April 8, 2024
4 p.m.

Abstract: Due to the popularity of machine learning, many organizations view data as an invaluable resource, likening it to the "new oil/gold". However, unlike many types of resources, data is nonrivalrous: it can be freely replicated and used by many. Hence, data produced by one organization, can, in principle, generate limitless value to many others. This will accelerate economic, social, and scientific breakthroughs and benefit society at large. However, considerations of free-riding and competition may prevent such open sharing of data between organizations. An organization may be wary that others may not be contributing a sufficient amount of data, or contributing fabricated/poisoned datasets. In some recent work, we leverage ideas from game theory and robust statistics to design protocols for data sharing. Our methods incentivize organizations to collect and truthfully contribute large amounts of data, so that socially optimal outcomes can be achieved.

In this talk, I will first present a high level view of some of our recent approaches to solving these challenges and focus on a mean estimation problem. Here, a set of strategic agents collect i.i.d samples from a normal distribution at a cost, and wish to estimate the mean of this distribution. To facilitate collaboration, we design mechanisms that incentivize agents to collect a sufficient amount of data and share it truthfully, so that they are all better off than working alone. Our approach prevents under-collection and data fabrication via two key techniques: first, when sharing the others’ data with an agent, the mechanism corrupts this dataset proportional to how much the data reported by the agent differs from the others; second, we design minimax optimal estimators for the corrupted dataset. Our mechanism, which is Nash incentive compatible and individually rational, achieves a social penalty (sum of all agents’ estimation errors and data collection costs) that is at most a factor 2 of the global minimum. We then generalize these results to high dimensional non-Gaussian mean estimation problems.