Data and ML for Data Prep for ML: The ML Data Prep Zoo
Arun Kumar (UCSD)
There is growing demand in the enterprise, Web, sciences, healthcare, and other domains for tools that make it easier to adopt machine learning (ML) for data analytics. In response, "ML platforms," including automated ML (AutoML) platforms, have emerged to systematize and even automate the whole ML workflow. Examples include Salesforce's Einstein, Google's TensorFlow Extended, and Amazon's AutoGluon. Such tools obviate or reduce manual grunt work for data preparation (prep), feature engineering, and/or model building. While the ML world has long studied the last two stages, the first stage is poorly understood. No objective data exists on just how "good" such platforms are at automating data prep!
I present our vision of scientifically benchmarking ML platforms on the quality of their automated data prep. We systematize and formalize many common data prep tasks and create labeled benchmark datasets to enable objective comparisons for the first time. Our datasets also enable a new approach to automating data prep: use ML itself instead of the ad hoc rule-based heuristics that such platforms use today. As a detailed case study, I will discuss our work on a key data prep task on tabular data: inferring ML feature types. Our ML-based approach offers much higher accuracy than state-of-the-art ML platforms. All of our datasets and models are released on a public "zoo," which also hosts a competition with leaderboards to invite community contributions. I will conclude by speculating on what other parts of ML platforms can benefit from using ML themselves.
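To make the feature type inference task concrete, here is a minimal Python sketch of why a purely syntactic rule can mislabel a column. It is an illustration only, not the method from the talk or the heuristics of any particular platform: the function names, the rule, the signals, and the sample column are all hypothetical.

```python
from typing import Dict, List


def rule_based_type(values: List[str]) -> str:
    """Hypothetical syntactic rule: 'numeric' if every value parses as a float."""
    def parses_as_float(v: str) -> bool:
        try:
            float(v)
            return True
        except ValueError:
            return False

    return "numeric" if all(parses_as_float(v) for v in values) else "categorical"


def column_signals(name: str, values: List[str]) -> Dict[str, object]:
    """Hypothetical per-column signals that a learned classifier could take as input."""
    return {
        "name": name.lower(),                               # column name as a text signal
        "distinct_ratio": len(set(values)) / len(values),   # share of unique values
        "mean_length": sum(len(v) for v in values) / len(values),
    }


# Made-up sample column: zip codes look like integers but act as categories.
zip_codes = ["92093", "92122", "92037", "92093"]
heuristic_label = rule_based_type(zip_codes)    # the rule calls this column "numeric"
signals = column_signals("ZipCode", zip_codes)  # inputs a trained model might weigh instead
```

The rule labels the zip code column as numeric because every value parses as a number, even though such a column is usually treated as categorical. A classifier trained on labeled columns could, in principle, pick up on signals such as the column name or the fixed value length to avoid that mistake.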
Project Webpage: https://adalabucsd.github.io/sortinghat.html
1) Main vision paper: https://adalabucsd.github.io/papers/2019_DataPrepZoo_DEEM.pdf
2) Technical report for a paper under submission on feature type inference: https://adalabucsd.github.io/papers/TR_2020_SortingHat.pdf. Please do not repost this on a public webpage. I am fine with the PDF being circulated among your class.
Bio: Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering and the Halicioglu Data Science Institute at the University of California, San Diego. He is a member of the Database Lab and Center for Networked Systems and an affiliate member of the AI Group. His primary research interests are in data management and systems for machine learning/artificial intelligence-based data analytics. Systems and ideas based on his research have been released as part of the Apache MADlib open-source library, shipped as part of products from Cloudera, IBM, Oracle, and Pivotal, and used internally by Facebook, Google, LogicBlox, Microsoft, and other companies. He is a recipient of two SIGMOD research paper awards, a SIGMOD Research Highlight Award, three distinguished reviewer awards from SIGMOD/VLDB, the PhD dissertation award from UW-Madison CS, an NSF CAREER Award, a Hellman Fellowship, a UCSD oSTEM Faculty of the Year Award, and research award gifts from Google, Oracle, and VMware.