Talk: Efficient and Accurate Systems for Querying Unstructured Data

Daniel Kang: PhD student in the Stanford DAWN lab

Event Details

Monday, February 14, 2022
4-5 p.m.

Abstract: Over the past 60 years, structured databases have been a runaway success: they are deployed at every major organization and have produced hundreds of billions in value. However, there has been a growing demand for analytics over unstructured data (e.g., videos, audio, text), driven by the low cost of sensors and the rise of ML capabilities. Unfortunately, ML can be prohibitively expensive to deploy (e.g., 10 orders of magnitude more expensive than standard structured analytics) and can produce incorrect results.

In this talk, I'll describe my work on new ML-based query systems to tackle these challenges. My first line of work accelerates large classes of queries by orders of magnitude while providing strong guarantees on query accuracy. I accomplish this by developing novel query processing algorithms, indexing methods, and execution engines for unstructured data queries. I'll also describe how to find errors in human labels and ML model outputs using novel data management systems. Perhaps surprisingly, our systems discovered a large number of errors in a popular autonomous vehicle dataset and can be used to improve ML models. My research has been deployed at an autonomous vehicle company and has enabled new forms of video analytics for ecologists at the Jasper Ridge Biological Preserve.

Bio: Daniel Kang is a sixth-year PhD student in the Stanford DAWN lab, co-advised by Professors Peter Bailis and Matei Zaharia. His research focuses on systems for querying unstructured data. In particular, he focuses on using cheap approximations to accelerate query processing algorithms and on new programming models for ML data management. Daniel is collaborating with autonomous vehicle companies and ecologists to deploy his research. His work is supported in part by the NSF GRFP and the Google PhD Fellowship.