Machine Learning Lunch Meeting
How Intelligent Are Current Multimodal Video Models?
Event Details
Everyone is invited to the weekly Machine Learning Lunch Meetings, held on Fridays from 12:30 to 1:30 pm. Faculty members from Computer Sciences, Statistics, ECE, and other departments will discuss their latest research in machine learning. This is an opportunity to network with faculty and fellow researchers, and to learn about the cutting-edge research being conducted at our university. Please see our website for more information.
Speaker: Yong Jae Lee (CS)
Abstract: In this talk, I will present two recent contributions from my lab that challenge and advance the capabilities of multimodal video models. First, I will introduce Vinoground, a new benchmark that evaluates the temporal counterfactual reasoning capabilities of existing models. Spoiler alert: they aren't great (to put it mildly). Second, I will present a novel approach, inspired by Matryoshka nesting dolls, for improving the efficiency of multimodal models. It learns to compress the visual tokens in a nested fashion, significantly reducing the number of tokens that the subsequent language model needs to process. These works were led by Mu Cai and Harris Zhang.

Project pages: https://vinoground.github.io/ and https://matryoshka-mm.github.io/
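To make the nested-compression idea concrete, here is a minimal PyTorch sketch (an illustration for this announcement, not the project's actual code): a square grid of visual tokens from a vision encoder is average-pooled at progressively coarser scales, so each coarser token set is a nested summary of the finer ones. The function name, the 24x24 grid size, and the scale schedule are illustrative assumptions.

```python
# Minimal sketch of Matryoshka-style nested visual token compression.
# Assumptions: a square token grid and an illustrative scale schedule.
import torch
import torch.nn.functional as F

def nested_visual_tokens(tokens: torch.Tensor, grid: int,
                         scales=(24, 12, 6, 3, 1)):
    """tokens: (batch, grid*grid, dim) visual tokens from a vision encoder.
    Returns a dict mapping token count -> pooled tokens, fine to coarse."""
    b, n, d = tokens.shape
    assert n == grid * grid, "expected a square token grid"
    # Reshape to (batch, dim, grid, grid) for 2D pooling.
    x = tokens.transpose(1, 2).reshape(b, d, grid, grid)
    out = {}
    for s in scales:
        # Average-pool the grid down to s x s; each coarser set summarizes
        # the same image with fewer tokens for the language model to process.
        pooled = F.adaptive_avg_pool2d(x, output_size=(s, s))  # (b, d, s, s)
        out[s * s] = pooled.flatten(2).transpose(1, 2)         # (b, s*s, d)
    return out

# Example: a 24x24 grid of 576 tokens compressed to 576/144/36/9/1 tokens.
feats = torch.randn(2, 576, 1024)
for count, toks in nested_visual_tokens(feats, grid=24).items():
    print(count, tuple(toks.shape))
```

Because the coarse token sets are derived from the fine ones, a single model can be trained across all granularities and then serve each query at whatever token budget the deployment allows.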