Machine Learning Lunch Meeting

How Intelligent Are Current Multimodal Video Models?

Event Details

Date
Friday, December 6, 2024
Time
12:30-1:30 p.m.
Location
Description

Everyone is invited to the weekly Machine Learning Lunch Meetings, held Fridays from 12:30 to 1:30 p.m. Faculty members from Computer Sciences, Statistics, ECE, and other departments will discuss their latest groundbreaking research in machine learning. This is an opportunity to network with faculty and fellow researchers and to learn about the cutting-edge research being conducted at our university. Please see our website for more information.

Speaker: Yong Jae Lee (CS)

Abstract: In this talk, I will present two recent contributions from my lab that challenge and advance the capabilities of multimodal video models. First, I will introduce a new benchmark called Vinoground, which evaluates the temporal counterfactual reasoning capabilities of existing models. Spoiler alert: they aren't great (to put it mildly). Second, I will present a novel approach inspired by the Matryoshka doll to improve the efficiency of multimodal models. It learns to compress the visual tokens in a nested fashion, significantly reducing the number of tokens that the subsequent language model needs to process. These works were led by Mu Cai and Harris Zhang.

Project pages: https://vinoground.github.io/ and https://matryoshka-mm.github.io/
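For readers curious what "nested" visual-token compression can look like, below is a minimal sketch, not the authors' implementation: it assumes a ViT-style square grid of visual tokens and average-pools it at progressively coarser scales, so a single encoding yields several token budgets, each nested inside the previous one. The grid size (24x24) and the specific scales are illustrative assumptions.

```python
# Illustrative sketch of Matryoshka-style nested visual token compression
# (assumed pooling scheme, not the talk's actual code): pool a square grid
# of visual tokens at several scales so the language model can be fed
# anywhere from the full set down to a single token.
import torch
import torch.nn.functional as F

def nested_visual_tokens(tokens: torch.Tensor, grid: int,
                         scales=(24, 12, 6, 3, 1)):
    """tokens: (batch, grid*grid, dim) visual tokens from a vision encoder.
    Returns a dict mapping token count -> pooled tokens, finest scale first."""
    b, n, d = tokens.shape
    assert n == grid * grid, "expected a square token grid"
    # Reshape to (batch, dim, grid, grid) so we can use 2D pooling.
    x = tokens.transpose(1, 2).reshape(b, d, grid, grid)
    out = {}
    for s in scales:  # each scale keeps an s*s grid of averaged tokens
        pooled = F.adaptive_avg_pool2d(x, s)              # (b, d, s, s)
        out[s * s] = pooled.flatten(2).transpose(1, 2)    # (b, s*s, d)
    return out

# Example: 576 tokens (a 24x24 grid) compressed to 576/144/36/9/1 tokens.
feats = torch.randn(2, 576, 1024)
for n_tok, t in nested_visual_tokens(feats, grid=24).items():
    print(n_tok, tuple(t.shape))
```

At inference time one can then pick the token budget that fits the compute available, since the coarser representations are by construction summaries of the finer ones.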

Cost
Free
