Computer Science Machine Learning Lunch Meetings
Large Multimodal (Vision-Language) Models for Image Generation and Understanding
Speaker: Yong Jae Lee
Abstract: Large Language Models and Large Vision Models, also known as Foundation Models, have led to unprecedented advances in language understanding, visual understanding, and AI more broadly. In particular, many computer vision problems, including image classification, object detection, and image generation, have benefited from the capabilities of such models trained on internet-scale text and visual data. In this talk, I'll present our recent work on Large Multimodal (Vision-Language) Models (LMMs) for controllable image generation (GLIGEN) and a language-and-vision chatbot assistant (LLaVA). Since training foundation models from scratch can be prohibitively expensive, a key challenge is how to efficiently and effectively adapt and repurpose them for downstream tasks of interest. I'll provide key insights into how we achieve this and into the models' inner workings, and discuss their limitations and future directions.