ML+X Forum: Multimodal Learning
ML+X (Machine Learning Community) Event
Event Details
Join the ML+X community on Tuesday, Sept 19, 12-1pm, to explore how researchers are harnessing the power of multimodal machine learning, incorporating diverse data types (e.g., audio and visual) into a unified model. See the speaker lineup below for details, and please register (lunch provided) if you plan to attend!
- Multimodal learning and analysis for understanding single-cell functional genomics in brains and brain diseases, Daifeng Wang
Robust phenotype-genotype associations have been established for human brains and brain disorders. However, understanding the cellular and molecular causes from genotype to phenotype remains elusive. To this end, recent scientific projects have generated large multi-modal datasets such as various omics data at the single-cell and bulk-tissue levels. However, integrating these large-scale multi-modal data and discovering underlying functional mechanisms are still challenging. To address these challenges, I will introduce our recent machine-learning work on analyzing single-cell multimodal data to improve genotype-phenotype prediction and to interpret cell-type functional genomics and gene regulation in neuropsychiatric and neurodegenerative diseases.
- Transforming healthcare: AI-enhanced disease quantification with vision-language models, Zachary Huemann
Our project's primary objective is to assist physicians in delivering personalized medical care by developing AI tools for quantifying disease burden in patients. To achieve this, we employ a fusion of vision and language models. We utilize free-text reports generated by physicians to guide image analysis, enhancing the accuracy of disease segmentation. This integration is accomplished through a cross-attention mechanism, connecting word embeddings from a large language model with vision features from a vision model. One notable challenge we encountered throughout the project was the limited availability of high-quality physician-generated segmentations for our training data.
- The benefits of early fusion: deeply integrated audio-visual representation learning, Pedro Morgado
Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. Training models to replicate this early fusion, however, presents complexities due to their enhanced expressivity. In this work, we address this challenge by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion. We also introduce an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions. Through a series of studies on audio-visual downstream tasks including audio-event classification, visual sound localization, sound separation, and audio-visual segmentation, we demonstrate the efficacy of the proposed architecture and learning procedure. In essence, our approach paves the way for efficient training of deeply integrated audio-visual models and elevates the usefulness of audio-visual early fusion architectures.
Finding the Orchard View room: The Orchard View room is located on the 3rd floor of the Discovery Building (room 3280). To get to the third floor, take the elevator located next to the Aldo’s Cafe kitchen (see photo). If you cannot attend in person, we invite you to stream the event via Zoom.
Join the ML+X Google group: The ML+X community has a Google group it uses to send reminders about its upcoming events. If you aren't already a member of the Google group, you can use this link to join. Note that you have to be signed into a Google account to join the group. If you have any trouble joining, please email facilitator@datascience.wisc.edu.