MadSystems Seminar -- Yeonju Ro (UT-Austin)
Event Details
Title: Breaking the Monolith: Exposing AI Workload Structure for Efficient and Robust Execution
Abstract: Modern generative AI workloads, from massive Mixture-of-Experts (MoE) models to long-context LLMs and multi-agent workflows, are inherently dynamic. Yet system orchestrators must interact with them as rigid, opaque monoliths hidden behind coarse generation APIs. Because systems lack visibility into an AI workload’s internal control flow, memory flexibility, or structural fragility, they are forced into conservative, worst-case execution. The result is a set of severe system bottlenecks: underutilized batching, rigid memory allocation that triggers OOM crashes, and exhaustive, blocking verification.
In this talk, I will argue that unblocking next-generation AI execution requires exposing the right abstractions to make workload internals visible and actionable to the system. First, I will discuss two recent studies (Read-ME and DSLA) where we actively modified underlying model architectures to expose their internal routing and model elasticity to the system. Leveraging this exposed information, we applied late-binding techniques in batching and serving to reduce inference latency and improve system throughput.
The second half of the talk introduces Sherlock, a runtime for multi-agent workflows that recovers the structural reliability profile of the workflow from the system side, without modifying the underlying LLMs. It features an offline analyzer that uses counterfactual analysis to quantify per-node failure impact on the workflow output, allowing the runtime to selectively insert verifiers only at high-risk nodes. This targeted approach, combined with speculative execution to mask verification latency, creates a robust execution environment that maintains high throughput.
Together, these works demonstrate how opening the AI black box unlocks classic system optimizations, paving the way for highly efficient and robust AI systems.
Bio:
Yeonju Ro is a Ph.D. student at UT Austin, co-advised by Aditya Akella and Atlas Wang. Her research sits at the intersection of computer systems and machine learning, with a focus on algorithm–system co-design for next-generation AI systems. She is also a contributor to the Learning-directed Operating System expedition and the Infra AI Center at UT. She is a 2024 IBM Ph.D. Fellow and a 2024 Qualcomm Fellowship Finalist.
We value inclusion and access for all participants and are pleased to provide reasonable accommodations for this event. Please call 608-867-6867 or email tomy.1516@gmail.com to make a disability-related accommodation request. Reasonable effort will be made to support your request.