AI Safety and Theoretical Computer Science, Scott Aaronson (University of Texas at Austin)
Event Details
Abstract:
Progress on AI safety and alignment, like the current AI revolution more generally, has been almost entirely empirical. In this talk, however, I'll survey a few areas where I think theoretical computer science can contribute to AI safety, including:
- How can we robustly watermark the outputs of Large Language Models and other generative AI systems, to help identify academic cheating, deepfakes, and AI-enabled fraud? I'll explain my proposal and its basic mathematical properties, as well as what remains to be done.
- Can one insert undetectable cryptographic backdoors into neural nets, for good or ill? In what senses can those backdoors also be unremovable? How robust are they against fine-tuning?
- Should we expect neural nets to be "generically" interpretable? I'll discuss a beautiful formalization of that question due to Paul Christiano, along with some initial progress on it, and an unexpected connection to quantum computing.
We value inclusion and access for all participants and are pleased to provide reasonable accommodations for this event. Please email dieter@cs.wisc.edu to make a disability-related accommodation request. Reasonable effort will be made to support your request.