Scaling Off-Policy Evaluation to High-Dimensional State-Spaces Via State Abstraction
Machine Learning Lunch Meeting: Josiah Hanna, Tuesday Feb 28, 12:15pm CS 1240
You are cordially invited to the weekly CS Machine Learning Lunch Meetings. This is a chance to get to know machine learning professors and talk to your fellow researchers. Our next meeting will be on Tuesday, Feb. 28, 12:15-1:15pm in CS 1240. Professor Josiah Hanna will tell us about reinforcement learning; see the abstract below.
If you would like to be informed of future CS Machine Learning Lunch Meetings, please sign up for our mailing list at https://lists.cs.wisc.edu/mailman/listinfo/mllm. Please use your cs or wisc email. After you enter your email, the system will send you a confirmation email. Only after you respond to that email will you be on the mailing list.
Abstract: In the problem of off-policy evaluation (OPE) in reinforcement learning (RL), our goal is to estimate the performance of an untested evaluation policy using a fixed dataset that was collected by one or more policies that may be different from the evaluation policy. Addressing this problem is crucial for real-world applications of RL in which practitioners want to assess how well a learned policy will perform before it is deployed and its actions have real-life consequences. Current OPE algorithms may produce poor OPE estimates when policy distribution shift is high, i.e., when the probability of a particular state-action pair occurring under the evaluation policy is very different from the probability of that same pair occurring in the observed data (Voloshin et al. 2021; Fu et al. 2021). This issue is particularly common in problems with high-dimensional state-spaces. In this work, we propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space using concepts from the RL state abstraction literature. Specifically, we consider marginalized importance sampling (MIS) OPE algorithms, which compute state-action distribution correction ratios to produce their OPE estimate. In the original ground state-space, these ratios may have high variance, which then leads to high-variance OPE. However, we prove that in the lower-dimensional abstract state-space the ratios can have lower variance, resulting in lower-variance OPE. We then highlight the challenges that arise when estimating the abstract ratios from data, identify sufficient conditions to overcome these issues, and present a minimax optimization problem whose solution yields these abstract ratios. Finally, our empirical evaluation on difficult, high-dimensional state-space OPE tasks shows that the abstract ratios can make MIS OPE estimators achieve lower mean-squared error and be more robust to hyperparameter tuning than the ground ratios.
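For readers new to MIS, the variance claim in the abstract can be illustrated on a toy tabular problem. The sketch below is not the speaker's method: it assumes the state-action distributions are known exactly (the talk's setting estimates the ratios from data via a minimax problem), and the abstraction `phi`, the distributions `d_b` and `d_e`, and the reward table are all hypothetical. It also assumes the reward depends on the ground state only through `phi(s)`. Under those assumptions, the ground and abstract ratios both recover the same expected reward, and the abstract ratio has no larger variance under the data distribution.

```python
import random

random.seed(0)
N_GROUND, N_ABSTRACT, N_ACTIONS = 12, 3, 2


def phi(s):
    """Hypothetical abstraction: ground state index -> abstract state index."""
    return s % N_ABSTRACT


pairs = [(s, a) for s in range(N_GROUND) for a in range(N_ACTIONS)]


def random_distribution():
    """Random, strictly positive distribution over ground (state, action) pairs."""
    mass = {p: random.random() + 0.1 for p in pairs}
    total = sum(mass.values())
    return {p: m / total for p, m in mass.items()}


d_b = random_distribution()  # hypothetical distribution of the logged data
d_e = random_distribution()  # hypothetical distribution under the evaluation policy

# Assumption: the reward depends on the ground state only through phi(s).
r_abs = {(x, a): random.random()
         for x in range(N_ABSTRACT) for a in range(N_ACTIONS)}


def reward(s, a):
    return r_abs[(phi(s), a)]


def aggregate(d):
    """Sum ground probability mass within each abstract (state, action) pair."""
    out = {}
    for (s, a), m in d.items():
        key = (phi(s), a)
        out[key] = out.get(key, 0.0) + m
    return out


d_b_abs, d_e_abs = aggregate(d_b), aggregate(d_e)

# Ground ratio: one correction ratio per ground (state, action) pair.
w_ground = {(s, a): d_e[(s, a)] / d_b[(s, a)] for (s, a) in pairs}
# Abstract ratio: shared by all ground states that map to the same abstract state.
w_abstract = {(s, a): d_e_abs[(phi(s), a)] / d_b_abs[(phi(s), a)]
              for (s, a) in pairs}


def mis_value(w):
    """Expected ratio-weighted reward under the data distribution d_b."""
    return sum(d_b[p] * w[p] * reward(*p) for p in pairs)


def ratio_variance(w):
    """Variance of the ratio under d_b (its mean under d_b is 1)."""
    return sum(d_b[p] * w[p] ** 2 for p in pairs) - 1.0


true_value = sum(d_e[p] * reward(*p) for p in pairs)
print("true value:      ", true_value)
print("ground estimate: ", mis_value(w_ground))
print("abstract estimate:", mis_value(w_abstract))
print("ratio variance (ground, abstract):",
      ratio_variance(w_ground), ratio_variance(w_abstract))
```

In this toy setting the abstract ratio equals the average of the ground ratios within each abstract state-action pair, weighted by the data distribution, which is why its variance cannot exceed that of the ground ratio.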