Offline Reinforcement Learning (RL) addresses problems where simulation or online interaction is impractical, costly, and/or dangerous, which makes it possible to automate a wide range of applications, from healthcare and education to finance and robotics. However, learning new policies from offline data suffers from distributional shift, which leads to extrapolation error that cannot be corrected because no additional exploration is possible. Model-free RL algorithms that regularize the policy to stay close to the behavior policy have limited generalization ability because of their sample complexity. Hence, model-based RL approaches, which first learn the empirical MDP from the offline dataset and then explore freely in the learned environment to find optimal policies, can effectively improve sample efficiency. Nevertheless, state-of-the-art model-based methods either depend on accurate uncertainty estimation techniques or are overly conservative, staying too close to the support of the data.
In this project, we aim to develop more accurate techniques for estimating model uncertainty in model-based offline RL, inspired by the "Random Network Distillation (RND)" idea in this paper. RND was originally used for efficient exploration in RL problems with sparse rewards. In this project, however, we want to use the idea to quantify the uncertainty of the empirical MDP for policy optimization.
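To make the RND idea concrete, the following is a minimal sketch of how a prediction error could serve as an uncertainty signal over state-action pairs. It is not the method this project will end up with: the network sizes, the learning rate, the synthetic "dataset", and the helper names (`init_mlp`, `train_predictor`, `rnd_uncertainty`) are all assumptions made for illustration. A fixed, randomly initialized target network is distilled into a predictor network using only the offline data, so the predictor tends to match the target on inputs that resemble the data and to disagree elsewhere.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden, out_dim, rng):
    """Random weights for a one-hidden-layer tanh network (hypothetical sizes)."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden)),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, out_dim)),
    }

def forward(params, x):
    """Return the network output and the hidden activations."""
    h = np.tanh(x @ params["W1"])
    return h @ params["W2"], h

def rnd_uncertainty(predictor, target, x):
    """Per-sample squared error between predictor and fixed target."""
    y_pred, _ = forward(predictor, x)
    y_tgt, _ = forward(target, x)
    return np.mean((y_pred - y_tgt) ** 2, axis=1)

def train_predictor(predictor, target, x, lr=0.02, steps=3000):
    """Full-batch gradient descent on the distillation loss; target stays frozen."""
    y_tgt, _ = forward(target, x)
    n = x.shape[0]
    for _ in range(steps):
        y_pred, h = forward(predictor, x)
        err = (y_pred - y_tgt) / n                 # gradient of 0.5 * mean squared error
        grad_W2 = h.T @ err
        grad_pre = (err @ predictor["W2"].T) * (1.0 - h ** 2)  # back through tanh
        grad_W1 = x.T @ grad_pre
        predictor["W2"] -= lr * grad_W2
        predictor["W1"] -= lr * grad_W1
    return predictor

# Synthetic stand-in for an offline dataset: 3 state dims + 1 action dim (assumed).
in_dim, hidden, out_dim = 4, 32, 8
data_in = rng.normal(0.0, 1.0, (256, in_dim))        # (s, a) pairs covered by the data
data_ood = rng.normal(0.0, 1.0, (256, in_dim)) + 4.0  # (s, a) pairs far from the data

target = init_mlp(in_dim, hidden, out_dim, rng)      # fixed random network
predictor = init_mlp(in_dim, hidden, out_dim, rng)   # trained to imitate the target
predictor = train_predictor(predictor, target, data_in)

unc_in = rnd_uncertainty(predictor, target, data_in)
unc_ood = rnd_uncertainty(predictor, target, data_ood)
print("mean uncertainty, in-distribution:    ", unc_in.mean())
print("mean uncertainty, out-of-distribution:", unc_ood.mean())
```

One possible way to use such a score during policy optimization, stated here only as an assumption, is to subtract a scaled version of it from the reward predicted by the learned model, so that the policy is discouraged from visiting state-action pairs the dataset does not cover.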
The following references should provide you with a good starting point for your literature review: