Offline reinforcement learning (RL) addresses problems where simulation or online interaction is impractical, costly, or dangerous, which makes it possible to automate a wide range of applications, from healthcare and education to finance and robotics. However, learning new policies from offline data suffers from distributional shift, which results in extrapolation error that cannot be corrected because no additional exploration is available. Model-free RL algorithms that regularize the policy to stay close to the behavior policy have limited generalization ability because of their sample complexity. Model-based RL approaches, which first learn an empirical MDP from the offline dataset and then explore freely in the learned environment to find optimal policies, can effectively improve sample efficiency. Nevertheless, state-of-the-art model-based methods either depend on accurate uncertainty estimation techniques or are overly conservative, staying close to the support of the data.
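To make the model-based recipe concrete, the following is a minimal tabular sketch, not the method of any particular paper. It estimates an empirical MDP from logged (state, action, reward, next state) tuples and then plans in the learned model with value iteration. The function names `fit_empirical_mdp` and `plan`, the toy dataset, and the choice to plan only over state-action pairs that appear in the data are all assumptions made for illustration.

```python
from collections import defaultdict


def fit_empirical_mdp(transitions):
    """Estimate P(s' | s, a) and the mean reward R(s, a) from (s, a, r, s') tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    reward_sum = defaultdict(float)
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
    model = {}
    for sa, next_counts in counts.items():
        n = sum(next_counts.values())
        probs = {s2: c / n for s2, c in next_counts.items()}
        model[sa] = (probs, reward_sum[sa] / n)
    return model


def plan(model, states, actions, gamma=0.9, iters=200):
    """Value iteration in the learned model, restricted to (s, a) pairs seen in the data."""
    V = {s: 0.0 for s in states}

    def q(s, a):
        probs, r = model[(s, a)]
        return r + gamma * sum(p * V[s2] for s2, p in probs.items())

    for _ in range(iters):
        for s in states:
            seen = [a for a in actions if (s, a) in model]
            if seen:
                V[s] = max(q(s, a) for a in seen)

    policy = {}
    for s in states:
        seen = [a for a in actions if (s, a) in model]
        if seen:
            policy[s] = max(seen, key=lambda a: q(s, a))
    return V, policy


# Hypothetical offline dataset on a three-state chain; state 2 is absorbing.
dataset = [
    (0, "right", 0.0, 1),
    (0, "right", 0.0, 1),
    (0, "left", 0.0, 0),
    (1, "right", 1.0, 2),
    (1, "left", 0.0, 0),
    (2, "right", 0.0, 2),
]
model = fit_empirical_mdp(dataset)
V, policy = plan(model, states=[0, 1, 2], actions=["left", "right"])
```

Restricting the planner to observed state-action pairs is the simplest form of staying within the support of the data. It avoids querying the model where it has no evidence, but it also means the planner cannot generalize beyond the dataset.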
In this project, we aim to develop a more accurate technique for estimating the dynamics model, inspired by the idea of context-awareness in this paper. While the original idea was designed for cross-domain RL problems, we want to apply it to a single task in offline RL to improve generalization to out-of-distribution (OOD) data. Note that the term "context" generally refers to an abstraction of the series of state-action pairs that lead to the current state.
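As a rough illustration of what a "context" could look like in code, the sketch below summarizes the most recent state-action pairs into a fixed-length vector and appends it to the input of a dynamics model. Mean pooling is used here only as a stand-in for whatever learned encoder a context-aware method would actually use, and the names `build_context`, `model_input`, and the `window` parameter are hypothetical.

```python
def build_context(history, window=3):
    """Mean-pool the last `window` (state, action) vectors into one context vector.

    Assumption: a real context-aware model would learn this encoder;
    mean pooling is only a placeholder.
    """
    recent = history[-window:]
    if not recent:
        raise ValueError("history must contain at least one (state, action) pair")
    joined = [list(s) + list(a) for s, a in recent]
    dim = len(joined[0])
    return [sum(v[i] for v in joined) / len(joined) for i in range(dim)]


def model_input(state, action, context):
    """Concatenate the current state, action, and context for a dynamics model."""
    return list(state) + list(action) + list(context)


# Hypothetical trajectory prefix: 2-D states and 1-D actions.
history = [
    ((0.0, 1.0), (1.0,)),
    ((0.5, 1.0), (0.0,)),
    ((1.0, 0.0), (1.0,)),
    ((1.5, 0.0), (1.0,)),
]
context = build_context(history, window=2)
x = model_input((2.0, 0.0), (0.0,), context)
```

A dynamics model conditioned on `x` rather than on the state and action alone can, in principle, distinguish situations that look identical at the current step but were reached through different histories.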
The following references should provide you with a good starting point for your literature review: