Project: Behavior Cloning for Budgeted Reinforcement Learning


Safety is a core challenge for the deployment of reinforcement learning (RL) in real-world applications [1]. In applications such as recommender systems, this means the agent should respect budget constraints [2]. In this case, the RL agent must compute a policy conditioned on the available budget, which is only revealed at deployment time.

In recommender systems, it is common to have a static dataset of user ratings for certain items. This has led to the application of offline RL algorithms [3,4], which learn without direct interaction with the environment. However, it is still an open question how to ensure that these algorithms comply with budget constraints.

Recent work has proposed computing new policies for offline RL using behavior cloning (BC) techniques [5,6,7]. The central insight is to condition the policy on the desired performance. This project aims to use the same principle to develop algorithms for RL with budget constraints, where the budget is only revealed at deployment time. In this case, the agent should optimize its performance while consuming no more than the available budget.
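
As a rough illustration of the conditioning idea, the sketch below relabels logged trajectories with their reward-to-go and cost-to-go, which could then serve as conditioning inputs for a cloned policy. It is only a sketch under stated assumptions, not the method of [5,6,7]: the step layout (state, action, reward, cost), the function names `relabel` and `act`, and the table lookup that stands in for a learned policy are all hypothetical.

```python
from typing import List, Optional, Tuple

# One logged step: (state, action, reward, cost). This layout is an assumption.
Step = Tuple[int, int, float, float]
# One supervised example: (state, reward_to_go, cost_to_go, action).
Example = Tuple[int, float, float, int]


def relabel(trajectory: List[Step]) -> List[Example]:
    """Attach reward-to-go and cost-to-go to each step using suffix sums."""
    examples: List[Example] = []
    reward_to_go = 0.0
    cost_to_go = 0.0
    # Walk backwards so each step accumulates everything that follows it.
    for state, action, reward, cost in reversed(trajectory):
        reward_to_go += reward
        cost_to_go += cost
        examples.append((state, reward_to_go, cost_to_go, action))
    examples.reverse()
    return examples


def act(examples: List[Example], state: int, budget: float) -> Optional[int]:
    """Hypothetical stand-in for a learned budget-conditioned policy.

    Among logged examples for `state` whose cost-to-go fits within `budget`,
    return the action with the highest reward-to-go, or None if none fits.
    """
    feasible = [e for e in examples if e[0] == state and e[2] <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda e: e[1])[3]
```

In a learned variant, the lookup in `act` would be replaced by a model trained on the relabeled examples to predict the action from the state and the conditioning values. The budget would then be supplied as an input at deployment time.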

