
Project: Curriculum Learning for Constrained Reinforcement Learning in Contextual MDPs


Sample complexity is one of the core challenges in reinforcement learning (RL) [1]. An RL agent often needs orders of magnitude more data than supervised learning methods to achieve reasonable performance. This clashes with safety-critical problems, where the agent should minimize the number of unsafe interactions with the environment [2,3]. Therefore, we must design agents that learn to behave safely as quickly as possible [4].

In this project, we investigate curriculum learning techniques that generate training tasks in an order that allows the agent to learn faster [5,6]. The idea is to start learning on easy tasks and gradually increase their complexity until the agent can solve the target task. In the constrained setting, easy tasks could be those with a large safety threshold or with low costs for interacting with the environment.
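As a minimal sketch of this idea (all names here are hypothetical, not part of the project), a curriculum over the safety budget could simply anneal the cost threshold from a relaxed value down to the target task's budget, training on each stage before tightening:

```python
def safety_curriculum(initial_threshold: float, target_threshold: float,
                      num_stages: int) -> list[float]:
    """Return a sequence of cost thresholds, starting loose (easy to
    satisfy) and tightening linearly toward the target task's budget."""
    if num_stages == 1:
        return [target_threshold]
    step = (target_threshold - initial_threshold) / (num_stages - 1)
    return [initial_threshold + i * step for i in range(num_stages)]


# Hypothetical usage: train on each stage until the agent's expected
# cumulative cost stays below the current threshold, then advance.
stages = safety_curriculum(initial_threshold=100.0,
                           target_threshold=25.0,
                           num_stages=4)
# stages == [100.0, 75.0, 50.0, 25.0]
```

The linear schedule is only a placeholder; self-paced methods such as [5] instead adapt the task distribution based on the agent's current performance.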

Furthermore, a curriculum learning approach based on contextual MDPs [7] has the additional advantage that the agent can generalize its policy to unseen situations. For instance, the agent may learn how to behave safely in two contexts and interpolate its behavior to contexts between them.
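As a toy illustration of such interpolation (purely hypothetical, and far simpler than what a learned context-conditioned policy would do), the behavior for an unseen context lying between two known ones could be a linear blend of the behaviors learned in those contexts:

```python
def interpolate_action(action_low: list[float], action_high: list[float],
                       context: float, ctx_low: float, ctx_high: float) -> list[float]:
    """Blend the actions learned in two known contexts, weighted by where
    the unseen context falls between them (a toy interpolation sketch)."""
    w = (context - ctx_low) / (ctx_high - ctx_low)
    return [(1 - w) * a + w * b for a, b in zip(action_low, action_high)]


# Hypothetical usage: an unseen context halfway between the two known
# contexts yields the midpoint of the two learned actions.
blended = interpolate_action([0.0, 0.0], [1.0, 2.0],
                             context=0.5, ctx_low=0.0, ctx_high=1.0)
# blended == [0.5, 1.0]
```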


  1. Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Gowal, S., and Hester, T. (2021). Challenges of real-world reinforcement learning: definitions, benchmarks and analysis. Machine Learning, 110(9):2419–2468.

  2. Ray, A., Achiam, J., and Amodei, D. (2019). Benchmarking Safe Exploration in Deep Reinforcement Learning.

  3. Yang, Q., Simão, T. D., Tindemans, S. H., and Spaan, M. T. J. (2022). Safety-constrained reinforcement learning with a distributional safety critic. Machine Learning, pages 1–29.

  4. Yang, Q., Simão, T. D., Jansen, N., Tindemans, S. H., and Spaan, M. T. J. (2023). Reinforcement Learning by Guided Safe Exploration. In ECAI.

  5. Klink, P., Abdulsamad, H., Belousov, B., D’Eramo, C., Peters, J., and Pajarinen, J. (2021). A probabilistic interpretation of self-paced learning with applications to reinforcement learning. J. Mach. Learn. Res., 22:182:1–182:52.

  6. Koprulu, C., Simão, T. D., Jansen, N., and Topcu, U. (2023). Risk-aware Curriculum Generation for Heavy-tailed Task Distributions. In UAI, pages 1132–1142.

  7. Kirk, R., Zhang, A., Grefenstette, E., and Rocktäschel, T. (2023). A survey of zero-shot generalisation in deep reinforcement learning. J. Artif. Intell. Res., 76:201–264.

Thiago Simão