
Research Article
Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking
@INPROCEEDINGS{10.1007/978-3-031-48885-6_3,
  author={S. R. Eshwar and Shishir Kolathaya and Gugan Thoppe},
  title={Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking},
  proceedings={Performance Evaluation Methodologies and Tools. 16th EAI International Conference, VALUETOOLS 2023, Crete, Greece, September 6--7, 2023, Proceedings},
  proceedings_a={VALUETOOLS},
  year={2024},
  month={1},
  keywords={Reinforcement learning; Evolutionary strategies; Off-policy ranking; ARS; TRES},
  doi={10.1007/978-3-031-48885-6_3}
}
S. R. Eshwar
Shishir Kolathaya
Gugan Thoppe
Year: 2024
Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking
VALUETOOLS
Springer
DOI: 10.1007/978-3-031-48885-6_3
Abstract
Evolution Strategy (ES) is a potent black-box optimization technique based on natural evolution. A key step in each ES iteration is the ranking of candidate solutions based on some fitness score. In the Reinforcement Learning (RL) context, this step entails evaluating several policies. Presently, this evaluation is done via on-policy approaches: each policy’s score is estimated by interacting several times with the environment using that policy. Such ideas lead to wasteful interactions since, once the ranking is done, only the data associated with the top-ranked policies are used for subsequent learning. To improve sample efficiency, we introduce a novel off-policy ranking approach using a local approximation for the fitness function. We demonstrate our idea for two leading ES methods: Augmented Random Search (ARS) and Trust Region Evolution Strategy (TRES). MuJoCo simulations show that, compared to the original methods, our off-policy variants have similar running times for reaching reward thresholds but need only around 70% as much data on average. In fact, in some tasks like HalfCheetah-v3 and Ant-v3, we need just 50% as much data. Notably, our method supports extensive parallelization, enabling our ES variants to be significantly faster than popular non-ES RL methods like TRPO, PPO, and SAC.
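To make the ranking step concrete, the sketch below shows a vanilla ARS-style iteration with on-policy ranking, i.e., the baseline that the paper's off-policy ranking replaces. This is an illustrative sketch only: the gymnasium library, the Pendulum-v1 task, and all hyperparameters are assumptions, and the paper's off-policy scoring and TRES variant are not shown.

# Minimal sketch of a vanilla ARS-style loop with on-policy ranking.
# Environment choice and hyperparameters are illustrative assumptions.
import numpy as np
import gymnasium as gym

def rollout(env, M, horizon=200):
    """Run one episode with the linear policy a = M @ s and return total reward."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(horizon):
        action = np.clip(M @ obs, env.action_space.low, env.action_space.high)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total

def ars(env_id="Pendulum-v1", iters=50, n_dirs=8, top_b=4, nu=0.05, alpha=0.02):
    env = gym.make(env_id)
    p, n = env.action_space.shape[0], env.observation_space.shape[0]
    M = np.zeros((p, n))
    for _ in range(iters):
        deltas = [np.random.randn(p, n) for _ in range(n_dirs)]
        # On-policy ranking: every perturbed policy gets fresh rollouts,
        # but only the top-ranked directions feed the update below.
        # This evaluation step is what the paper replaces with off-policy scoring.
        scored = [(rollout(env, M + nu * d), rollout(env, M - nu * d), d) for d in deltas]
        scored.sort(key=lambda t: max(t[0], t[1]), reverse=True)
        top = scored[:top_b]
        sigma = np.std([r for rp, rm, _ in top for r in (rp, rm)]) + 1e-8
        M = M + alpha / (top_b * sigma) * sum((rp - rm) * d for rp, rm, d in top)
    return M

if __name__ == "__main__":
    ars()

In this on-policy baseline, the rollouts used to score the discarded directions are wasted once the ranking is done; the paper's off-policy variants avoid this by estimating the fitness of candidate policies from previously collected data via a local approximation of the fitness function.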