
Research Article
Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking
@INPROCEEDINGS{10.1007/978-3-031-48885-6_3,
  author={S. R. Eshwar and Shishir Kolathaya and Gugan Thoppe},
  title={Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking},
  proceedings={Performance Evaluation Methodologies and Tools. 16th EAI International Conference, VALUETOOLS 2023, Crete, Greece, September 6--7, 2023, Proceedings},
  proceedings_a={VALUETOOLS},
  year={2024},
  month={1},
  keywords={Reinforcement learning; Evolutionary strategies; Off-policy ranking; ARS; TRES},
  doi={10.1007/978-3-031-48885-6_3}
}
S. R. Eshwar
Shishir Kolathaya
Gugan Thoppe
Year: 2024
Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking
VALUETOOLS
Springer
DOI: 10.1007/978-3-031-48885-6_3
Abstract
Evolution Strategy (ES) is a potent black-box optimization technique based on natural evolution. A key step in each ES iteration is the ranking of candidate solutions based on some fitness score. In the Reinforcement Learning (RL) context, this step entails evaluating several policies. Presently, this evaluation is done via on-policy approaches: each policy’s score is estimated by interacting several times with the environment using that policy. Such ideas lead to wasteful interactions since, once the ranking is done, only the data associated with the top-ranked policies are used for subsequent learning. To improve sample efficiency, we introduce a novel off-policy ranking approach using a local approximation for the fitness function. We demonstrate our idea for two leading ES methods: Augmented Random Search (ARS) and Trust Region Evolution Strategy (TRES). MuJoCo simulations show that, compared to the original methods, our off-policy variants have similar running times for reaching reward thresholds but need only around 70% as much data on average. In fact, in some tasks like HalfCheetah-v3 and Ant-v3, we need just 50% as much data. Notably, our method supports extensive parallelization, enabling our ES variants to be significantly faster than popular non-ES RL methods like TRPO, PPO, and SAC.
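To make the ranking step concrete, the sketch below shows a vanilla ARS-style iteration with on-policy ranking, i.e., the baseline that the paper's off-policy ranking replaces. This is an illustrative sketch only: the gymnasium library, the Pendulum-v1 task, and all hyperparameters are assumptions, and the paper's off-policy scoring and TRES variant are not shown.

# Minimal sketch of a vanilla ARS-style loop with on-policy ranking.
# Environment choice and hyperparameters are illustrative assumptions.
import numpy as np
import gymnasium as gym

def rollout(env, M, horizon=200):
    """Run one episode with the linear policy a = M @ s and return total reward."""
    obs, _ = env.reset()
    total = 0.0
    for _ in range(horizon):
        action = np.clip(M @ obs, env.action_space.low, env.action_space.high)
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            break
    return total

def ars(env_id="Pendulum-v1", iters=50, n_dirs=8, top_b=4, nu=0.05, alpha=0.02):
    env = gym.make(env_id)
    p, n = env.action_space.shape[0], env.observation_space.shape[0]
    M = np.zeros((p, n))
    for _ in range(iters):
        deltas = [np.random.randn(p, n) for _ in range(n_dirs)]
        # On-policy ranking: every perturbed policy gets fresh rollouts,
        # but only the top-ranked directions feed the update below.
        # This evaluation step is what the paper replaces with off-policy scoring.
        scored = [(rollout(env, M + nu * d), rollout(env, M - nu * d), d) for d in deltas]
        scored.sort(key=lambda t: max(t[0], t[1]), reverse=True)
        top = scored[:top_b]
        sigma = np.std([r for rp, rm, _ in top for r in (rp, rm)]) + 1e-8
        M = M + alpha / (top_b * sigma) * sum((rp - rm) * d for rp, rm, d in top)
    return M

if __name__ == "__main__":
    ars()

In this on-policy baseline, the rollouts used to score the discarded directions are wasted once the ranking is done; the paper's off-policy variants avoid this by estimating the fitness of candidate policies from previously collected data via a local approximation of the fitness function.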