Performance Evaluation Methodologies and Tools. 16th EAI International Conference, VALUETOOLS 2023, Crete, Greece, September 6–7, 2023, Proceedings

Research Article

Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking

Cite
BibTeX
    @INPROCEEDINGS{10.1007/978-3-031-48885-6_3,
        author={S. R. Eshwar and Shishir Kolathaya and Gugan Thoppe},
        title={Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking},
        proceedings={Performance Evaluation Methodologies and Tools. 16th EAI International Conference, VALUETOOLS 2023, Crete, Greece, September 6--7, 2023, Proceedings},
        proceedings_a={VALUETOOLS},
        year={2024},
        month={1},
        keywords={Reinforcement learning; Evolutionary strategies; Off-policy ranking; ARS; TRES},
        doi={10.1007/978-3-031-48885-6_3}
    }
Plain Text

    S. R. Eshwar, Shishir Kolathaya, Gugan Thoppe. "Improving Sample Efficiency in Evolutionary RL Using Off-Policy Ranking." VALUETOOLS, Springer, 2024. DOI: 10.1007/978-3-031-48885-6_3
S. R. Eshwar*, Shishir Kolathaya, Gugan Thoppe
    *Contact email: eshwarsr@iisc.ac.in

    Abstract

    Evolution Strategy (ES) is a potent black-box optimization technique based on natural evolution. A key step in each ES iteration is the ranking of candidate solutions based on some fitness score. In the Reinforcement Learning (RL) context, this step entails evaluating several policies. Presently, this evaluation is done via on-policy approaches: each policy’s score is estimated by interacting several times with the environment using that policy. Such ideas lead to wasteful interactions since, once the ranking is done, only the data associated with the top-ranked policies are used for subsequent learning. To improve sample efficiency, we introduce a novel off-policy ranking approach using a local approximation for the fitness function. We demonstrate our idea for two leading ES methods: Augmented Random Search (ARS) and Trust Region Evolution Strategy (TRES). MuJoCo simulations show that, compared to the original methods, our off-policy variants have similar running times for reaching reward thresholds but need only around 70% as much data on average. In fact, in some tasks like HalfCheetah-v3 and Ant-v3, we need just 50% as much data. Notably, our method supports extensive parallelization, enabling our ES variants to be significantly faster than popular non-ES RL methods like TRPO, PPO, and SAC.
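
    The sketch below illustrates, in Python, the kind of ARS-style iteration the abstract refers to: sample random perturbation directions, score the perturbed policies with rollouts, rank the directions by fitness, and update the policy using only the top-ranked ones. It is a toy illustration under stated assumptions, not the authors' implementation: the environment is a synthetic stand-in (the rollout function and all hyperparameters such as STATE_DIM, N_DIRECTIONS, and NOISE_STD are made up), and the paper's off-policy estimator is not reproduced; the off-policy variants would replace the fresh on-policy rollout scores used in the ranking step with estimates computed from already collected data.

    # Toy ARS-style loop (assumptions throughout; not the paper's code).
    import numpy as np

    rng = np.random.default_rng(0)

    STATE_DIM, ACTION_DIM = 4, 2      # toy problem sizes (assumption)
    N_DIRECTIONS, TOP_B = 8, 4        # perturbations sampled, directions kept
    STEP_SIZE, NOISE_STD = 0.02, 0.03


    def rollout(policy_matrix, horizon=50):
        """Stand-in for an on-policy rollout: returns one episode's total reward.

        A synthetic linear 'environment' is used so the example runs without
        any simulator such as MuJoCo installed.
        """
        state = rng.standard_normal(STATE_DIM)
        total_reward = 0.0
        for _ in range(horizon):
            action = policy_matrix @ state
            state = 0.9 * state + 0.1 * rng.standard_normal(STATE_DIM)
            total_reward += -float(action @ action)  # toy quadratic reward
        return total_reward


    theta = np.zeros((ACTION_DIM, STATE_DIM))  # linear policy, as in ARS

    for iteration in range(10):
        deltas = rng.standard_normal((N_DIRECTIONS, ACTION_DIM, STATE_DIM))

        # Fitness scores for each perturbed policy pair. In plain ARS these are
        # fresh on-policy rollouts; the paper's variants would rank the
        # directions with off-policy estimates instead, saving interactions.
        plus_rewards = np.array([rollout(theta + NOISE_STD * d) for d in deltas])
        minus_rewards = np.array([rollout(theta - NOISE_STD * d) for d in deltas])

        # Rank directions by the better of their two scores; keep the top b.
        scores = np.maximum(plus_rewards, minus_rewards)
        top = np.argsort(scores)[-TOP_B:]

        # ARS update uses only the surviving directions.
        reward_std = np.concatenate([plus_rewards[top], minus_rewards[top]]).std() + 1e-8
        grad = sum((plus_rewards[i] - minus_rewards[i]) * deltas[i] for i in top)
        theta += STEP_SIZE / (TOP_B * reward_std) * grad

    Since the discarded directions never contribute to the update, any ranking scheme that avoids fresh rollouts for them directly reduces environment interactions, which is the source of the data savings reported in the abstract.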

    Keywords
    Reinforcement learning, Evolutionary strategies, Off-policy ranking, ARS, TRES
    Published
    2024-01-03
    Appears in
    SpringerLink
    http://dx.doi.org/10.1007/978-3-031-48885-6_3
    Copyright © 2023–2025 ICST
    Indexed in: EBSCO, ProQuest, DBLP, DOAJ, Portico