Collaborative Computing: Networking, Applications and Worksharing. 18th EAI International Conference, CollaborateCom 2022, Hangzhou, China, October 15-16, 2022, Proceedings, Part I

Research Article

Learning Dialogue Policy Efficiently Through Dyna Proximal Policy Optimization

Cite

BibTeX:

    @INPROCEEDINGS{10.1007/978-3-031-24383-7_22,
        author={Chenping Huang and Bin Cao},
        title={Learning Dialogue Policy Efficiently Through Dyna Proximal Policy Optimization},
        proceedings={Collaborative Computing: Networking, Applications and Worksharing. 18th EAI International Conference, CollaborateCom 2022, Hangzhou, China, October 15-16, 2022, Proceedings, Part I},
        proceedings_a={COLLABORATECOM},
        year={2023},
        month={1},
        keywords={Dialogue policy, Reinforcement learning, World model deactivation, PPO},
        doi={10.1007/978-3-031-24383-7_22}
    }

Plain text:

    Chenping Huang
    Bin Cao
    Year: 2023
    Learning Dialogue Policy Efficiently Through Dyna Proximal Policy Optimization
    COLLABORATECOM
    Springer
    DOI: 10.1007/978-3-031-24383-7_22
Chenping Huang1,*, Bin Cao1
  • 1: College of Computer Science and Technology, Zhejiang University of Technology
*Contact email: huangchenping@zjut.edu.cn

Abstract

Many methods have been proposed in recent years to train dialogue policies for task-oriented dialogue systems with reinforcement learning. However, the high cost of interacting with users has seriously hindered progress in this field. To reduce this interaction cost, the Deep Dyna-Q (DDQ) algorithm and several variants introduce a so-called world model to simulate the user's responses and then use the generated simulated dialogue data to train the dialogue policy. Nevertheless, these methods suffer from two main issues. The first is limited training efficiency due to the Deep Q-Network (DQN) they use. The second is that low-quality simulated dialogue data generated by the world model may hurt the performance of the dialogue policy. To address these drawbacks, we propose the Dyna Proximal Policy Optimization (DPPO) algorithm. DPPO combines the Proximal Policy Optimization (PPO) algorithm with the world model and uses a deactivation strategy to decide when to stop using the world model for subsequent training. We have conducted experiments on the task of movie ticket booking. The results show that our algorithm combines the advantages of DDQ and PPO: it significantly reduces the interaction cost required during training and achieves a higher task success rate.
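The full training procedure is given in the paper itself, not on this page. As a reading aid, the following is a minimal, hypothetical sketch of a Dyna-style loop matching the abstract's description: direct PPO updates on real dialogues, additional planning updates on dialogues simulated by the world model, and a deactivation rule that switches the world model off. All function names, data shapes, and the plateau-based deactivation criterion are illustrative assumptions, not the authors' implementation.

import random

def run_real_dialogues(policy, n_episodes):
    """Interact with the (real or simulated) user; return transitions ending in a success flag."""
    # Placeholder: each transition is (state, action, reward, next_state, success).
    return [("s", "a", random.random(), "s'", True) for _ in range(n_episodes)]

def update_world_model(world_model, real_transitions):
    """Fit the world model on real experience (supervised learning, as in DDQ-style methods)."""
    return world_model  # placeholder

def run_simulated_dialogues(policy, world_model, n_episodes):
    """Generate synthetic transitions by letting the policy interact with the world model."""
    return [("s", "a", random.random(), "s'", True) for _ in range(n_episodes)]

def ppo_update(policy, transitions):
    """One PPO update (clipped surrogate objective) on the given batch."""
    return policy  # placeholder

def should_deactivate(success_rates, window=5, threshold=0.01):
    """Hypothetical deactivation rule: stop planning once the real success rate plateaus."""
    if len(success_rates) < 2 * window:
        return False
    recent = sum(success_rates[-window:]) / window
    earlier = sum(success_rates[-2 * window:-window]) / window
    return recent - earlier < threshold

def train(policy, world_model, epochs=50, real_eps=10, planning_eps=30):
    world_model_active = True
    success_history = []
    for _ in range(epochs):
        # 1) Direct RL: collect real experience and update the policy with PPO.
        real = run_real_dialogues(policy, real_eps)
        policy = ppo_update(policy, real)
        success_history.append(sum(t[-1] for t in real) / len(real))

        if world_model_active:
            # 2) Update the world model on the real transitions.
            world_model = update_world_model(world_model, real)
            # 3) Planning: also train the policy on world-model-generated dialogues.
            simulated = run_simulated_dialogues(policy, world_model, planning_eps)
            policy = ppo_update(policy, simulated)
            # 4) Deactivation: stop using the world model once simulated data stops helping.
            if should_deactivate(success_history):
                world_model_active = False
    return policy

if __name__ == "__main__":
    train(policy=None, world_model=None)

The structure mirrors the trade-off the abstract describes: simulated dialogues cut the number of real user interactions early in training, while the deactivation step prevents low-quality simulated data from degrading the policy later on.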

Keywords
Dialogue policy, Reinforcement learning, World model deactivation, PPO
Published
2023-01-25
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-24383-7_22