Collaborative Computing: Networking, Applications and Worksharing. 18th EAI International Conference, CollaborateCom 2022, Hangzhou, China, October 15-16, 2022, Proceedings, Part I

Research Article

Learning Dialogue Policy Efficiently Through Dyna Proximal Policy Optimization

Cite

BibTeX:

    @INPROCEEDINGS{10.1007/978-3-031-24383-7_22,
        author={Chenping Huang and Bin Cao},
        title={Learning Dialogue Policy Efficiently Through Dyna Proximal Policy Optimization},
        proceedings={Collaborative Computing: Networking, Applications and Worksharing. 18th EAI International Conference, CollaborateCom 2022, Hangzhou, China, October 15-16, 2022, Proceedings, Part I},
        proceedings_a={COLLABORATECOM},
        year={2023},
        month={1},
        keywords={Dialogue policy, Reinforcement learning, World model deactivation, PPO},
        doi={10.1007/978-3-031-24383-7_22}
    }

Plain text:

    Chenping Huang
    Bin Cao
    Year: 2023
    Learning Dialogue Policy Efficiently Through Dyna Proximal Policy Optimization
    COLLABORATECOM
    Springer
    DOI: 10.1007/978-3-031-24383-7_22
Chenping Huang1,*, Bin Cao1
  • 1: College of Computer Science and Technology, Zhejiang University of Technology
*Contact email: huangchenping@zjut.edu.cn

Abstract

Many methods have been proposed in recent years to train dialogue policies for task-oriented dialogue systems with reinforcement learning. However, the high cost of interacting with users has seriously hindered progress in this field. To reduce this interaction cost, the Deep Dyna-Q (DDQ) algorithm and several variants introduce a so-called world model to simulate the user's responses and then use the generated simulated dialogue data to train the dialogue policy. Nevertheless, these methods suffer from two main issues. The first is limited training efficiency due to the Deep Q-Network (DQN) they use. The second is that low-quality simulated dialogue data generated by the world model may hurt the performance of the dialogue policy. To address these drawbacks, we propose the Dyna Proximal Policy Optimization (DPPO) algorithm. DPPO combines the Proximal Policy Optimization (PPO) algorithm with the world model and uses a deactivation strategy to decide when to stop using the world model for subsequent training. We have conducted experiments on the task of movie ticket booking. The results show that our algorithm combines the advantages of DDQ and PPO: it significantly reduces the interaction cost required during training and achieves a higher task success rate.
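The full training procedure is given in the paper itself, not on this page. As a reading aid, the following is a minimal, hypothetical sketch of a Dyna-style loop matching the abstract's description: direct PPO updates on real dialogues, additional planning updates on dialogues simulated by the world model, and a deactivation rule that switches the world model off. All function names, data shapes, and the plateau-based deactivation criterion are illustrative assumptions, not the authors' implementation.

import random

def run_real_dialogues(policy, n_episodes):
    """Interact with the (real or simulated) user; return transitions ending in a success flag."""
    # Placeholder: each transition is (state, action, reward, next_state, success).
    return [("s", "a", random.random(), "s'", True) for _ in range(n_episodes)]

def update_world_model(world_model, real_transitions):
    """Fit the world model on real experience (supervised learning, as in DDQ-style methods)."""
    return world_model  # placeholder

def run_simulated_dialogues(policy, world_model, n_episodes):
    """Generate synthetic transitions by letting the policy interact with the world model."""
    return [("s", "a", random.random(), "s'", True) for _ in range(n_episodes)]

def ppo_update(policy, transitions):
    """One PPO update (clipped surrogate objective) on the given batch."""
    return policy  # placeholder

def should_deactivate(success_rates, window=5, threshold=0.01):
    """Hypothetical deactivation rule: stop planning once the real success rate plateaus."""
    if len(success_rates) < 2 * window:
        return False
    recent = sum(success_rates[-window:]) / window
    earlier = sum(success_rates[-2 * window:-window]) / window
    return recent - earlier < threshold

def train(policy, world_model, epochs=50, real_eps=10, planning_eps=30):
    world_model_active = True
    success_history = []
    for _ in range(epochs):
        # 1) Direct RL: collect real experience and update the policy with PPO.
        real = run_real_dialogues(policy, real_eps)
        policy = ppo_update(policy, real)
        success_history.append(sum(t[-1] for t in real) / len(real))

        if world_model_active:
            # 2) Update the world model on the real transitions.
            world_model = update_world_model(world_model, real)
            # 3) Planning: also train the policy on world-model-generated dialogues.
            simulated = run_simulated_dialogues(policy, world_model, planning_eps)
            policy = ppo_update(policy, simulated)
            # 4) Deactivation: stop using the world model once simulated data stops helping.
            if should_deactivate(success_history):
                world_model_active = False
    return policy

if __name__ == "__main__":
    train(policy=None, world_model=None)

The structure mirrors the trade-off the abstract describes: simulated dialogues cut the number of real user interactions early in training, while the deactivation step prevents low-quality simulated data from degrading the policy later on.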

Keywords
Dialogue policy, Reinforcement learning, World model deactivation, PPO
Published
2023-01-25
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-24383-7_22