
Research Article
A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning
@INPROCEEDINGS{10.1007/978-3-031-54531-3_21,
  author={Yanguo Zeng and Meiting Xue and Peiran Xu and Yukun Shi and Kaisheng Zeng and Jilin Zhang and Lupeng Yue},
  title={A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning},
  proceedings={Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part III},
  proceedings_a={COLLABORATECOM PART 3},
  year={2024},
  month={2},
  keywords={Distributed Machine Learning; Synchronous Parallel; Communication Prediction; Collaborative Computing},
  doi={10.1007/978-3-031-54531-3_21}
}
- Yanguo Zeng
- Meiting Xue
- Peiran Xu
- Yukun Shi
- Kaisheng Zeng
- Jilin Zhang
- Lupeng Yue
Year: 2024
A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning
COLLABORATECOM PART 3
Springer
DOI: 10.1007/978-3-031-54531-3_21
Abstract
With the spread of machine learning into fields such as medical care and smart manufacturing, the volume of data has exploded, and training a deep learning model on large-scale data with the limited resources of a single device has become a challenge. Distributed machine learning, in which a parameter server and multiple clients train a model collaboratively, is an effective way to address this problem, but it requires extensive communication between devices whose communication resources are limited. The stale synchronous parallel method is a mainstream communication scheme for this setting; however, because its delay threshold is set by the user based on experience, an inappropriate value often leads to high synchronization delay and low computing efficiency. This paper proposes a synchronous parallel method with parameters communication prediction for distributed machine learning. It predicts the optimal timing for synchronization, which avoids the long synchronization waits caused by inappropriate threshold settings in the stale synchronous parallel method. Moreover, it allows fast nodes to continue local training while global synchronization is in progress, improving the resource utilization of worker nodes. Experimental results show that, compared with the delayed synchronous parallel method, our method significantly improves training time, model quality, and resource utilization.
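For readers unfamiliar with the baseline the abstract criticizes, the stale synchronous parallel (SSP) scheme can be summarized by a single check: a worker may run ahead of the slowest worker by at most a user-chosen staleness threshold before it must block and wait for synchronization. The sketch below is a generic illustration of that check, assuming integer iteration clocks per worker; the function and parameter names are hypothetical and this is not the paper's implementation.

```python
def may_proceed(worker_clock: int, all_clocks: list[int], staleness: int) -> bool:
    """SSP admission check: a worker may start its next iteration only if
    it is no more than `staleness` iterations ahead of the slowest worker.
    Otherwise it must wait for global synchronization to catch up."""
    slowest = min(all_clocks)
    return worker_clock - slowest <= staleness

# With threshold 3, a worker at iteration 10 may proceed while the
# slowest worker is at iteration 7 (gap 3), but must wait if the
# slowest worker is still at iteration 6 (gap 4 > 3).
```

The abstract's point is that a fixed `staleness` chosen by hand is often wrong for a given workload, which is what the proposed prediction-based method replaces.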