Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part III

Research Article

A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning

Cite (BibTeX)
@INPROCEEDINGS{10.1007/978-3-031-54531-3_21,
    author={Yanguo Zeng and Meiting Xue and Peiran Xu and Yukun Shi and Kaisheng Zeng and Jilin Zhang and Lupeng Yue},
    title={A Synchronous Parallel Method with Parameters Communication Prediction for Distributed Machine Learning},
    proceedings={Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part III},
    proceedings_a={COLLABORATECOM PART 3},
    year={2024},
    month={2},
    keywords={Distributed Machine Learning; Synchronous Parallel; Communication Prediction; Collaborative Computing},
    doi={10.1007/978-3-031-54531-3_21}
}
Yanguo Zeng1, Meiting Xue2,*, Peiran Xu1, Yukun Shi2, Kaisheng Zeng, Jilin Zhang1, Lupeng Yue1
  • 1: School of Computer Science and Technology, Hangzhou Dianzi University
  • 2: School of Cyberspace
*Contact email: munuan@hdu.edu.cn

Abstract

With the development of machine learning in fields such as medical care and smart manufacturing, data volumes have exploded. Training a deep learning model for different application domains on large-scale data with the limited resources of a single device is a challenge. Distributed machine learning, in which a parameter server and multiple clients train a model collaboratively, is an effective way to solve this problem. However, it requires frequent communication between devices whose communication resources are limited. The stale synchronous parallel method is a mainstream approach to reducing this communication cost, but it often suffers from high synchronization delay and low computing efficiency because the staleness threshold is set by the user based on experience and is frequently inappropriate. This paper proposes a synchronous parallel method with parameters communication prediction for distributed machine learning. It predicts the optimal timing for synchronization, which avoids the long synchronization waits caused by inappropriate threshold settings in the stale synchronous parallel method. Moreover, it allows fast nodes to continue local training while global synchronization is in progress, which improves the resource utilization of worker nodes. Experimental results show that, compared with the stale synchronous parallel method, our method significantly improves training time, model quality, and resource usage.
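The abstract contrasts two synchronization decisions: the classic stale synchronous parallel (SSP) rule with a hand-tuned staleness threshold, and a predicted synchronization point that replaces that threshold. The sketch below is a minimal illustration of this contrast, assuming a simple moving-average timing heuristic; the function names, timings, and the predictor itself are illustrative assumptions, not the paper's actual prediction model.

    # Minimal sketch of the two synchronization decisions described above.
    # The moving-average predictor and all timings here are illustrative
    # assumptions; the paper's actual prediction model is not reproduced.

    def should_block_ssp(my_clock: int, slowest_clock: int, staleness: int = 3) -> bool:
        """Classic SSP rule: a fast worker blocks once it runs more than
        `staleness` iterations ahead of the slowest worker. The threshold
        is set by hand, which is the problem the paper targets."""
        return my_clock - slowest_clock > staleness

    def predict_sync_now(recent_comm_s: list[float], recent_step_s: list[float]) -> bool:
        """Hypothetical predicted-timing rule: trigger global synchronization
        when the expected communication cost no longer exceeds the cost of
        one more local step, so fast workers keep training instead of idling."""
        avg_comm = sum(recent_comm_s) / len(recent_comm_s)
        avg_step = sum(recent_step_s) / len(recent_step_s)
        return avg_comm <= avg_step

    if __name__ == "__main__":
        # A fast worker 5 steps ahead of the straggler blocks under SSP...
        print(should_block_ssp(my_clock=12, slowest_clock=7))        # True
        # ...while the timing-based rule synchronizes as soon as observed
        # communication time drops below per-step compute time.
        print(predict_sync_now([0.8, 0.9, 0.85], [1.0, 1.1, 0.95]))  # True

Under this reading, the prediction replaces a static, experience-based bound with a data-driven estimate of when waiting becomes cheaper than computing on stale parameters, which is also what lets fast nodes overlap local training with the global synchronization step.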

Keywords
Distributed Machine Learning; Synchronous Parallel; Communication Prediction; Collaborative Computing
Published
2024-02-23
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-031-54531-3_21