Research Article
Near-Data Prediction Based Speculative Optimization in a Distribution Environment
@INPROCEEDINGS{10.1007/978-3-030-48513-9_9,
  author={Mingxu Sun and Xueyan Wu and Dandan Jin and Xiaolong Xu and Qi Liu and Xiaodong Liu},
  title={Near-Data Prediction Based Speculative Optimization in a Distribution Environment},
  proceedings={Cloud Computing, Smart Grid and Innovative Frontiers in Telecommunications. 9th EAI International Conference, CloudComp 2019, and 4th EAI International Conference, SmartGIFT 2019, Beijing, China, December 4-5, 2019, and December 21-22, 2019},
  proceedings_a={CLOUDCOMP},
  year={2020},
  month={6},
  keywords={Distributed systems, Hadoop, Speculative execution, Locally weighted regression, Near data prediction},
  doi={10.1007/978-3-030-48513-9_9}
}
Mingxu Sun
Xueyan Wu
Dandan Jin
Xiaolong Xu
Qi Liu
Xiaodong Liu
Year: 2020
CLOUDCOMP
Springer
DOI: 10.1007/978-3-030-48513-9_9
Abstract
Apache Hadoop is an open-source software framework that supports data-intensive distributed applications and is distributed under the Apache 2.0 license. In cloud settings, consumers no longer deal with complex software and hardware configuration but simply pay for cloud services on demand, so the performance of the cloud platform becomes more important in a consumer-centric environment. Slow tasks are distributed unevenly, and the resulting straggling tasks have a great influence on the Hadoop framework. Speculative execution (SE) handles these stragglers by monitoring task progress in real time and copying potential stragglers to a different node, which raises the probability that a backup task finishes before the original one. At present, the performance of the Hadoop system is unsatisfactory because the current SE policy misjudges stragglers and selects inappropriate backup nodes. This paper proposes an optimized SE strategy based on near-data prediction. In the first step, the strategy gathers real-time task execution information and predicts the remaining runtime of each task with a local prediction method. In the second step, it chooses a suitable backup node according to the near data and actual demand. The strategy also includes a cost-effectiveness model intended to maximize the performance of SE. The results show that using this strategy in Hadoop effectively improves the accuracy of alternative (backup) task selection and performs better in heterogeneous Hadoop environments across various situations, which benefits both consumers and the cloud platform.
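The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the "local prediction method" is a locally weighted linear regression (suggested only by the keyword list), that the training samples are (progress fraction, elapsed seconds) pairs, and that a backup is worthwhile only when its estimated runtime is shorter than the predicted remaining time of the original task. The names `lwr_predict`, `pick_backup_node`, the bandwidth `tau`, and the candidate tuple layout are all hypothetical.

```python
import numpy as np


def lwr_predict(x_train, y_train, x_query, tau=0.3):
    """Locally weighted linear regression (assumed prediction method).

    Fits a weighted least-squares line whose Gaussian kernel weights are
    centred on x_query, then evaluates that line at x_query.
    """
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    X = np.column_stack([np.ones_like(x_train), x_train])  # intercept + slope
    w = np.exp(-((x_train - x_query) ** 2) / (2.0 * tau ** 2))
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary lstsq.
    theta, *_ = np.linalg.lstsq(X * sw[:, None], y_train * sw, rcond=None)
    return float(theta[0] + theta[1] * x_query)


def pick_backup_node(remaining, candidates):
    """Choose a backup node (hypothetical selection rule).

    candidates: list of (node_name, est_runtime_seconds, has_local_replica).
    Only nodes expected to finish before the original task are considered;
    among those, data-local nodes are preferred, then the fastest estimate.
    Returns None when no backup is expected to pay off.
    """
    viable = [c for c in candidates if c[1] < remaining]
    if not viable:
        return None
    return min(viable, key=lambda c: (not c[2], c[1]))[0]


# Step 1: predict remaining runtime from observed (progress, elapsed) samples.
progress = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
elapsed = 100.0 * progress  # synthetic, perfectly linear task for illustration
predicted_total = lwr_predict(progress, elapsed, x_query=1.0)
remaining = predicted_total - elapsed[-1]

# Step 2: choose a backup node using the prediction and data locality.
candidates = [
    ("node-a", 60.0, True),   # data-local but slower than the original
    ("node-b", 30.0, False),  # fast but would need remote reads
    ("node-c", 40.0, True),   # data-local and fast enough
]
chosen = pick_backup_node(remaining, candidates)
print(remaining, chosen)
```

Preferring a data-local node over a faster remote one is a design assumption made here to mirror the "near data" idea. A real policy would also weigh the cost-effectiveness model the abstract mentions, which this sketch omits.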