Smart Data Prefetching Using KNN to Improve Hadoop Performance

Rana Ghazali; Douglas G. Down

Research Article

Cite (BibTeX):

@ARTICLE{10.4108/eetsis.9110,
    author={Rana Ghazali and Douglas G. Down},
    title={Smart Data Prefetching Using KNN to Improve Hadoop Performance},
    journal={EAI Endorsed Transactions on Scalable Information Systems},
    volume={12},
    number={3},
    publisher={EAI},
    journal_a={SIS},
    year={2025},
    month={4},
    keywords={Hadoop Performance, Smart prefetch technique, K-Nearest Neighbor Clustering, MapReduce, Machine Learning, Cache Replacement},
    doi={10.4108/eetsis.9110}
}

Cite (plain text): Rana Ghazali, Douglas G. Down (2025). Smart Data Prefetching Using KNN to Improve Hadoop Performance. SIS, EAI. DOI: 10.4108/eetsis.9110

Rana Ghazali¹,*, Douglas G. Down²

1: Islamic Azad University, Tehran
2: McMaster University

*Contact email: ghazalir@mcmaster.ca

Abstract

Hadoop is an open-source framework that enables the parallel processing of large data sets across a cluster of machines. Its performance can suffer from costly I/O operations, network data transmission, and high data access time. In recent years, researchers have explored prefetching techniques that reduce data access time as a potential solution to these problems. Nevertheless, several issues must be addressed to optimize the prefetching mechanism: launching the prefetch at an appropriate time to avoid conflicts with other operations and minimize waiting time, determining the amount of prefetched data to avoid overload and underload, and placing the prefetched data in locations that can be accessed efficiently when required. In this paper, we propose a smart prefetch mechanism consisting of three phases designed to address these issues. First, we enhance the task progress rate to calculate the optimal time for triggering prefetch operations. Second, we use K-Nearest Neighbor clustering to identify which data blocks should be prefetched in each round. Third, we employ the data locality feature to determine the placement of prefetched data. Our experimental results demonstrate that the proposed smart prefetch mechanism improves job execution time by an average of 28.33% by increasing the rate of local tasks.
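The trigger-timing and block-selection ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the progress-rate formula, the features used by KNN, or the training data. The sketch assumes a simple linear progress model (remaining time extrapolated from progress so far), a hypothetical `prefetch_cost_s` estimate for how long fetching a block takes, and hypothetical two-dimensional block features such as normalized access frequency and recency.

```python
import math
from collections import Counter


def remaining_time(progress, elapsed_s):
    """Estimate seconds left for a task, assuming progress grows linearly.

    `progress` is a score in (0, 1]; `elapsed_s` is the run time so far.
    This linear model is an assumption, not the paper's enhanced rate.
    """
    if not 0.0 < progress <= 1.0:
        raise ValueError("progress must be in (0, 1]")
    return elapsed_s * (1.0 - progress) / progress


def should_trigger_prefetch(progress, elapsed_s, prefetch_cost_s):
    """Trigger once the task is expected to finish within the fetch time.

    `prefetch_cost_s` is a hypothetical estimate of the block fetch time.
    """
    return remaining_time(progress, elapsed_s) <= prefetch_cost_s


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to `query`.

    `train` is a list of (feature_tuple, label) pairs; distances are Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled blocks: (access frequency, recency), both normalized.
history = [
    ((0.9, 0.8), "prefetch"),
    ((0.8, 0.9), "prefetch"),
    ((0.7, 0.7), "prefetch"),
    ((0.1, 0.2), "skip"),
    ((0.2, 0.1), "skip"),
]

if should_trigger_prefetch(progress=0.8, elapsed_s=40, prefetch_cost_s=12):
    print(knn_predict(history, (0.85, 0.75)))
```

Here the task at 80% progress after 40 seconds has roughly 10 seconds left, which is within the assumed 12-second fetch cost, so the candidate block is classified by a vote among its three nearest labeled neighbors. Placement by data locality, the third phase, is not modeled.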

Keywords: Hadoop Performance, Smart prefetch technique, K-Nearest Neighbor Clustering, MapReduce, Machine Learning, Cache Replacement

Received: 2024-08-28
Accepted: 2024-11-01
Published: 2025-04-17
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eetsis.9110

Copyright © 2025 R. Ghazali et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.
