
Research Article
A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents
@INPROCEEDINGS{10.1007/978-3-030-69992-5_2,
  author={Wei Hao and Kiminori Matsuzaki and Shigeyuki Sato},
  title={A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents},
  proceedings={Cloud Computing. 10th EAI International Conference, CloudComp 2020, Qufu, China, December 11-12, 2020, Proceedings},
  proceedings_a={CLOUDCOMP},
  year={2021},
  month={2},
  keywords={Large XML documents, XPath querying, Dual-index, Data representation, Parallel computing},
  doi={10.1007/978-3-030-69992-5_2}
}
Wei Hao
Kiminori Matsuzaki
Shigeyuki Sato
Year: 2021
CLOUDCOMP
Springer
DOI: 10.1007/978-3-030-69992-5_2
Abstract
Although XML processing has been intensively studied in recent years, designing efficient implementations for evaluating XPath queries remains a challenge when the XML documents are very large. In this study, we implemented a tree-shaped data structure called partial tree, which is intrinsically suitable for processing large XML documents on multiple computers. Our implementation uses two index sets to accelerate the evaluation of structural relationships among nodes, making it highly efficient for processing very large XML documents with three important classes of XPath queries: backward, order-aware, and predicate-containing queries. Experimental results show that our implementation outperforms BaseX, a state-of-the-art XML database, in both absolute loading time and execution time for the target queries. With 32 EC2 instances, the absolute execution time over 358 GB of XML data is only seconds on average.