A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents

Wei Hao; Kiminori Matsuzaki; Shigeyuki Sato

Cloud Computing. 10th EAI International Conference, CloudComp 2020, Qufu, China, December 11-12, 2020, Proceedings

Research Article

@INPROCEEDINGS{10.1007/978-3-030-69992-5_2,
    author={Wei Hao and Kiminori Matsuzaki and Shigeyuki Sato},
    title={A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents},
    proceedings={Cloud Computing. 10th EAI International Conference, CloudComp 2020, Qufu, China, December 11-12, 2020, Proceedings},
    proceedings_a={CLOUDCOMP},
    year={2021},
    month={2},
    keywords={Large XML documents, XPath querying, Dual-index, Data representation, Parallel computing},
    doi={10.1007/978-3-030-69992-5_2}
}

Wei Hao, Kiminori Matsuzaki, Shigeyuki Sato (2021). A Dual-Index Based Representation for Processing XPath Queries on Very Large XML Documents. CLOUDCOMP. Springer. DOI: 10.1007/978-3-030-69992-5_2

Wei Hao¹,*, Kiminori Matsuzaki², Shigeyuki Sato²

1: Anhui University of Science and Technology, Taifeng Avenue 168
2: Kochi University of Technology, 185 Miyanokuchi, Tosayamada, Kami

*Contact email: whao@aust.edu.cn

Abstract

Although XML processing has been intensively studied in recent years, designing efficient implementations for evaluating XPath queries remains a challenge when the XML documents are very large. In this study, we implemented a tree-shaped data structure called a partial tree, which is intrinsically suitable for processing large XML documents on multiple computers. Our implementation uses two index sets to accelerate the evaluation of structural relationships among nodes, making it highly efficient for processing very large XML documents for three important classes of XPath queries: backward, order-aware, and predicate-containing queries. Experimental results show that our implementation outperforms the state-of-the-art XML database BaseX in both absolute loading time and execution time for the target queries. On average, the absolute execution time over 358 GB of XML data is only seconds when using 32 EC2 instances.
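To make the three query classes named in the abstract concrete, here is a minimal single-machine sketch in Python. It is not the paper's partial tree or its two index sets. The toy document, the `parent_of` map, and the helper names are all assumptions made purely for illustration, and the XPath expressions in the docstrings are only rough equivalents.

```python
import xml.etree.ElementTree as ET

# Toy document (hypothetical, not taken from the paper).
root = ET.fromstring("<a><b><c/></b><d/><b><e/></b></a>")

# ElementTree keeps no parent pointers, so build a child -> parent map once.
# This plays the role of a very simple structural index.
parent_of = {child: node for node in root.iter() for child in node}


def parents_named(tag, parent_tag):
    """Backward query, roughly //tag/parent::parent_tag."""
    result = []
    for node in root.iter(tag):
        p = parent_of.get(node)
        if p is not None and p.tag == parent_tag and p not in result:
            result.append(p)
    return result


def following_siblings(tag, sibling_tag):
    """Order-aware query, roughly //tag/following-sibling::sibling_tag."""
    result = []
    for node in root.iter(tag):
        p = parent_of.get(node)
        if p is None:
            continue  # the root has no siblings
        after = False
        for s in p:
            if s is node:
                after = True  # everything from here on follows `node`
            elif after and s.tag == sibling_tag and s not in result:
                result.append(s)
    return result


def with_child(tag, child_tag):
    """Predicate query, roughly //tag[child_tag]."""
    return root.findall(f".//{tag}[{child_tag}]")
```

For example, `parents_named("c", "b")` returns the first `b` element, `following_siblings("d", "b")` returns the second `b`, and `with_child("b", "c")` returns the first `b` again. Backward and order-aware steps need information about parents and sibling order that a plain top-down traversal does not carry, which is one reason precomputed indexes can help with these query classes.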

Keywords: Large XML documents, XPath querying, Dual-index, Data representation, Parallel computing

Published: 2021-02-13
Appears in: SpringerLink

http://dx.doi.org/10.1007/978-3-030-69992-5_2
