How Good is Query Optimizer in Spark?

Zujie Ren; Na Yun; Youhuizi Li; Jian Wan; Yuan Wang; Lihua Yu; Xinxin Fan

Collaborative Computing: Networking, Applications and Worksharing. 14th EAI International Conference, CollaborateCom 2018, Shanghai, China, December 1-3, 2018, Proceedings

Research Article

Cite: BibTeX Plain Text

@INPROCEEDINGS{10.1007/978-3-030-12981-1_42,
    author={Zujie Ren and Na Yun and Youhuizi Li and Jian Wan and Yuan Wang and Lihua Yu and Xinxin Fan},
    title={How Good is Query Optimizer in Spark?},
    proceedings={Collaborative Computing: Networking, Applications and Worksharing. 14th EAI International Conference, CollaborateCom 2018, Shanghai, China, December 1-3, 2018, Proceedings},
    proceedings_a={COLLABORATECOM},
    year={2019},
    month={2},
    keywords={Spark SQL Catalyst Query optimization},
    doi={10.1007/978-3-030-12981-1_42}
}

Zujie Ren, Na Yun, Youhuizi Li, Jian Wan, Yuan Wang, Lihua Yu, Xinxin Fan (2019). How Good is Query Optimizer in Spark? COLLABORATECOM. Springer. DOI: 10.1007/978-3-030-12981-1_42

Zujie Ren¹,*, Na Yun¹, Youhuizi Li¹, Jian Wan², Yuan Wang³, Lihua Yu³, Xinxin Fan³

1: Hangzhou Dianzi University
2: Zhejiang University of Science and Technology
3: NetEase (Hangzhou) Network Co., Ltd.

*Contact email: renzju@gmail.com

Abstract

In the big data community, Spark plays an important role and is widely used to process interactive queries. Spark employs a query optimizer, called Catalyst, to translate SQL queries into optimized query execution plans. Catalyst contains a number of optimization rules and also supports cost-based optimization. Although query optimization techniques have been well studied in the field of relational database systems, the effectiveness of Catalyst in Spark is still unclear. In this paper, we investigated the effectiveness of rule-based and cost-based optimization in Catalyst, and conducted a set of comparative experiments by varying the data volume and the number of nodes. We found that, even with query optimizations applied, the execution time of most TPC-H queries was only slightly reduced. We also made some interesting observations on Catalyst, which can help the community better understand and improve the query optimizer in Spark.

Keywords: Spark SQL Catalyst Query optimization

Published: 2019-02-07
Appears in: SpringerLink

DOI link: http://dx.doi.org/10.1007/978-3-030-12981-1_42