A method for integrating GIS and big data platforms

Geographic Information Systems (GIS) have played an important role in many applications of daily life since the 1970s. Recently, with the rapid development of new technologies, the volume of Earth data has grown explosively. Many studies have proposed extending big data platforms with spatial data storage and processing. GIS users, however, still need a way to work with large data sets using traditional tools. This paper proposes a method to integrate ArcMap with Apache Hadoop and its ecosystem. The method has two phases: creating a database in Apache Hive and querying it. Two tools following the proposed method have been developed for illustration purposes. Experimental results on a data set of taxi trips over one year show that the method impressively improves query performance.

Received on 01 July 2021; accepted on 09 July 2021; published on 14 July 2021


Introduction
Geographic Information System (GIS) is a system for storing and analysing Earth's data that became widely familiar in the 1970s. Since then, it has played an important role and been used in many applications of everyday life. Various types of data sources can be ingested into a GIS to produce maps with many data layers. GIS lets users perform deep analysis within an area of interest and over a period of time. For these reasons, many tools, both open source and commercial, have been created to support GIS users. Among them, ArcGIS is one of the most popular commercial GIS solutions, provided by Esri [3]. ArcGIS includes many applications that allow users to work with GIS. GIS data also grows exponentially in size, which requires GIS frameworks and tools to adapt. Therefore, researchers have been working to leverage GIS with big data technologies.
Big data is a recently emerging term describing data whose volume and complexity exceed the capabilities of traditional systems. Big data is commonly characterized by the 5Vs: volume, velocity, veracity, variety, and value. Because the sources vary from structured to unstructured data, arrive in huge volumes, and are generated at high speed, appropriate storage and processing platforms such as Apache Hadoop, Apache Spark [2], and their ecosystem components are needed. Apache Hadoop and Spark are open source platforms maintained by the Apache Software Foundation, consisting of components for big data storage, processing, and analysis.

The advancement of these technologies motivates the adoption of big data in the next generation of GIS. Many researchers have worked on combining and integrating big data technologies and GIS. Peng Yue et al. [11] introduced the term "BigGIS" together with several key considerations for its development. Research work [6] investigated an extension of Hadoop for spatial processing. Jia Yu et al. [10] introduced an in-memory computing framework for large-scale spatial data processing. Dong Xie et al. [9] presented Simba (Spatial In-Memory Big data Analytics), offering large-scale in-memory spatial queries. Even though several remarkable results have been achieved, there is still a need for traditional GIS desktop tools such as ArcMap to interact with big data platforms. Such integration would give GIS desktop users the benefits of new technologies while preserving the classic tools they are familiar with. In the same direction, Esri's team initially developed a set of tools for processing spatial data in Hadoop. It, however, still requires considerable Hadoop expertise that is unfamiliar to GIS users. To complement this, the article presents an approach to seamlessly integrate ArcMap with Apache Hive. The contributions of the paper are (i) a method to integrate ArcMap queries with Apache Hive queries seamlessly; and (ii) a set of ArcMap Python tools that create and query big data in Apache Hive.

* Corresponding author. Email: lehonganh@humg.edu.vn

EAI Endorsed Transactions on Context-aware Systems and Applications | 12 2020 - 07 2021 | Volume 7 | Issue 23 | e5
The article is structured as follows. Section 2 summarizes the related work. Section 3 provides background on GIS, ArcGIS, and Apache Hive. The proposed integration approach is presented in Section 4. The development of the tool set for the taxi trips case study is described in Section 5. Finally, Section 6 concludes the article and presents future work.

Related Work
Esri already provides a set of GIS extension tools for Hadoop [4]. Our tool is developed based on their work. The difference is that this paper proposes a generic integration method with detailed steps and a higher degree of automation.
Shaohua et al. [8] proposed and developed an integrated GIS platform with many of the latest big data technologies, called SuperMap GIS. The platform is integrated and compatible with powerful platforms such as Spark, Kafka, etc. However, this product is not open source and belongs to the SuperMap company, which only provides SuperMap GIS for a 90-day trial.
Lopez Vega, M.A. et al. [7] developed a near real-time environment monitoring system based on GOES-R satellite imagery. The system focused on storing satellite imagery data rather than processing it.
Ahmed Eldawy and Mohamed F. Mokbel [6] proposed a spatial extension of Hadoop named SpatialHadoop. It is an open source platform providing native support for spatial data types and operations, and it adopts several spatial structures such as Grid, R-tree, and R+-tree. In the same direction, Ablimit Aji et al. [5] proposed Hadoop-GIS, a large-scale warehouse for spatial data that supports multiple types of spatial queries on MapReduce.

GIS and ArcGIS
GIS has three major components: data, hardware, and software. A GIS application allows users to work with digital maps, create new data layers to customize the maps, and analyze spatial information. The common functionality of GIS is associating non-geographical information with locations. There are two types of GIS data: vector and raster. The former consists of three types: point, line, and polygon data. The latter has three types of data sets: imagery, spectral, and thematic data. Due to the huge demand for GIS, many tools and platforms exist, both commercial and open source.

ArcGIS is a world-wide popular GIS solution from Esri, which provides many products and utilities for GIS users across various technologies such as desktop, web, and cloud computing. Esri has also released several new products that adapt to big data technologies. Traditional tools such as ArcMap, however, are still what many users are familiar with. As mentioned in the previous section, ArcMap users find it difficult to work with big data sets in the ArcGIS desktop environment. To extend the ArcGIS desktop, Esri allows developers to implement ArcMap toolboxes with ModelBuilder and Python. The template for creating a Python toolbox is described in the following snippet.

Apache Hive is an open source warehouse for processing structured data in Hadoop. It was initially developed by Facebook and then transferred to the Apache Software Foundation. Hive stores its schema in a database while processing data in HDFS. It supports a query language similar to SQL, called HiveQL. The basic architecture of Apache Hive is illustrated in Figure 2.
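The Python toolbox template mentioned above follows Esri's standard .pyt conventions. The sketch below shows the required class structure; the class and method names are dictated by the toolbox framework, while the labels and the single parameter are illustrative assumptions, not the paper's actual tool.

```python
# Minimal sketch of an ArcMap Python toolbox (.pyt file). The Toolbox/tool
# class structure and method names follow Esri's template; labels and the
# parameter below are illustrative assumptions.

class Toolbox(object):
    def __init__(self):
        # Name of the toolbox as shown in the ArcMap catalog.
        self.label = "Hive Integration Toolbox"
        self.alias = "hive"
        # Tool classes exposed by this toolbox.
        self.tools = [QueryTool]

class QueryTool(object):
    def __init__(self):
        self.label = "Query Hive Table"
        self.description = "Runs a HiveQL query and returns JSON results."

    def getParameterInfo(self):
        # Parameters are declared with arcpy.Parameter in a real toolbox;
        # arcpy is only available inside the ArcGIS Python environment.
        import arcpy
        table = arcpy.Parameter(displayName="Hive table",
                                name="table",
                                datatype="GPString",
                                parameterType="Required",
                                direction="Input")
        return [table]

    def execute(self, parameters, messages):
        # The tool's logic (e.g. submitting the Hive query) goes here.
        pass
```

Saving such a file with a `.pyt` extension makes ArcMap load it as a toolbox and instantiate the listed tool classes.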

Integration between ArcMap and Hive
This section presents the approach for integration between ArcMap and Apache Hive, which allows ArcMap users to execute a query over a big data set using the Hive engine running in a Hadoop cluster. The overview of the approach is illustrated in Figure 3.
The system consists of two components: an ArcMap desktop that handles spatial analysis tasks, and a Hadoop cluster with Hive installed for storing and processing data. The proposed approach separates GIS features from big data processing. It provides transparency to ArcMap users and does not require them to perform complex tasks manually, such as writing Hive scripts; these scripts are generated automatically. After execution in Hadoop, the query results are sent back to ArcMap in the form of a JSON data file. The integration reuses the advanced distributed storage and processing of Hive, because this capability in ArcMap is limited. GIS users then only have to handle the much smaller result data sets returned from the Hadoop cluster.

To implement the model, the paper extends ArcMap with plugins in the form of toolboxes. Figure 4 shows the detailed steps that integrate an ArcMap toolbox with the Hive engine to query big data sets. The integration consists of two phases: establishing the Hive database and making Hive queries. To create the Hive database from the big data set, we follow the steps below.
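As a hedged illustration of the automatic script generation in the database-creation phase, the sketch below builds a HiveQL statement that registers a CSV directory already in HDFS as an external Hive table, plus a helper that would submit it through PyHive. The table name, column list, and connection details are illustrative assumptions, not the paper's actual code.

```python
# Sketch of automatic HiveQL generation for the database-creation phase.
# Table name, columns, and connection details are illustrative assumptions.

def build_create_table_hql(table, hdfs_dir, columns):
    """Build a CREATE EXTERNAL TABLE statement over a CSV directory in HDFS.

    `columns` is a list of (name, hive_type) pairs.
    """
    cols = ",\n  ".join("{} {}".format(n, t) for n, t in columns)
    return (
        "CREATE EXTERNAL TABLE IF NOT EXISTS {} (\n  {}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        "STORED AS TEXTFILE\n"
        "LOCATION '{}'".format(table, cols, hdfs_dir)
    )

def create_table(hql, host="hadoop-master", port=10000):
    # PyHive submits the generated script to the Hive engine; this only
    # works inside a cluster environment, so the import is kept local.
    from pyhive import hive
    conn = hive.Connection(host=host, port=port)
    try:
        conn.cursor().execute(hql)
    finally:
        conn.close()

hql = build_create_table_hql(
    "taxi_trips", "/data/taxi/2013",
    [("pickup_datetime", "STRING"),
     ("dropoff_datetime", "STRING"),
     ("trip_distance", "DOUBLE")])
```

Using an external table keeps the CSV files in place in HDFS, so no data is copied during the creation phase.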

Data Sources
In this example, we use ArcMap to analyze data and visualize the results over a big data set of taxi trips in New York in 2013. Each month's data is stored in a CSV file of approximately 2GB containing around a few million records. Detailed information on these data files is given in Table 1. The total size of the data source is approximately 23GB, with more than 20 million records. Data of such huge volume causes ArcMap to work very slowly when loading and filtering. To solve this, we follow the proposed method to develop an integration tool with Hadoop and Hive. The data file contains some specific fields, as follows:
• pickup datetime: the pick-up time of the trip.
• dropoff datetime: the drop-off time of the trip.
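As an illustration of how these two fields can be used, a trip duration can be derived from them. The field order and timestamp format below are assumptions based on the public NYC taxi CSVs, not taken from the paper:

```python
# Illustrative only: derive a trip duration from the pickup/dropoff fields.
# The timestamp format is an assumption based on the public NYC taxi CSVs.
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"

def trip_duration_minutes(pickup_datetime, dropoff_datetime):
    """Return trip duration in minutes from the two CSV timestamp fields."""
    pickup = datetime.strptime(pickup_datetime, TS_FORMAT)
    dropoff = datetime.strptime(dropoff_datetime, TS_FORMAT)
    return (dropoff - pickup).total_seconds() / 60.0
```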

Implementation and Results
Toolbox for creating and importing a database in Hive. This tool takes two input parameters. The process flow is described in detail as follows:
• Connect to the HDFS server using a configuration file.
• If the connection from ArcMap succeeds, get the input parameters from the toolbox GUI and perform the next steps.
• Create the HiveQL query for the data with the corresponding input parameters.
• Start PyHive in the toolbox to connect to the Hive engine.
• If the query returns results successfully, export the query result to a JSON file stored in HDFS.
• The ArcMap toolbox copies the JSON file from the HDFS cluster and visualizes the result.

Figure 7 shows the developed toolbox for querying data running in ArcMap.
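The query-generation and export steps above can be sketched as follows. The table name, filter column, result limit, and output path are illustrative assumptions, as is the exact shape of the JSON result file; only the general flow (generate HiveQL, execute via PyHive, dump to JSON) follows the steps listed above.

```python
import json

# Sketch of the querying phase: build a HiveQL filter query from the
# toolbox parameters and dump the fetched rows to a JSON file that
# ArcMap can load. Names and parameters are illustrative assumptions.

def build_query(table, month, limit=100):
    """Top `limit` longest trips in the given month (assumed schema)."""
    return (
        "SELECT pickup_datetime, dropoff_datetime, trip_distance "
        "FROM {} WHERE month(pickup_datetime) = {} "
        "ORDER BY trip_distance DESC LIMIT {}".format(table, month, limit)
    )

def run_and_export(hql, out_path, host="hadoop-master", port=10000):
    # Executed only inside the cluster environment, hence the local import.
    from pyhive import hive
    conn = hive.Connection(host=host, port=port)
    try:
        cur = conn.cursor()
        cur.execute(hql)
        rows = cur.fetchall()
    finally:
        conn.close()
    with open(out_path, "w") as f:
        json.dump(rows, f)

hql = build_query("taxi_trips", 12)
```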

Results.
To evaluate the developed tool, we experimented with two data sets: the data for December alone and the data for the whole year. Figure 8 shows the results of visualizing the 100 longest-distance trips in December and in the whole year.
ArcMap alone takes approximately 2 hours to process the December data, while the developed toolbox completes the task in under 10 minutes. With approximately 26GB of data for the whole year, ArcMap has no way to perform the filter because the data is so huge, while the constructed tool completes the task within around 50 minutes when the source file already exists in HDFS.

Conclusions
This paper proposes a method to develop a toolbox for processing big data in an ArcGIS desktop tool. It provides detailed steps to implement an ArcMap toolbox connecting to the Hive server engine. Two toolboxes have been constructed for illustration purposes. The results show that the tool improves query performance by around ten times on a big data set. The experimental results, however, are still limited because we have not used a Hadoop cluster for computing. Moreover, the query structure is still simple, filtering on a single column of a table. We intend to extend the tool further in future work.