Enriching Geolocalized Dataset with POIs Descriptions at Large Scale

Ibrahima Gueye; Hubert Naacke; Stéphane Gançarski

Innovations and Interdisciplinary Solutions for Underserved Areas. 4th EAI International Conference, InterSol 2020, Nairobi, Kenya, March 8-9, 2020, Proceedings

Research Article

Cite (BibTeX):

@INPROCEEDINGS{10.1007/978-3-030-51051-0_19,
    author={Ibrahima Gueye and Hubert Naacke and St\'{e}phane Gan\c{c}arski},
    title={Enriching Geolocalized Dataset with POIs Descriptions at Large Scale},
    proceedings={Innovations and Interdisciplinary Solutions for Underserved Areas. 4th EAI International Conference, InterSol 2020, Nairobi, Kenya, March 8-9, 2020, Proceedings},
    proceedings_a={INTERSOL},
    year={2020},
    month={8},
    keywords={YFCC large scale dataset, Distributed query processing, Spatial join, Apache Spark, POI recommendation},
    doi={10.1007/978-3-030-51051-0_19}
}

Cite (plain text): Ibrahima Gueye, Hubert Naacke, Stéphane Gançarski. Enriching Geolocalized Dataset with POIs Descriptions at Large Scale. INTERSOL, Springer, 2020. DOI: 10.1007/978-3-030-51051-0_19

Ibrahima Gueye¹,*, Hubert Naacke², Stéphane Gançarski²

1: Ecole Polytechnique de Thiès
2: Sorbonne Université, CNRS, Laboratoire d’Informatique de Paris 6, LIP6

*Contact email: igueye@ept.sn

Abstract

We present an efficient method to enrich a geolocalized dataset with contextual descriptions of Points of Interest (POIs). We implemented our solution using two large-scale datasets: YFCC [14] and Geonames [2]. A practical problem we encountered is the size of the data to be processed: the YFCC geolocalized dataset contains 45 million entries, which we propose to join with 12 million Geonames POIs. We show that using the Apache Spark cluster computing platform and the GeoSpark [18] spatial join library as-is leads to inefficient computation because of the strong bias in the data. We propose a method to distribute the data non-uniformly according to the data bias, which greatly improves spatial join performance. Moreover, we propose a method to select, among a set of close POIs, those that are most relevant to the YFCC entries. The resulting enriched dataset will be made publicly available and should help to better validate future work on large-scale POI recommendation.

Keywords: YFCC large scale dataset, Distributed query processing, Spatial join, Apache Spark, POI recommendation
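
The abstract does not detail how the data is distributed non-uniformly, so the following is only an illustrative sketch of one common way to partition spatially biased data: an adaptive quadtree that keeps splitting a cell until it holds at most a fixed number of points. It is not the authors' method and it does not use Spark or GeoSpark. The function name `quadtree_cells`, the `capacity` parameter, and the sample coordinates are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]              # (longitude, latitude)
Box = Tuple[float, float, float, float]  # (min_lon, min_lat, max_lon, max_lat)


def quadtree_cells(points: List[Point], box: Box, capacity: int,
                   max_depth: int = 12) -> List[Tuple[Box, List[Point]]]:
    """Split `box` recursively until each cell holds at most `capacity` points.

    Hypothetical helper for illustration only: dense regions end up covered
    by many small cells, sparse regions by a few large ones.
    """
    if len(points) <= capacity or max_depth == 0:
        return [(box, points)]
    min_lon, min_lat, max_lon, max_lat = box
    mid_lon = (min_lon + max_lon) / 2
    mid_lat = (min_lat + max_lat) / 2
    # Quadrant order: south-west, south-east, north-west, north-east.
    quadrants = [
        (min_lon, min_lat, mid_lon, mid_lat),
        (mid_lon, min_lat, max_lon, mid_lat),
        (min_lon, mid_lat, mid_lon, max_lat),
        (mid_lon, mid_lat, max_lon, max_lat),
    ]
    buckets: List[List[Point]] = [[] for _ in quadrants]
    for lon, lat in points:
        index = (1 if lon >= mid_lon else 0) + (2 if lat >= mid_lat else 0)
        buckets[index].append((lon, lat))
    cells: List[Tuple[Box, List[Point]]] = []
    for quadrant, bucket in zip(quadrants, buckets):
        cells.extend(quadtree_cells(bucket, quadrant, capacity, max_depth - 1))
    return cells


# Made-up, deliberately biased sample: 100 points packed into a 10 x 10 degree
# area plus 4 isolated points spread over the rest of the globe.
dense = [(float(i), 40.0 + j) for i in range(10) for j in range(10)]
sparse = [(-100.0, -40.0), (100.0, -40.0), (-100.0, 40.0), (150.0, 10.0)]
cells = quadtree_cells(dense + sparse, (-180.0, -90.0, 180.0, 90.0), capacity=20)
print(len(cells), max(len(cell_points) for _, cell_points in cells))
```

In a distributed setting, each resulting cell could serve as one partition, so that no single worker receives a disproportionate share of the dense region. How the paper actually assigns data to partitions is not stated in the abstract.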

Published: 2020-08-06
Appears in: SpringerLink

DOI link: http://dx.doi.org/10.1007/978-3-030-51051-0_19
