3rd International ICST Conference on Scalable Information Systems

Research Article

On the Feasibility of Geographically Distributed Web Crawling

  • @INPROCEEDINGS{10.4108/ICST.INFOSCALE2008.3550,
        author={Berkant Barla Cambazoglu and Flavio Junqueira and Vassilis Plachouras and Luca Telloli},
        title={On the Feasibility of Geographically Distributed Web Crawling},
        proceedings={3rd International ICST Conference on Scalable Information Systems},
        publisher={ICST},
        proceedings_a={INFOSCALE},
        year={2010},
        month={5},
        keywords={Distributed Web crawling, spatial locality, throughput},
        doi={10.4108/ICST.INFOSCALE2008.3550}
    }
    
  • Berkant Barla Cambazoglu
    Flavio Junqueira
    Vassilis Plachouras
    Luca Telloli
    Year: 2010
    On the Feasibility of Geographically Distributed Web Crawling
    INFOSCALE
    ICST
    DOI: 10.4108/ICST.INFOSCALE2008.3550
Berkant Barla Cambazoglu (1,*), Flavio Junqueira (1,*), Vassilis Plachouras (1,*), Luca Telloli (1,*)
  • 1: Yahoo! Research Barcelona, Spain
*Contact email: barla@yahoo-inc.com, fpj@yahoo-inc.com, vassilis@yahoo-inc.com, telloli@yahoo-inc.com

Abstract

We identify the issues that are important in the design of a geographically distributed Web crawler and discuss them in terms of both the benefits they offer and the challenges they pose. More specifically, we focus on the effect of the geographical locality of Web sites on crawling performance and, as a practical study, investigate the feasibility of a distributed crawler in terms of network costs. For this purpose, we conduct experiments to collect network access statistics about the servers in the educational domains of eight countries (USA, Canada, Chile, Brazil, Spain, Portugal, Turkey, and Greece). We gather the statistics with echoping from four sites located in the USA, Brazil, Spain, and Turkey. The results favor geographically distributed Web crawling in terms of crawling throughput.
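
The kind of measurement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' setup and does not use echoping: it times plain HTTP downloads with the standard library as a stand-in. The function names (`time_fetch`, `median_fetch_time`, `closest_site`), the repeat count, and the timeout are assumptions made for illustration only.

```python
import time
import urllib.request
from statistics import median


def time_fetch(url, fetch=None, clock=time.perf_counter):
    """Return the wall-clock seconds taken to download `url` once.

    `fetch` and `clock` are injectable (an assumption of this sketch)
    so the timing logic can be exercised without network access.
    """
    if fetch is None:
        def fetch(u):
            # Plain HTTP GET via the standard library; a stand-in for echoping.
            with urllib.request.urlopen(u, timeout=10) as resp:
                return resp.read()
    start = clock()
    fetch(url)
    return clock() - start


def median_fetch_time(url, repeats=5, fetch=None, clock=time.perf_counter):
    """Median of several timed downloads, to damp one-off network spikes."""
    return median(time_fetch(url, fetch, clock) for _ in range(repeats))


def closest_site(times_by_site):
    """Given {crawling site: measured fetch time}, return the fastest site."""
    return min(times_by_site, key=times_by_site.get)
```

Running `median_fetch_time` against the same server from each crawling site and passing the resulting mapping to `closest_site` would indicate which site reaches that server fastest, which is the locality effect the study examines.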