2nd International IEEE Conference on Communication System Software and Middleware

Research Article

Characterizing the Web Using a New Uniform Sampling Approach

  • @INPROCEEDINGS{10.1109/COMSWA.2007.382558,
        author={Hamid Mousavi and Mohammad E. Rafiei and Ali Movaghar},
        title={Characterizing the Web Using a New Uniform Sampling Approach},
        proceedings={2nd International IEEE Conference on Communication System Software and Middleware},
        publisher={IEEE},
        proceedings_a={COMSWARE},
        year={2007},
        month={7},
        keywords={Uniform Sampling, Web, Web Search Engine},
        doi={10.1109/COMSWA.2007.382558}
    }
    
  • Hamid Mousavi
    Mohammad E. Rafiei
    Ali Movaghar
    Year: 2007
    Characterizing the Web Using a New Uniform Sampling Approach
    COMSWARE
    IEEE
    DOI: 10.1109/COMSWA.2007.382558
Hamid Mousavi¹*, Mohammad E. Rafiei¹*, Ali Movaghar¹*
  • 1: CE Department, University of Tech., Tehran, Iran.
*Contact email: h_mousavig@ce.sharif.edu, rafieig@ce.sharif.edu, movagharg@sharif.edu

Abstract

The Web is one of the biggest sources of information for many people, and it continues to grow. To make the Web easier to use, Web search engines (WSEs) are used frequently. However, little is known about the characteristics of the Web or of WSEs. A common way to analyze these characteristics is to use a uniform sample: instead of working on the entire Web, we work on a small subset that represents it. In this paper, we propose a new method, called bucket-based sampling (BBS), to gather such a small but uniform subset of the Web. Our analyses show that BBS improves sample uniformity by at least 6.95% relative to PAGERANK-SMP, one of the best existing methods. Using samples gathered by BBS, we compare the relative sizes of seven well-known WSEs. We also estimate some important characteristics of the Web. For example, we estimate that the indexable Web contains around 20.14 billion pages.
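
The abstract does not describe how BBS gathers its sample or how the engines are queried, so the following is only a minimal Python sketch of the general idea behind comparing search engines with a uniform sample: if pages are drawn uniformly from the Web, the fraction of sampled pages that an engine has indexed estimates that engine's coverage, and the ratio of two such fractions estimates their relative size. The function name `estimate_relative_coverage`, the toy page names, and the two engines are hypothetical and are not taken from the paper.

```python
import random


def estimate_relative_coverage(sample_urls, engine_indexes):
    """Return, per engine, the fraction of sampled URLs found in its index.

    sample_urls: list of URLs assumed to be a uniform sample of the Web.
    engine_indexes: dict mapping an engine name to the set of URLs it indexes
        (a stand-in for querying a real engine, which is an assumption here).
    """
    n = len(sample_urls)
    if n == 0:
        raise ValueError("sample must not be empty")
    return {
        name: sum(1 for url in sample_urls if url in index) / n
        for name, index in engine_indexes.items()
    }


# Toy data: a hypothetical "Web" of 10,000 pages and two hypothetical engines.
rng = random.Random(0)
web = [f"page{i}" for i in range(10_000)]
engine_indexes = {
    "engine_a": set(web[:6_000]),  # indexes 60% of the toy Web
    "engine_b": set(web[:3_000]),  # indexes 30% of the toy Web
}

# Uniform sample without replacement; in this toy setting it is trivial,
# whereas on the real Web obtaining such a sample is the hard part.
sample = rng.sample(web, 1_000)

coverage = estimate_relative_coverage(sample, engine_indexes)
relative_size = coverage["engine_a"] / coverage["engine_b"]
```

In this toy setting `coverage` should land near the true index fractions and `relative_size` near 2, with the usual sampling error. The estimate is only as good as the uniformity of the sample, which is the property the paper sets out to improve.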