Research Article
Characterizing the Web Using a New Uniform Sampling Approach
@INPROCEEDINGS{10.1109/COMSWA.2007.382558,
  author={Hamid Mousavi and Mohammad E. Rafiei and Ali Movaghar},
  title={Characterizing the Web Using a New Uniform Sampling Approach},
  proceedings={2nd International IEEE Conference on Communication System Software and Middleware},
  publisher={IEEE},
  proceedings_a={COMSWARE},
  year={2007},
  month={7},
  keywords={Uniform Sampling Web Web Search Engine},
  doi={10.1109/COMSWA.2007.382558}
}
- Hamid Mousavi
- Mohammad E. Rafiei
- Ali Movaghar
Year: 2007
COMSWARE
IEEE
DOI: 10.1109/COMSWA.2007.382558
Abstract
The Web is one of the biggest sources of information for many people, and it keeps growing. To make the Web easier to use, Web search engines (WSEs) are used frequently. However, little is known about the characteristics of the Web or of WSEs. A common way to analyze these characteristics is to use a uniform sample: instead of working on the entire Web, we work on a small subset that represents it. In this paper, we propose a new method, called bucket-based sampling (BBS), to gather such a small but uniform subset of the Web. Our analyses show that BBS improves sample uniformity by at least 6.95% compared with PAGERANK-SMP, one of the best existing methods. Using samples gathered by BBS, we compare the relative sizes of seven well-known WSEs. We also estimate some important characteristics of the Web; for example, we estimate that the indexable Web contains around 20.14 billion pages.