10th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing

Research Article

The Impact of User Corrections On A Crawl-Based Digital Library: A CiteSeerX Perspective

  • @INPROCEEDINGS{10.4108/icst.collaboratecom.2014.257563,
        author={Jian Wu and Kyle Williams and Madian Khabsa and C. Giles},
        title={The Impact of User Corrections On A Crawl-Based Digital Library: A CiteSeerX Perspective},
        booktitle={10th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing},
        publisher={IEEE},
        proceedings_a={COLLABORATECOM},
        year={2014},
        month={11},
        keywords={digital library, crowd-sourcing, information extraction, user correction},
        doi={10.4108/icst.collaboratecom.2014.257563}
    }
    
  • Jian Wu
    Kyle Williams
    Madian Khabsa
    C. Giles
    Year: 2014
    The Impact of User Corrections On A Crawl-Based Digital Library: A CiteSeerX Perspective
    COLLABORATECOM
    IEEE
    DOI: 10.4108/icst.collaboratecom.2014.257563
Jian Wu1,*, Kyle Williams1, Madian Khabsa2, C. Giles3
  • 1: IST, Penn State University
  • 2: CSE, Penn State University
  • 3: IST/CSE, Penn State University
*Contact email: fanchyna@gmail.com

Abstract

CiteSeerX is a crawl-based digital library search engine providing free access to more than 4 million academic papers. Because its metadata are extracted automatically from PDF files gathered from various sources, such a digital library inevitably contains incorrectly parsed metadata. CiteSeerX offers a feature that allows registered users to correct paper metadata, including titles, authors, abstracts, publication years, and venues. We claim that user corrections, as a form of crowd collaboration, provide a useful and efficient way to improve metadata quality and the impact of the digital library. To support this claim, we investigate user corrections from the last 5 years and analyze the nature of the corrections, their quality, and their impact on downloads. Furthermore, we propose a credit-based strategy in which users are granted more privileges based on their positive correction activities. We also propose new ways of increasing the visibility of mistakenly extracted metadata to promote user correction.