1st International ICST Conference on Scalable Information Systems

Research Article

CiteSeerχ: a scalable autonomous scientific digital library

  • @INPROCEEDINGS{10.1145/1146847.1146865,
        author={Huajing Li and Isaac G. Councill and Levent Bolelli and Ding Zhou and Yang Song and Wang-Chien Lee and Anand Sivasubramaniam and C. Lee Giles},
        title={CiteSeerχ: a scalable autonomous scientific digital library},
        booktitle={1st International ICST Conference on Scalable Information Systems},
        publisher={ACM},
        proceedings_a={INFOSCALE},
        year={2006},
        month={6},
        keywords={},
        doi={10.1145/1146847.1146865}
    }
    
  • Huajing Li
    Isaac G. Councill
    Levent Bolelli
    Ding Zhou
    Yang Song
    Wang-Chien Lee
    Anand Sivasubramaniam
    C. Lee Giles
    Year: 2006
    CiteSeerχ: a scalable autonomous scientific digital library
    INFOSCALE
    ACM
    DOI: 10.1145/1146847.1146865
Huajing Li1,*, Isaac G. Councill2,3,*, Levent Bolelli1,*, Ding Zhou1,*, Yang Song1,*, Wang-Chien Lee1,*, Anand Sivasubramaniam1,*, C. Lee Giles4,2,3,*
  • 1: Department of Computer Science and Engineering
  • 2: The School of Information Sciences and Technology, Pennsylvania State University
  • 3: State College, PA 16802, USA
  • 4: Department of Computer Science and Engineering
*Contact email: huali@cse.psu.edu, icouncil@ist.psu.edu, bolelli@cse.psu.edu, dzhou@cse.psu.edu, yasong@cse.psu.edu, wlee@cse.psu.edu, anand@cse.psu.edu, giles@ist.psu.edu

Abstract

CiteSeer is a scientific literature digital library and search engine that automatically crawls and indexes scientific documents in the fields of computer and information science. Since its inception in 1997, CiteSeer has grown to index over 730,000 documents and serves over 800,000 requests daily, pushing the limits of the current system's capabilities. In addition, CiteSeer's monolithic architecture complicates system maintenance and reduces the flexibility of the system in terms of new feature development, algorithm updates, and system interoperability. In this paper, we discuss the problems of the current CiteSeer architecture and propose a new architecture for a next-generation CiteSeer application. The new architecture is based on modular web services and pluggable service components. Preliminary results based on a prototype system show that the new architecture enhances flexibility, scalability, and performance for CiteSeer. In addition, new services in development for the next-generation CiteSeer system are discussed.