3rd International ICST Conference on Scalable Information Systems

Research Article

Distributed, Large-Scale Latent Semantic Analysis by Index Interpolation

Download613 downloads
  • @INPROCEEDINGS{10.4108/ICST.INFOSCALE2008.3500,
        author={Sebastiano Vigna},
        title={Distributed, Large-Scale Latent Semantic Analysis by Index Interpolation},
        proceedings={3rd International ICST Conference on Scalable Information Systems},
        publisher={ICST},
        proceedings_a={INFOSCALE},
        year={2010},
        month={5},
        keywords={latent semantic analysis search engine},
        doi={10.4108/ICST.INFOSCALE2008.3500}
    }
    
  • Sebastiano Vigna
    Year: 2010
    Distributed, Large-Scale Latent Semantic Analysis by Index Interpolation
    INFOSCALE
    ICST
    DOI: 10.4108/ICST.INFOSCALE2008.3500
Sebastiano Vigna1,*
  • 1: DSI, Università degli Studi di Milano, Italy
*Contact email: vigna@dsi.unimi.it

Abstract

Latent semantic analysis is a well-known technique to extrapolate concepts from a set of documents; it discards noise by reducing the rank of (a variant of) the term/document matrix of a document collection by singular value decomposition. The latter is performed by solving an equivalent symmetric eigenvector problem on a related matrix. Scaling to large set of documents, however, is problematic because every vector-matrix multiplication required by iterative solvers requires a number of multiplications equal to twice the number of postings of the collection. We show how to combine standard search-engine algorithmic tools in such a way to compute (reasonably) quickly the cooccurrence matrix C of a large document collection, and solve directly the associated symmetric eigenvector problem. Albeit the size of C is quadratic in the number of terms, we can distribute its computation among any number of computational unit without increasing the overall number of multiplications. Moreover, our approach is advantageous when the document collection is large, because the number of terms over which latent semantic analysis has to be performed is inherently limited by the size of a language lexicon. We present experiments over a collection with 3.6 billions of postings---two orders of magnitudes larger than any published experiment in the literature.