First EAI International Conference on Computer Science and Engineering

Research Article

An Extractive Multi-document Summarization System for Malayalam News Documents

  • @INPROCEEDINGS{10.4108/eai.27-2-2017.152340,
        author={Manju K and David Peter S and Sumam Mary Idicula},
        title={An Extractive Multi-document Summarization System for Malayalam News Documents},
        proceedings={First EAI International Conference on Computer Science and Engineering},
        publisher={EAI},
        proceedings_a={COMPSE},
        year={2017},
        month={3},
        keywords={Multidocument Summarization, Malayalam Language, Sentence Scoring, Extractive, Heuristic measures, Word Net},
        doi={10.4108/eai.27-2-2017.152340}
    }
    
  • Manju K
    David Peter S
    Sumam Mary Idicula
    Year: 2017
    An Extractive Multi-document Summarization System for Malayalam News Documents
    COMPSE
    EAI
    DOI: 10.4108/eai.27-2-2017.152340
Manju K1,*, David Peter S, Sumam Mary Idicula
  • 1: College of Engineering, Cherthala, Alappuzha, Kerala, India
*Contact email: manju@mec.ac.in

Abstract

The flood of digital data calls for a system that can take information from multiple documents and present it in summarized form. Because no automatic tool for summarizing Malayalam documents is available, this work serves as an initial effort in that direction. We investigate an extractive multi-document summarizer for the Malayalam language that uses a sentence scoring technique. An online Malayalam WordNet is used for semantic similarity checking. Each sentence is scored on a set of selected features, which are heuristic measures: sentence length, sentence position, presence of numerical data, presence of a proper noun in the sentence, and term frequency-inverse document frequency across the documents. The top-ranking sentences are selected as the initial summary. A cosine similarity measure is then applied to remove redundant sentences, and the summary is generated to the specified length. Experimental results demonstrate the effectiveness of the proposed system on the data set selected as a benchmark.
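
The pipeline described in the abstract (score sentences on heuristic features, rank them, then filter redundant sentences by cosine similarity) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the equal feature weights, the whitespace tokenizer, the `redundancy_threshold` value and the particular TF-IDF formula are all assumptions made for illustration. The proper-noun feature and the WordNet-based semantic similarity check are omitted because they would need a Malayalam tagger and the WordNet resource.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()[]"


def tokenize(sentence):
    # Assumption: whitespace tokenization with punctuation stripped.
    # A real Malayalam system would need a language-aware tokenizer.
    words = (w.strip(PUNCTUATION) for w in sentence.lower().split())
    return [w for w in words if w]


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_sentences(documents):
    """Return (score, sentence) pairs; documents is a list of sentence lists."""
    n_docs = len(documents)
    doc_tokens = [[tokenize(s) for s in doc] for doc in documents]
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for doc in doc_tokens:
        df.update({t for sent in doc for t in sent})
    max_len = max((len(s) for doc in doc_tokens for s in doc), default=1) or 1

    scored = []
    for doc, toks in zip(documents, doc_tokens):
        tf = Counter(t for sent in toks for t in sent)
        for pos, (sentence, tokens) in enumerate(zip(doc, toks)):
            if not tokens:
                continue
            # Average TF-IDF of the sentence's terms (one possible variant).
            tfidf = sum(tf[t] * math.log(1 + n_docs / df[t]) for t in tokens) / len(tokens)
            length = len(tokens) / max_len          # longer sentences score higher
            position = 1.0 / (pos + 1)              # earlier sentences score higher
            numeric = 1.0 if any(ch.isdigit() for ch in sentence) else 0.0
            # Assumption: features are combined with equal weights.
            scored.append((tfidf + length + position + numeric, sentence))
    return scored


def summarize(documents, max_sentences=3, redundancy_threshold=0.5):
    """Pick top-ranked sentences, skipping any too similar to one already chosen."""
    ranked = sorted(score_sentences(documents), key=lambda p: p[0], reverse=True)
    summary, vectors = [], []
    for _, sentence in ranked:
        vec = Counter(tokenize(sentence))
        if all(cosine(vec, v) < redundancy_threshold for v in vectors):
            summary.append(sentence)
            vectors.append(vec)
        if len(summary) == max_sentences:
            break
    return summary
```

Calling `summarize(docs, max_sentences=2)` on a list of tokenized news documents returns at most two sentences. A sentence repeated verbatim across documents is kept only once, since its cosine similarity to the already selected copy is above the threshold.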