sis 18: e21

Research Article

An approach for fast compressed text matching and to avoid false matching using WBTC and wavelet tree

  • @ARTICLE{10.4108/eai.23-10-2020.166717,
        author={Shashank Srivastav and P. K. Singh and Divakar Yadav},
        title={An approach for fast compressed text matching and to avoid false matching using WBTC and wavelet tree},
        journal={EAI Endorsed Transactions on Scalable Information Systems: Online First},
        volume={},
        number={},
        publisher={EAI},
        journal_a={SIS},
        year={2020},
        month={10},
        keywords={Modern Information Retrieval, Wavelet Tree, Word-Based Tagged Code, Compressed Pattern Matching},
        doi={10.4108/eai.23-10-2020.166717}
    }
    
  • Shashank Srivastav
    P. K. Singh
    Divakar Yadav
    Year: 2020
    An approach for fast compressed text matching and to avoid false matching using WBTC and wavelet tree
    SIS
    EAI
    DOI: 10.4108/eai.23-10-2020.166717
Shashank Srivastav1, P. K. Singh1, Divakar Yadav2,*
  • 1: Madan Mohan Malaviya University of Technology, Gorakhpur, Uttar Pradesh, India
  • 2: National Institute of Technology, Hamirpur, Himachal Pradesh, India
*Contact email: dsy99@rediffmail.com

Abstract

Text matching is the process of finding how often a text pattern occurs in a corpus. Storing, processing, and retrieving a vast volume of text data is very costly. In this paper, we present a method that stores a massive text corpus in less memory by using text compression and retrieves results by matching directly on the compressed corpus, without decompression, using compressed pattern matching (CPM). The proposed approach also reduces the time taken to perform matching without introducing false matches. We use word-based tagged coding (WBTC) to compress the text and wavelet trees to represent the compressed text in memory. The proposed method, Text Matching in Compressed text using Parallel Wavelet Tree (TMC_PWT), is fast in comparison to other existing text matching algorithms that support CPM. In the context of CPM, the proposed method provides a good compression ratio and does not suffer from the problem of false matching.
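To make the role of the wavelet tree concrete, the following is a minimal Python sketch of counting word occurrences with a wavelet tree built over integer word identifiers. It is not the TMC_PWT method and does not implement WBTC: the plain word-to-integer mapping stands in for the compressed codewords, and the names `WaveletTree`, `build_index`, and `count_word` are hypothetical, introduced only for illustration. Because every word is matched as a whole symbol, a query cannot match across word boundaries, which is the intuition behind avoiding false matches in word-based schemes.

```python
from typing import Dict, List, Optional, Tuple


class WaveletTree:
    """Balanced wavelet tree over integer symbols in the range [lo, hi]."""

    def __init__(self, seq: List[int], lo: Optional[int] = None,
                 hi: Optional[int] = None) -> None:
        if lo is None or hi is None:
            lo, hi = (min(seq), max(seq)) if seq else (0, 0)
        self.lo, self.hi = lo, hi
        self.left: Optional["WaveletTree"] = None
        self.right: Optional["WaveletTree"] = None
        # prefix[i] = how many of the first i symbols go to the left child
        self.prefix: List[int] = [0]
        if lo == hi or not seq:
            return  # leaf node or empty node: nothing to split
        mid = (lo + hi) // 2
        for s in seq:
            self.prefix.append(self.prefix[-1] + (1 if s <= mid else 0))
        self.left = WaveletTree([s for s in seq if s <= mid], lo, mid)
        self.right = WaveletTree([s for s in seq if s > mid], mid + 1, hi)

    def rank(self, symbol: int, i: int) -> int:
        """Return the number of occurrences of symbol in the first i positions."""
        if symbol < self.lo or symbol > self.hi:
            return 0
        if self.lo == self.hi:
            return i  # every symbol that reached this leaf equals self.lo
        if self.left is None or self.right is None:
            return 0  # empty internal node
        mid = (self.lo + self.hi) // 2
        went_left = self.prefix[i]
        if symbol <= mid:
            return self.left.rank(symbol, went_left)
        return self.right.rank(symbol, i - went_left)


def build_index(text: str) -> Tuple[WaveletTree, Dict[str, int], int]:
    """Map each distinct word to an integer ID (assumed stand-in for a
    codeword) and build a wavelet tree over the resulting ID sequence."""
    vocab: Dict[str, int] = {}
    seq: List[int] = []
    for word in text.split():
        seq.append(vocab.setdefault(word, len(vocab)))
    return WaveletTree(seq), vocab, len(seq)


def count_word(tree: WaveletTree, vocab: Dict[str, int],
               length: int, word: str) -> int:
    """Count whole-word occurrences without scanning the original text."""
    if word not in vocab:
        return 0
    return tree.rank(vocab[word], length)
```

For example, after `tree, vocab, n = build_index("to be or not to be that is the question")`, calling `count_word(tree, vocab, n, "to")` answers the query with one rank operation per tree level rather than a scan of the text. This sketch stores prefix counts in plain Python lists; a space-efficient implementation would use succinct bitvectors instead.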