Industrial Networks and Intelligent Systems. 6th EAI International Conference, INISCOM 2020, Hanoi, Vietnam, August 27–28, 2020, Proceedings

Research Article

Table Structure Recognition in Scanned Images Using a Clustering Method

Cite
BibTeX Plain Text
  • @INPROCEEDINGS{10.1007/978-3-030-63083-6_12,
        author={Nam Van Nguyen and Hanh Vu and Arthur Zucker and Younes Belkada and Hai Van Do and Doanh Ngoc- Nguyen and Thanh Tuan Nguyen Le and Dong Van Hoang},
        title={Table Structure Recognition in Scanned Images Using a Clustering Method},
        proceedings={Industrial Networks and Intelligent Systems. 6th EAI International Conference, INISCOM 2020, Hanoi, Vietnam, August 27--28, 2020, Proceedings},
        proceedings_a={INISCOM},
        year={2020},
        month={11},
        keywords={Table structure recognition Object recognition Clustering method},
        doi={10.1007/978-3-030-63083-6_12}
    }
    
  • Nam Van Nguyen
    Hanh Vu
    Arthur Zucker
    Younes Belkada
    Hai Van Do
    Doanh Ngoc- Nguyen
    Thanh Tuan Nguyen Le
    Dong Van Hoang
    Year: 2020
    Table Structure Recognition in Scanned Images Using a Clustering Method
    INISCOM
    Springer
    DOI: 10.1007/978-3-030-63083-6_12
Nam Van Nguyen1,*, Hanh Vu2, Arthur Zucker3, Younes Belkada3, Hai Van Do1, Doanh Ngoc- Nguyen1, Thanh Tuan Nguyen Le1, Dong Van Hoang1
  • 1: Thuyloi University, 175 Tay Son
  • 2: Viettel CyberSpace Center, 41st floor
  • 3: Sorbonne University, Polytech Sorbonne
*Contact email: nvnam@tlu.edu.vn

Abstract

Optical Character Recognition (OCR) for scanned paper invoices is very challenging due to the variability of 19 invoice layouts, different information fields, large data tables, and low scanning quality. In this setting, table structure recognition is a critical task in which all rows, columns, and cells must be accurately positioned and extracted. Existing methods such as DeepDeSRT and TableNet have dealt only with high-quality born-digital images (e.g., PDF) with low noise and apparent table structure. This paper proposes an efficient method called CluSTi (Clustering method for recognition of the Structure of Tables in invoice scanned Images). The contributions of CluSTi are three-fold. First, it removes heavy noise from the table images using a clustering algorithm. Second, it extracts all text boxes using state-of-the-art text recognition. Third, based on horizontal and vertical clustering with optimized parameters, CluSTi groups the text boxes into their correct rows and columns, respectively. The method was evaluated on three datasets: i) 397 public scanned images; ii) 193 PDF document images from the ICDAR 2013 competition dataset; and iii) 281 PDF document images from ICDAR 2019’s numeric tables. CluSTi achieved an F1-score of 87.5%, 98.5%, and 94.5% on these datasets, respectively. Our method also outperformed DeepDeSRT, which reported an F1-score of 91.44% on only 34 images from the ICDAR 2013 competition dataset. To the best of our knowledge, CluSTi is the first method to tackle the table structure recognition problem on scanned images.
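
The row and column grouping step described in the abstract can be illustrated with a minimal sketch. This is not the CluSTi implementation: the abstract does not specify the clustering algorithm or its parameters, so the sketch assumes a simple gap-based one-dimensional clustering of text-box centers. The box format `(x_min, y_min, x_max, y_max)`, the function names `cluster_1d` and `assign_cells`, and the thresholds `row_gap` and `col_gap` are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical text-box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def cluster_1d(values: List[float], max_gap: float) -> List[int]:
    """Label each value so that sorted neighbours at most max_gap apart share a label.

    This is a simple gap-based grouping, assumed here for illustration only.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, idx in zip(order, order[1:]):
        # Start a new cluster whenever the gap to the previous value is too large.
        if values[idx] - values[prev] > max_gap:
            current += 1
        labels[idx] = current
    return labels


def assign_cells(boxes: List[Box], row_gap: float, col_gap: float) -> List[Tuple[int, int]]:
    """Return a (row, column) index for every text box.

    Rows come from clustering vertical centers, columns from horizontal centers.
    """
    y_centers = [(b[1] + b[3]) / 2 for b in boxes]
    x_centers = [(b[0] + b[2]) / 2 for b in boxes]
    rows = cluster_1d(y_centers, row_gap)
    cols = cluster_1d(x_centers, col_gap)
    return list(zip(rows, cols))


if __name__ == "__main__":
    # Four boxes laid out as a slightly misaligned 2 x 2 table.
    boxes = [
        (10, 10, 50, 20),
        (100, 12, 150, 22),
        (12, 40, 48, 50),
        (102, 41, 149, 51),
    ]
    print(assign_cells(boxes, row_gap=10, col_gap=20))
```

In this sketch the two thresholds play the role of the "optimized parameters" mentioned above; in practice they would have to be tuned to the font size and scan resolution, and center-based grouping would likely need refinement for cells that span several rows or columns.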

Keywords
Table structure recognition, Object recognition, Clustering method
Published
2020-11-21
Appears in
SpringerLink
http://dx.doi.org/10.1007/978-3-030-63083-6_12
Copyright © 2020–2025 ICST