A Neural Approach for Text Extraction from Scholarly Figures

Dataset

Researchers

  • David Morris (Creator)
  • Peichen Tang (Creator)
  • Ralph Ewerth (Creator), German National Library of Science and Technology (TIB)

Research Organisations

External organisation

  • German National Library of Science and Technology (TIB)

Details

Date made available: 2019
Publisher: Forschungsdaten-Repositorium der LUH

Description

In recent years, the problem of scene text extraction from images has received extensive attention and seen significant progress. However, text extraction from scholarly figures such as plots and charts remains an open problem, in part due to the difficulty of locating irregularly placed text lines. To the best of our knowledge, the literature has not described the implementation of a text extraction system for scholarly figures that adapts deep convolutional neural networks used for scene text detection. In this paper, we propose a text extraction approach for scholarly figures that forgoes preprocessing in favor of using a deep convolutional neural network for text line localization. Our system uses a publicly available scene text detection approach whose network architecture is well suited to text extraction from scholarly figures. Training data are derived from charts in arXiv papers, which are extracted using the Allen Institute's pdffigures tool. Since this tool analyzes PDF data as a container format and recovers text locations through the mechanisms that render them, we were able to gather a large set of labeled training samples. We show significant improvement over methods in the literature and discuss the structural changes to the text extraction pipeline.
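To illustrate how such labeled samples could be assembled, the sketch below runs pdffigures over a folder of arXiv PDFs and pairs each extracted figure region with the text bounding boxes reported in the tool's JSON output. This is not the authors' actual pipeline: the folder names, the "-j" flag, and the JSON field names (Page, ImageBB, ImageText, TextBB) are assumptions for illustration and may differ across pdffigures versions.

import json
import subprocess
from pathlib import Path

PDF_DIR = Path("arxiv_pdfs")         # hypothetical folder of downloaded arXiv PDFs
OUT_DIR = Path("pdffigures_output")  # hypothetical folder for pdffigures JSON output


def run_pdffigures(pdf_path: Path) -> list[dict]:
    """Run pdffigures on one PDF and load the JSON it writes.

    The "-j" flag (write figure metadata as JSON under the given prefix) and
    the schema read below are assumptions; check the installed version.
    """
    prefix = OUT_DIR / pdf_path.stem
    subprocess.run(["pdffigures", "-j", str(prefix), str(pdf_path)], check=True)
    with open(f"{prefix}.json", encoding="utf-8") as f:
        return json.load(f)


def to_samples(figures: list[dict]) -> list[dict]:
    """Turn each extracted figure into a (figure region, text boxes) training pair."""
    samples = []
    for fig in figures:
        samples.append({
            "page": fig["Page"],                                    # assumed field name
            "figure_bbox": fig["ImageBB"],                          # assumed field name
            "text_boxes": [w["TextBB"] for w in fig["ImageText"]],  # assumed field names
        })
    return samples


if __name__ == "__main__":
    OUT_DIR.mkdir(exist_ok=True)
    samples = []
    for pdf in sorted(PDF_DIR.glob("*.pdf")):
        samples.extend(to_samples(run_pdffigures(pdf)))
    print(f"collected {len(samples)} labeled figures")

Because the text boxes come from the PDF rendering layer rather than manual annotation, a collection built this way scales to a large number of figures with little labeling effort, which is what makes it suitable as training data for a text line detector.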

We used different sources of data for training, validation, and testing. Our testing set was assembled from the work by Böschen et al. that we cited. We excluded the DeGruyter dataset from the testing set and used it as our validation dataset.

This work was financially supported by the German Federal Ministry of Education and Research (BMBF) and the European Social Fund (ESF) (InclusiveOCW project, no. 01PE17004).