PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents

Research output: Chapter in book/report/conference proceeding > Conference contribution > Research > peer review

Authors

  • Nan Zhang
  • Connor Heaton
  • Sean Timothy Okonsky
  • Prasenjit Mitra
  • Hilal Ezgi Toraman

Research Organisations

External Research Organisations

  • Pennsylvania State University

Details

Original language: English
Title of host publication: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Subtitle of host publication: LREC-COLING 2024
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Pages: 12679-12689
Number of pages: 11
ISBN (electronic): 9782493814104
Publication status: Published - 2024
Event: Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 - Hybrid, Torino, Italy
Duration: 20 May 2024 to 25 May 2024

Abstract

Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits a strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprise a significant part of the academic community and are the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.

Keywords

    Chemistry-Oriented Document Analysis, Image to Text, OCR Dataset, Optical Character Recognition (OCR)

Cite this

PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. / Zhang, Nan; Heaton, Connor; Okonsky, Sean Timothy et al.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. ed. / Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue. 2024. p. 12679-12689.

Zhang, N, Heaton, C, Okonsky, ST, Mitra, P & Toraman, HE 2024, PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. in N Calzolari, M-Y Kan, V Hoste, A Lenci, S Sakti & N Xue (eds), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. pp. 12679-12689, Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, 20 May 2024. <https://aclanthology.org/2024.lrec-main.1110/>
Zhang, N., Heaton, C., Okonsky, S. T., Mitra, P., & Toraman, H. E. (2024). PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024 (pp. 12679-12689) https://aclanthology.org/2024.lrec-main.1110/
Zhang N, Heaton C, Okonsky ST, Mitra P, Toraman HE. PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. In Calzolari N, Kan MY, Hoste V, Lenci A, Sakti S, Xue N, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. 2024. p. 12679-12689
Zhang, Nan ; Heaton, Connor ; Okonsky, Sean Timothy et al. / PEaCE : A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. editor / Nicoletta Calzolari ; Min-Yen Kan ; Veronique Hoste ; Alessandro Lenci ; Sakriani Sakti ; Nianwen Xue. 2024. pp. 12679-12689
@inproceedings{1059efc2665449009a8f5314ccebed6a,
title = "PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents",
keywords = "Chemistry-Oriented Document Analysis, Image to Text, OCR Dataset, Optical Character Recognition (OCR)",
author = "Nan Zhang and Connor Heaton and Okonsky, {Sean Timothy} and Prasenjit Mitra and Toraman, {Hilal Ezgi}",
note = "Publisher Copyright: {\textcopyright} 2024 ELRA Language Resource Association: CC BY-NC 4.0.; Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 ; Conference date: 20-05-2024 Through 25-05-2024",
year = "2024",
language = "English",
pages = "12679--12689",
editor = "Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation",

}

TY - GEN

T1 - PEaCE

T2 - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024

AU - Zhang, Nan

AU - Heaton, Connor

AU - Okonsky, Sean Timothy

AU - Mitra, Prasenjit

AU - Toraman, Hilal Ezgi

N1 - Publisher Copyright: © 2024 ELRA Language Resource Association: CC BY-NC 4.0.

PY - 2024

Y1 - 2024

KW - Chemistry-Oriented Document Analysis

KW - Image to Text

KW - OCR Dataset

KW - Optical Character Recognition (OCR)

UR - http://www.scopus.com/inward/record.url?scp=85195974335&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85195974335

SP - 12679

EP - 12689

BT - Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation

A2 - Calzolari, Nicoletta

A2 - Kan, Min-Yen

A2 - Hoste, Veronique

A2 - Lenci, Alessandro

A2 - Sakti, Sakriani

A2 - Xue, Nianwen

Y2 - 20 May 2024 through 25 May 2024

ER -