PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents

Nan Zhang; Connor Heaton; Sean Timothy Okonsky; Prasenjit Mitra; Hilal Ezgi Toraman

Details

Original language	English
Title of host publication	Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
Subtitle of host publication	LREC-COLING 2024
Editors	Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Pages	12679-12689
Number of pages	11
ISBN (electronic)	9782493814104
Publication status	Published - 2024
Event	Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 - Hybrid, Torino, Italy
Duration	20 May 2024 → 25 May 2024

Abstract

Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprises a significant part of the academic community and is the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.

Keywords

Chemistry-Oriented Document Analysis, Image to Text, OCR Dataset, Optical Character Recognition (OCR)

ASJC Scopus subject areas

Mathematics(all)
Theoretical Computer Science
Computer Science(all)
Computational Theory and Mathematics
Computer Science(all)
Computer Science Applications

Cite this

PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. / Zhang, Nan; Heaton, Connor; Okonsky, Sean Timothy et al.
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. ed. / Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue. 2024. p. 12679-12689.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Zhang, N, Heaton, C, Okonsky, ST, Mitra, P & Toraman, HE 2024, PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. in N Calzolari, M-Y Kan, V Hoste, A Lenci, S Sakti & N Xue (eds), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. pp. 12679-12689, Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, 20 May 2024. <https://aclanthology.org/2024.lrec-main.1110/>

Zhang, N., Heaton, C., Okonsky, S. T., Mitra, P., & Toraman, H. E. (2024). PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. In N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, & N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024 (pp. 12679-12689). https://aclanthology.org/2024.lrec-main.1110/

Zhang N, Heaton C, Okonsky ST, Mitra P, Toraman HE. PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. In Calzolari N, Kan MY, Hoste V, Lenci A, Sakti S, Xue N, editors, Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. 2024. p. 12679-12689

Zhang, Nan ; Heaton, Connor ; Okonsky, Sean Timothy et al. / PEaCE : A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation : LREC-COLING 2024. editor / Nicoletta Calzolari ; Min-Yen Kan ; Veronique Hoste ; Alessandro Lenci ; Sakriani Sakti ; Nianwen Xue. 2024. pp. 12679-12689

@inproceedings{1059efc2665449009a8f5314ccebed6a,
  title = "PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents",
  abstract = "Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprises a significant part of the academic community and is the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.",
  keywords = "Chemistry-Oriented Document Analysis, Image to Text, OCR Dataset, Optical Character Recognition (OCR)",
  author = "Nan Zhang and Connor Heaton and Okonsky, {Sean Timothy} and Prasenjit Mitra and Toraman, {Hilal Ezgi}",
  note = "Publisher Copyright: {\textcopyright} 2024 ELRA Language Resource Association: CC BY-NC 4.0.; Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024 ; Conference date: 20-05-2024 Through 25-05-2024",
  year = "2024",
  language = "English",
  pages = "12679--12689",
  editor = "Nicoletta Calzolari and Min-Yen Kan and Veronique Hoste and Alessandro Lenci and Sakriani Sakti and Nianwen Xue",
  booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation",
}

TY  - GEN
T1  - PEaCE
T2  - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
AU  - Zhang, Nan
AU  - Heaton, Connor
AU  - Okonsky, Sean Timothy
AU  - Mitra, Prasenjit
AU  - Toraman, Hilal Ezgi
PY  - 2024
Y1  - 2024
N2  - Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprises a significant part of the academic community and is the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.
AB  - Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image. While many off-the-shelf OCR models exist, they are often trained for either scientific (e.g., formulae) or generic printed English text. Extracting text from chemistry publications requires an OCR model that is capable in both realms. Nougat, a recent tool, exhibits strong ability to parse academic documents, but is unable to parse tables in PubMed articles, which comprises a significant part of the academic community and is the focus of this work. To mitigate this gap, we present the Printed English and Chemical Equations (PEaCE) dataset, containing both synthetic and real-world records, and evaluate the efficacy of transformer-based OCR models when trained on this resource. Given that real-world records contain artifacts not present in synthetic records, we propose transformations that mimic such qualities. We perform a suite of experiments to explore the impact of patch size, multi-domain training, and our proposed transformations, ultimately finding that models with a small patch size trained on multiple domains using the proposed transformations yield the best performance. Our dataset and code are available at https://github.com/ZN1010/PEaCE.
KW  - Chemistry-Oriented Document Analysis
KW  - Image to Text
KW  - OCR Dataset
KW  - Optical Character Recognition (OCR)
UR  - http://www.scopus.com/inward/record.url?scp=85195974335&partnerID=8YFLogxK
M3  - Conference contribution
AN  - SCOPUS:85195974335
SP  - 12679
EP  - 12689
BT  - Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation
A2  - Calzolari, Nicoletta
A2  - Kan, Min-Yen
A2  - Hoste, Veronique
A2  - Lenci, Alessandro
A2  - Sakti, Sakriani
A2  - Xue, Nianwen
Y2  - 20 May 2024 through 25 May 2024
ER  - 
