Neural OCR post-hoc correction of historical corpora

Lijun Lyu; Maria Koutraki; Martin Krickl; Besnik Fetahu

doi:10.1162/tacl_a_00379

Details

Original language	English
Pages (from-to)	479-493
Number of pages	15
Journal	Transactions of the Association for Computational Linguistics
Volume	9
Publication status	Published - 4 Mar 2021

Abstract

Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
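The closing figure in the abstract is easier to read with the arithmetic spelled out. The sketch below assumes the standard definition of word error rate (word-level edit distance divided by the number of reference words) and reads "by more than 89%" as a relative reduction, since an absolute drop of 89 points from 32.3% is not possible. The function names and the sample strings are hypothetical illustrations; they are not taken from the paper or its corpus, and the paper's exact evaluation procedure is not stated in this record.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # delete a reference word
                            curr[j - 1] + 1,      # insert a hypothesis word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1] / len(ref)


def wer_after_relative_reduction(baseline: float, reduction: float) -> float:
    """Residual error rate after a relative reduction of the baseline."""
    return baseline * (1.0 - reduction)


# Illustrative strings only (assumption): two of six words are misrecognised.
reference = "die Reise nach Wien war lang"
ocr_output = "dic Reise nach Wicn war lang"
sample_wer = word_error_rate(reference, ocr_output)

# Figures from the abstract: baseline WER 32.3%, relative reduction 89%.
residual_wer = wer_after_relative_reduction(0.323, 0.89)
print(f"sample WER: {sample_wer:.3f}")
print(f"residual WER at an 89% reduction: {residual_wer:.4f}")
```

Under this reading, a reduction of exactly 89% would leave a word error rate of about 3.55%, so "more than 89%" corresponds to a residual rate below that value.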

ASJC Scopus subject areas

Computer Science (all)	Artificial Intelligence; Human-Computer Interaction; Computer Science Applications
Social Sciences (all)	Linguistics and Language; Communication

Cite this

Neural OCR post-hoc correction of historical corpora. / Lyu, Lijun; Koutraki, Maria; Krickl, Martin et al.
In: Transactions of the Association for Computational Linguistics, Vol. 9, 04.03.2021, p. 479-493.

Research output: Contribution to journal › Article › Research › peer review

Lyu, L, Koutraki, M, Krickl, M & Fetahu, B 2021, 'Neural OCR post-hoc correction of historical corpora', Transactions of the Association for Computational Linguistics, vol. 9, pp. 479-493. https://doi.org/10.1162/tacl_a_00379

Lyu, L., Koutraki, M., Krickl, M., & Fetahu, B. (2021). Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics, 9, 479-493. https://doi.org/10.1162/tacl_a_00379

Lyu L, Koutraki M, Krickl M, Fetahu B. Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics. 2021 Mar 4;9:479-493. doi: 10.1162/tacl_a_00379

Lyu, Lijun; Koutraki, Maria; Krickl, Martin et al. / Neural OCR post-hoc correction of historical corpora. In: Transactions of the Association for Computational Linguistics. 2021; Vol. 9. pp. 479-493.

@article{102cf2babcf1420dafc4799d06ca7720,
  title = "Neural {OCR} post-hoc correction of historical corpora",
  abstract = "Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model{\textquoteright}s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3\% by more than 89\%.",
  author = "Lijun Lyu and Maria Koutraki and Martin Krickl and Besnik Fetahu",
  note = "Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).",
  journal = "Transactions of the Association for Computational Linguistics",
  year = "2021",
  month = mar,
  day = "4",
  doi = "10.1162/tacl_a_00379",
  language = "English",
  volume = "9",
  pages = "479--493",
}

TY  - JOUR
T1  - Neural OCR post-hoc correction of historical corpora
AU  - Lyu, Lijun
AU  - Koutraki, Maria
AU  - Krickl, Martin
AU  - Fetahu, Besnik
N1  - Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).
PY  - 2021/3/4
Y1  - 2021/3/4
AB  - Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
UR  - http://www.scopus.com/inward/record.url?scp=85110460620&partnerID=8YFLogxK
U2  - 10.1162/tacl_a_00379
DO  - 10.1162/tacl_a_00379
M3  - Article
AN  - SCOPUS:85110460620
VL  - 9
SP  - 479
EP  - 493
JO  - Transactions of the Association for Computational Linguistics
JF  - Transactions of the Association for Computational Linguistics
ER  - 
