Neural OCR Post-Hoc Correction of Historical Corpora

Research output: Contribution to journal › Article › Research › peer review

Authors

  • Lijun Lyu
  • Maria Koutraki
  • Martin Krickl
  • Besnik Fetahu

Research Organisations

External Research Organisations

  • Austrian National Library
  • Amazon.com, Inc.

Details

Original language: English
Pages (from-to): 479-493
Number of pages: 15
Journal: Transactions of the Association for Computational Linguistics
Volume: 9
Publication status: Published - 4 Mar 2021

Abstract

Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
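The sketch below illustrates, in PyTorch, the kind of architecture the abstract describes: a character-level corrector with a convolutional plus recurrent encoder over the noisy OCR input, an attention-based decoder, and a loss that weights corrections more heavily than plain copies. Layer sizes, the vocabulary, the character-alignment assumption, and the weighting scheme are illustrative guesses, not the authors' published configuration.

# Minimal sketch in the spirit of the abstract above; all details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CharCorrector(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # ConvNet branch: local character n-gram features of the noisy input.
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        # Recurrent branch: longer-range context over the convolved features.
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.GRUCell(emb_dim + 2 * hidden, hidden)
        self.attn = nn.Linear(hidden, 2 * hidden)  # bilinear-style attention scores
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src, tgt_in):
        # src: (B, S) noisy OCR character ids.
        # tgt_in: (B, T) gold characters shifted right (teacher forcing).
        x = self.embed(src)                                         # (B, S, E)
        x = F.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)    # (B, S, H)
        enc, _ = self.encoder(x)                                    # (B, S, 2H)
        h = enc.new_zeros(src.size(0), self.decoder.hidden_size)
        logits = []
        for t in range(tgt_in.size(1)):
            scores = torch.bmm(enc, self.attn(h).unsqueeze(2)).squeeze(2)        # (B, S)
            ctx = torch.bmm(F.softmax(scores, dim=1).unsqueeze(1), enc).squeeze(1)
            h = self.decoder(torch.cat([self.embed(tgt_in[:, t]), ctx], dim=1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                           # (B, T, V)

def correction_loss(logits, tgt_out, src, copy_weight=0.5):
    # Hypothetical stand-in for a loss that "rewards correcting behaviour":
    # cross-entropy that down-weights positions where the gold character already
    # equals the aligned OCR character, so actual edits dominate the signal.
    # Assumes src and tgt_out are character-aligned (same length).
    ce = F.cross_entropy(logits.transpose(1, 2), tgt_out, reduction="none")  # (B, T)
    weights = torch.where(tgt_out == src,
                          torch.full_like(ce, copy_weight),
                          torch.ones_like(ce))
    return (weights * ce).mean()

On the figures reported in the abstract, a corpus-level word error rate of 32.3% reduced by more than 89% corresponds to a residual word error rate below roughly 3.6%.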


Cite this

Neural OCR post-hoc correction of historical corpora. / Lyu, Lijun; Koutraki, Maria; Krickl, Martin et al.
In: Transactions of the Association for Computational Linguistics, Vol. 9, 04.03.2021, p. 479-493.

Lyu, L, Koutraki, M, Krickl, M & Fetahu, B 2021, 'Neural OCR post-hoc correction of historical corpora', Transactions of the Association for Computational Linguistics, vol. 9, pp. 479-493. https://doi.org/10.1162/tacl_a_00379
Lyu, L., Koutraki, M., Krickl, M., & Fetahu, B. (2021). Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics, 9, 479-493. https://doi.org/10.1162/tacl_a_00379
Lyu L, Koutraki M, Krickl M, Fetahu B. Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics. 2021 Mar 4;9:479-493. doi: 10.1162/tacl_a_00379
Lyu, Lijun ; Koutraki, Maria ; Krickl, Martin et al. / Neural OCR post-hoc correction of historical corpora. In: Transactions of the Association for Computational Linguistics. 2021 ; Vol. 9. pp. 479-493.
@article{102cf2babcf1420dafc4799d06ca7720,
title = "Neural OCR post-hoc correction of historical corpora",
abstract = "Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model{\textquoteright}s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.",
author = "Lijun Lyu and Maria Koutraki and Martin Krickl and Besnik Fetahu",
note = "Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).",
year = "2021",
month = mar,
day = "4",
doi = "10.1162/tacl_a_00379",
language = "English",
volume = "9",
pages = "479--493",
journal = "Transactions of the Association for Computational Linguistics",

}


TY - JOUR

T1 - Neural OCR post-hoc correction of historical corpora

AU - Lyu, Lijun

AU - Koutraki, Maria

AU - Krickl, Martin

AU - Fetahu, Besnik

N1 - Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).

PY - 2021/3/4

Y1 - 2021/3/4

N2 - Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.

AB - Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.

UR - http://www.scopus.com/inward/record.url?scp=85110460620&partnerID=8YFLogxK

U2 - 10.1162/tacl_a_00379

DO - 10.1162/tacl_a_00379

M3 - Article

AN - SCOPUS:85110460620

VL - 9

SP - 479

EP - 493

JO - Transactions of the Association for Computational Linguistics

JF - Transactions of the Association for Computational Linguistics

ER -