Neural OCR post-hoc correction of historical corpora

Publication: Contribution to journal, Article, Research, Peer-reviewed

Authors

  • Lijun Lyu
  • Maria Koutraki
  • Martin Krickl
  • Besnik Fetahu

Organizational units

External organizations

  • Österreichische Nationalbibliothek
  • Amazon.com, Inc.

Details

Original language: English
Pages (from-to): 479-493
Number of pages: 15
Journal: Transactions of the Association for Computational Linguistics
Volume: 9
Publication status: Published - 4 March 2021

Abstract

Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
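
The word error rate figures quoted in the abstract can be made concrete with a minimal sketch. It assumes the common definition of word error rate (word-level edit distance divided by the number of reference words); the exact evaluation protocol of the paper is not given in this record, and the function names `word_error_rate` and `wer_after_relative_reduction` are hypothetical names chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit cost for every
    substitution, insertion, and deletion.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def wer_after_relative_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """Residual error rate after removing a given fraction of the baseline errors."""
    return baseline_wer * (1.0 - relative_reduction)


if __name__ == "__main__":
    # One wrong word out of three reference words.
    print(word_error_rate("die alte stadt", "dic alte stadt"))
    # Baseline of 32.3% reduced by 89%, both taken from the abstract.
    print(wer_after_relative_reduction(0.323, 0.89))
```

Under these assumptions, a baseline of 32.3% reduced by 89% leaves a residual word error rate of roughly 3.6%; since the abstract reports a reduction of more than 89%, the reported residual would lie below that value.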

Cite this

Neural OCR post-hoc correction of historical corpora. / Lyu, Lijun; Koutraki, Maria; Krickl, Martin et al.
In: Transactions of the Association for Computational Linguistics, Vol. 9, 04.03.2021, p. 479-493.

Lyu, L, Koutraki, M, Krickl, M & Fetahu, B 2021, 'Neural OCR post-hoc correction of historical corpora', Transactions of the Association for Computational Linguistics, vol. 9, pp. 479-493. https://doi.org/10.1162/tacl_a_00379
Lyu, L., Koutraki, M., Krickl, M., & Fetahu, B. (2021). Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics, 9, 479-493. https://doi.org/10.1162/tacl_a_00379
Lyu L, Koutraki M, Krickl M, Fetahu B. Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics. 2021 Mar 4;9:479-493. doi: 10.1162/tacl_a_00379
Lyu, Lijun ; Koutraki, Maria ; Krickl, Martin et al. / Neural OCR post-hoc correction of historical corpora. In: Transactions of the Association for Computational Linguistics. 2021 ; Vol. 9. pp. 479-493.
@article{102cf2babcf1420dafc4799d06ca7720,
title = "Neural {OCR} post-hoc correction of historical corpora",
abstract = "Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model{\textquoteright}s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3\% by more than 89\%.",
author = "Lijun Lyu and Maria Koutraki and Martin Krickl and Besnik Fetahu",
note = "Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).",
year = "2021",
month = mar,
day = "4",
doi = "10.1162/tacl_a_00379",
language = "English",
volume = "9",
pages = "479--493",
journal = "Transactions of the Association for Computational Linguistics",
}

TY  - JOUR
T1  - Neural OCR post-hoc correction of historical corpora
AU  - Lyu, Lijun
AU  - Koutraki, Maria
AU  - Krickl, Martin
AU  - Fetahu, Besnik
N1  - Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).
PY  - 2021/3/4
Y1  - 2021/3/4
N2  - Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
AB  - Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
UR  - http://www.scopus.com/inward/record.url?scp=85110460620&partnerID=8YFLogxK
U2  - 10.1162/tacl_a_00379
DO  - 10.1162/tacl_a_00379
M3  - Article
AN  - SCOPUS:85110460620
VL  - 9
SP  - 479
EP  - 493
JO  - Transactions of the Association for Computational Linguistics
JF  - Transactions of the Association for Computational Linguistics
ER  -