Neural OCR post-hoc correction of historical corpora

Lijun Lyu; Maria Koutraki; Martin Krickl; Besnik Fetahu

doi:10.1162/tacl_a_00379

Details

Original language	English
Pages (from-to)	479-493
Number of pages	15
Journal	Transactions of the Association for Computational Linguistics
Volume	9
Publication status	Published - 4 March 2021

Abstract

Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
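As a rough reading of the abstract's closing figures, and not a calculation taken from the paper itself: if the 89% is a relative reduction of the 32.3% baseline, the word error rate after correction would be below roughly 3.6%. The sketch below works this out; the function name `residual_error_rate` is hypothetical and the interpretation of the percentages is an assumption.

```python
def residual_error_rate(baseline: float, relative_reduction: float) -> float:
    """Return the error rate left after a relative reduction of the baseline.

    Assumption: the reduction is relative to the baseline, not an
    absolute difference in percentage points.
    """
    return baseline * (1.0 - relative_reduction)


baseline_wer = 0.323   # 32.3% word error rate, as stated in the abstract
min_reduction = 0.89   # "more than 89%", as stated in the abstract

# Because the reduction is "more than" 89%, this value is an upper bound.
upper_bound = residual_error_rate(baseline_wer, min_reduction)
print(f"WER after correction is below about {upper_bound:.2%}")
```

Since the abstract gives only a lower bound on the reduction, the computed value is an upper bound on the remaining error rate and not an exact result.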

ASJC Scopus subject areas

Computer Science (all)
Artificial Intelligence
Computer Science (all)
Human-Computer Interaction
Computer Science (all)
Computer Science Applications
Social Sciences (all)
Linguistics and Language
Social Sciences (all)
Communication

Cite this

Neural OCR post-hoc correction of historical corpora. / Lyu, Lijun; Koutraki, Maria; Krickl, Martin et al.
In: Transactions of the Association for Computational Linguistics, Vol. 9, 04.03.2021, p. 479-493.

Research output: Contribution to journal › Article › Research › Peer-reviewed

Lyu, L, Koutraki, M, Krickl, M & Fetahu, B 2021, 'Neural OCR post-hoc correction of historical corpora', Transactions of the Association for Computational Linguistics, vol. 9, pp. 479-493. https://doi.org/10.1162/tacl_a_00379

Lyu, L., Koutraki, M., Krickl, M., & Fetahu, B. (2021). Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics, 9, 479-493. https://doi.org/10.1162/tacl_a_00379

Lyu L, Koutraki M, Krickl M, Fetahu B. Neural OCR post-hoc correction of historical corpora. Transactions of the Association for Computational Linguistics. 2021 Mar 4;9:479-493. doi: 10.1162/tacl_a_00379

Lyu, Lijun ; Koutraki, Maria ; Krickl, Martin et al. / Neural OCR post-hoc correction of historical corpora. In: Transactions of the Association for Computational Linguistics. 2021 ; Vol. 9. pp. 479-493.


@article{102cf2babcf1420dafc4799d06ca7720,

title = "Neural {OCR} post-hoc correction of historical corpora",

abstract = "Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model{\textquoteright}s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3\% by more than 89\%.",

author = "Lijun Lyu and Maria Koutraki and Martin Krickl and Besnik Fetahu",

note = "Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).",

year = "2021",

month = mar,

day = "4",

doi = "10.1162/tacl_a_00379",

language = "English",

journal = "Transactions of the Association for Computational Linguistics",

volume = "9",

pages = "479--493",

}


TY - JOUR

T1 - Neural OCR post-hoc correction of historical corpora

AU - Lyu, Lijun

AU - Koutraki, Maria

AU - Krickl, Martin

AU - Fetahu, Besnik

N1 - Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).

PY - 2021/3/4

Y1 - 2021/3/4

N2 - Optical character recognition (OCR) is crucial for a deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, or language evolution (i.e., new letters, word spellings), as the main source of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated due to low scan quality and lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach based on a combination of recurrent (RNN) and deep convolutional network (ConvNet) to correct OCR transcription errors. At character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German language shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.

UR - http://www.scopus.com/inward/record.url?scp=85110460620&partnerID=8YFLogxK

U2 - 10.1162/tacl_a_00379

DO - 10.1162/tacl_a_00379

M3 - Article

AN - SCOPUS:85110460620

VL - 9

SP - 479

EP - 493

JO - Transactions of the Association for Computational Linguistics

JF - Transactions of the Association for Computational Linguistics

ER -
