Details
Original language | English |
---|---|
Pages (from-to) | 479-493 |
Number of pages | 15 |
Journal | Transactions of the Association for Computational Linguistics |
Volume | 9 |
Publication status | Published - 4 Mar 2021 |
Abstract
Optical character recognition (OCR) is crucial for deeper access to historical collections. OCR needs to account for orthographic variations, typefaces, and language evolution (i.e., new letters, word spellings), which are the main sources of character, word, or word segmentation transcription errors. For digital corpora of historical prints, the errors are further exacerbated by low scan quality and a lack of language standardization. For the task of OCR post-hoc correction, we propose a neural approach that combines recurrent (RNN) and deep convolutional (ConvNet) networks to correct OCR transcription errors. At the character level we flexibly capture errors, and decode the corrected output based on a novel attention mechanism. Accounting for the input and output similarity, we propose a new loss function that rewards the model’s correcting behavior. Evaluation on a historical book corpus in German shows that our models are robust in capturing diverse OCR transcription errors and reduce the word error rate of 32.3% by more than 89%.
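The abstract does not say how the word error rate (WER) is computed, so the following is only a minimal sketch under two assumptions: that WER follows the common definition (word-level edit distance divided by the number of reference words), and that "by more than 89%" denotes a relative reduction of the 32.3% baseline. The function names and the sample strings are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: this is the common WER definition, not necessarily the
    exact evaluation protocol used by the authors.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def residual_wer(baseline_wer: float, relative_reduction: float) -> float:
    """WER remaining after a relative reduction (assumed interpretation)."""
    return baseline_wer * (1.0 - relative_reduction)


# Made-up example: one of three words is misrecognized.
print(word_error_rate("der alte Mann", "dcr alte Mann"))

# Figures from the abstract: 32.3% baseline, reduced by (at least) 89%.
print(f"{residual_wer(0.323, 0.89):.4f}")  # 0.0355
```

Under these assumptions, a reduction of more than 89% would leave a word error rate below roughly 3.6%.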
ASJC Scopus subject areas
- Computer Science(all)
  - Artificial Intelligence
  - Human-Computer Interaction
  - Computer Science Applications
- Social Sciences(all)
  - Linguistics and Language
  - Communication
Cite this
In: Transactions of the Association for Computational Linguistics, Vol. 9, 04.03.2021, p. 479-493.
Research output: Contribution to journal › Article › Research › peer review
TY - JOUR
T1 - Neural OCR post-hoc correction of historical corpora
AU - Lyu, Lijun
AU - Koutraki, Maria
AU - Krickl, Martin
AU - Fetahu, Besnik
N1 - Funding Information: This work was partially funded by Travelogues (DFG: 398697847 and FWF: I 3795-G28).
PY - 2021/3/4
Y1 - 2021/3/4
UR - http://www.scopus.com/inward/record.url?scp=85110460620&partnerID=8YFLogxK
U2 - 10.1162/tacl_a_00379
DO - 10.1162/tacl_a_00379
M3 - Article
AN - SCOPUS:85110460620
VL - 9
SP - 479
EP - 493
JO - Transactions of the Association for Computational Linguistics
JF - Transactions of the Association for Computational Linguistics
ER -