RussianFlu-DE: A German Corpus for a Historical Epidemic with Temporal Annotation

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

Research Organisations

External Research Organisations

  • Heidelberg University

Details

Original language: English
Title of host publication: Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Proceedings
Editors: Yannis Manolopoulos, Jaap Kamps, Giannis Tsakonas, Lazaros Iliadis, Ioannis Karydis
Publisher: Springer Verlag
Pages: 61-73
Number of pages: 13
ISBN (print): 9783319670072
Publication status: Published - 2 Sept 2017
Event: 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017 - Thessaloniki, Greece
Duration: 18 Sept 2017 to 21 Sept 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10450 LNCS
ISSN (Print): 0302-9743
ISSN (electronic): 1611-3349

Abstract

Temporally annotated corpora about historic events can be crucial to digital humanities research: they make it possible to extract and date events as well as reactions to them, and to construct timelines of events and of language use, among other applications. However, producing a precise corpus of a particular event in history is very challenging due to the lack of noise-free digitized data. This paper introduces RussianFlu-DE, a temporally annotated corpus of 639 articles extracted from noisy OCR text of newspaper issues in German. All articles are about the Russian flu epidemic that took place during 1889–1893. We describe the development of RussianFlu-DE, including methods to clean different types of noise in the OCR text, and our tool for extracting articles related to the Russian flu. In addition, we discuss the task of temporal annotation using the TIMEX2 schema and present the characteristics of the corpus compared to other corpora. To show how our contribution supports epidemiology, we present some preliminary yet interesting results obtained from analyzing the articles in RussianFlu-DE. The corpus and associated tools for exploration are publicly available.

Keywords

    Corpus in German, Russian flu epidemic, Temporal annotation, TIMEX2

Cite this

RussianFlu-DE: A German Corpus for a Historical Epidemic with Temporal Annotation. / van Canh, Tran; Markert, Katja; Nejdl, Wolfgang.
Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Proceedings. ed. / Yannis Manolopoulos; Jaap Kamps; Giannis Tsakonas; Lazaros Iliadis; Ioannis Karydis. Springer Verlag, 2017. p. 61-73 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10450 LNCS).

van Canh, T, Markert, K & Nejdl, W 2017, RussianFlu-DE: A German Corpus for a Historical Epidemic with Temporal Annotation. in Y Manolopoulos, J Kamps, G Tsakonas, L Iliadis & I Karydis (eds), Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10450 LNCS, Springer Verlag, pp. 61-73, 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Thessaloniki, Greece, 18 Sept 2017. https://doi.org/10.1007/978-3-319-67008-9_6
van Canh, T., Markert, K., & Nejdl, W. (2017). RussianFlu-DE: A German Corpus for a Historical Epidemic with Temporal Annotation. In Y. Manolopoulos, J. Kamps, G. Tsakonas, L. Iliadis, & I. Karydis (Eds.), Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Proceedings (pp. 61-73). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10450 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-67008-9_6
van Canh T, Markert K, Nejdl W. RussianFlu-DE: A German Corpus for a Historical Epidemic with Temporal Annotation. In Manolopoulos Y, Kamps J, Tsakonas G, Iliadis L, Karydis I, editors, Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Proceedings. Springer Verlag. 2017. p. 61-73. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-319-67008-9_6
van Canh, Tran ; Markert, Katja ; Nejdl, Wolfgang. / RussianFlu-DE : A German Corpus for a Historical Epidemic with Temporal Annotation. Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Proceedings. editor / Yannis Manolopoulos ; Jaap Kamps ; Giannis Tsakonas ; Lazaros Iliadis ; Ioannis Karydis. Springer Verlag, 2017. pp. 61-73 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{caf02d4136e54555b08454e689ff4723,
title = "RussianFlu-DE: A German Corpus for a Historical Epidemic with Temporal Annotation",
abstract = "Temporally annotated corpora about historic events can be crucial to digital humanities research: they allow to extract and date events as well as reactions to them, and to construct timelines of events and of language use, among other applications. However, producing a precise corpus of a particular event in history is very challenging due to the lack of noise-free digitalized data. This paper introduces RussianFlu-DE, a temporally annotated corpus of 639 articles extracted from noisy OCR text of newspaper issues in German. All articles are about the Russian flu epidemic that took place during 1889–1893. We describe the development of RussianFlu-DE, including methods to clean different types of noise in the OCR text, and our tool for extracting Russian flu related articles. In addition, the task of temporal annotation using the TIMEX2 schema is discussed and the characteristics of the corpus compared to other corpora are presented. To show how our contribution supports epidemiology, we present some preliminary yet interesting results obtained from analyzing the articles in RussianFlu-DE. The corpus and associated tools for exploration are publicly available.",
keywords = "Corpus in German, Russian flu epidemic, Temporal annotation, TIMEX2",
author = "{van Canh}, Tran and Katja Markert and Wolfgang Nejdl",
note = "Funding information: Acknowledgments. This work is supported by the German Research Foundation (DFG) for the project “Tracking the Russian Flu in U.S. and German Medical and Popular Reports, 1889–1893” on Grant No. NE 638/13-1. We also thank the Austrian National Library for help in data collection.; 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017 ; Conference date: 18-09-2017 Through 21-09-2017",
year = "2017",
month = sep,
day = "2",
doi = "10.1007/978-3-319-67008-9_6",
language = "English",
isbn = "9783319670072",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "61--73",
editor = "Yannis Manolopoulos and Jaap Kamps and Giannis Tsakonas and Lazaros Iliadis and Ioannis Karydis",
booktitle = "Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Proceedings",
address = "Germany",

}

TY - GEN
T1 - RussianFlu-DE
T2 - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017
AU - van Canh, Tran
AU - Markert, Katja
AU - Nejdl, Wolfgang
N1 - Funding information: Acknowledgments. This work is supported by the German Research Foundation (DFG) for the project “Tracking the Russian Flu in U.S. and German Medical and Popular Reports, 1889–1893” on Grant No. NE 638/13-1. We also thank the Austrian National Library for help in data collection.
PY - 2017/9/2
Y1 - 2017/9/2
AB - Temporally annotated corpora about historic events can be crucial to digital humanities research: they allow to extract and date events as well as reactions to them, and to construct timelines of events and of language use, among other applications. However, producing a precise corpus of a particular event in history is very challenging due to the lack of noise-free digitalized data. This paper introduces RussianFlu-DE, a temporally annotated corpus of 639 articles extracted from noisy OCR text of newspaper issues in German. All articles are about the Russian flu epidemic that took place during 1889–1893. We describe the development of RussianFlu-DE, including methods to clean different types of noise in the OCR text, and our tool for extracting Russian flu related articles. In addition, the task of temporal annotation using the TIMEX2 schema is discussed and the characteristics of the corpus compared to other corpora are presented. To show how our contribution supports epidemiology, we present some preliminary yet interesting results obtained from analyzing the articles in RussianFlu-DE. The corpus and associated tools for exploration are publicly available.

KW - Corpus in German
KW - Russian flu epidemic
KW - Temporal annotation
KW - TIMEX2
UR - http://www.scopus.com/inward/record.url?scp=85029576750&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-67008-9_6
DO - 10.1007/978-3-319-67008-9_6
M3 - Conference contribution
AN - SCOPUS:85029576750
SN - 9783319670072
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 61
EP - 73
BT - Research and Advanced Technology for Digital Libraries - 21st International Conference on Theory and Practice of Digital Libraries, TPDL 2017, Proceedings
A2 - Manolopoulos, Yannis
A2 - Kamps, Jaap
A2 - Tsakonas, Giannis
A2 - Iliadis, Lazaros
A2 - Karydis, Ioannis
PB - Springer Verlag
Y2 - 18 September 2017 through 21 September 2017
ER -
