Towards extracting event-centric collections from Web archives

Publication: Contribution to journal, Article, Research, Peer-reviewed

Authors

  • Gerhard Gossen
  • Thomas Risse
  • Elena Demidova

Organisational units

External organisations

  • Goethe-Universität Frankfurt am Main

Details

Original language: English
Pages (from - to): 31-45
Number of pages: 15
Journal: International Journal on Digital Libraries
Volume: 21
Issue number: 1
Early online date: 27 Oct 2018
Publication status: Published - March 2020

Abstract

Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, currently there are no access methods that help Web archive users to efficiently access event-centric information in large-scale archives that go beyond the retrieval of individual disconnected documents. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate an efficient and intuitive access to information regarding past events. We address this problem by: (1) facilitating users to define event-centric document collections in an intuitive way through a Collection Specification; (2) development of a specialised extraction method that adapts focused crawling techniques to the Web archive settings; and (3) definition of a function to judge the relevance of the archived documents with respect to the Collection Specification taking into account the topical and temporal relevance of the documents. Our extended experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.

ASJC Scopus subject areas

Cite this

Towards extracting event-centric collections from Web archives. / Gossen, Gerhard; Risse, Thomas; Demidova, Elena.
In: International Journal on Digital Libraries, Vol. 21, No. 1, 03.2020, p. 31-45.

Gossen, G, Risse, T & Demidova, E 2020, 'Towards extracting event-centric collections from Web archives', International Journal on Digital Libraries, vol. 21, no. 1, pp. 31-45. https://doi.org/10.1007/s00799-018-0258-6
Gossen, G., Risse, T., & Demidova, E. (2020). Towards extracting event-centric collections from Web archives. International Journal on Digital Libraries, 21(1), 31-45. https://doi.org/10.1007/s00799-018-0258-6
Gossen G, Risse T, Demidova E. Towards extracting event-centric collections from Web archives. International Journal on Digital Libraries. 2020 Mar;21(1):31-45. Epub 2018 Oct 27. doi: 10.1007/s00799-018-0258-6
Gossen, Gerhard ; Risse, Thomas ; Demidova, Elena. / Towards extracting event-centric collections from Web archives. In: International Journal on Digital Libraries. 2020 ; Vol. 21, No. 1. pp. 31-45.
@article{24111780f8aa4d429e7802fa10d6f1e1,
title = "Towards extracting event-centric collections from Web archives",
abstract = "Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, currently there are no access methods that help Web archive users to efficiently access event-centric information in large-scale archives that go beyond the retrieval of individual disconnected documents. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate an efficient and intuitive access to information regarding past events. We address this problem by: (1) facilitating users to define event-centric document collections in an intuitive way through a Collection Specification; (2) development of a specialised extraction method that adapts focused crawling techniques to the Web archive settings; and (3) definition of a function to judge the relevance of the archived documents with respect to the Collection Specification taking into account the topical and temporal relevance of the documents. Our extended experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.",
keywords = "Event-centric document collections, Focused crawling, Web archives",
author = "Gerhard Gossen and Thomas Risse and Elena Demidova",
note = "Funding information: This work was partially funded by the ERC under ALEXANDRIA (ERC 339233), H2020 under SoBigData (RIA 654024) and Cleopatra (H2020-MSCA-ITN-2018-812997), and BMBF under Data4UrbanMobility (02K15A040).",
year = "2020",
month = mar,
doi = "10.1007/s00799-018-0258-6",
language = "English",
volume = "21",
pages = "31--45",
number = "1",

}

TY - JOUR
T1 - Towards extracting event-centric collections from Web archives
AU - Gossen, Gerhard
AU - Risse, Thomas
AU - Demidova, Elena
N1 - Funding information: This work was partially funded by the ERC under ALEXANDRIA (ERC 339233), H2020 under SoBigData (RIA 654024) and Cleopatra (H2020-MSCA-ITN-2018-812997), and BMBF under Data4UrbanMobility (02K15A040).
PY - 2020/3
Y1 - 2020/3
AB - Web archives constitute an increasingly important source of information for computer scientists, humanities researchers and journalists interested in studying past events. However, currently there are no access methods that help Web archive users to efficiently access event-centric information in large-scale archives that go beyond the retrieval of individual disconnected documents. In this article, we tackle the novel problem of extracting interlinked event-centric document collections from large-scale Web archives to facilitate an efficient and intuitive access to information regarding past events. We address this problem by: (1) facilitating users to define event-centric document collections in an intuitive way through a Collection Specification; (2) development of a specialised extraction method that adapts focused crawling techniques to the Web archive settings; and (3) definition of a function to judge the relevance of the archived documents with respect to the Collection Specification taking into account the topical and temporal relevance of the documents. Our extended experiments on the German Web archive (covering a time period of 19 years) demonstrate that our method enables efficient extraction of event-centric collections for different event types.
KW - Event-centric document collections
KW - Focused crawling
KW - Web archives
UR - http://www.scopus.com/inward/record.url?scp=85055897093&partnerID=8YFLogxK
U2 - 10.1007/s00799-018-0258-6
DO - 10.1007/s00799-018-0258-6
M3 - Article
AN - SCOPUS:85055897093
VL - 21
SP - 31
EP - 45
JO - International Journal on Digital Libraries
JF - International Journal on Digital Libraries
SN - 1432-5012
IS - 1
ER -