Exploring Web Archives Through Temporal Anchor Texts

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed

Authors

Helge Holzmann, Wolfgang Nejdl, Avishek Anand

Details

Original language: English
Title of host publication: WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference
Pages: 289-298
Number of pages: 10
ISBN (electronic): 9781450348966
Publication status: Published - Jun 2017
Event: 9th ACM Web Science Conference, WebSci 2017 - Troy, United States
Duration: 25 Jun 2017 to 28 Jun 2017

Abstract

Web archives have been instrumental in the digital preservation of the Web and provide a great opportunity to study the societal past and its evolution. These archives are massive collections, typically on the order of terabytes to petabytes. As a result, search and exploration of archives have been limited, because full-text indexing is both resource-intensive and computationally expensive. We observe that typical access methods to archives, which are navigational and temporal in nature, do not always require full-text indexing. Instead, text surrogates such as anchor texts already go a long way toward providing meaningful solutions and can act as reasonable entry points for exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and their corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, such as finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated into a data processing workflow to scale up to thousands of pages. In this analysis, we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.
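
To make the idea of searching by temporal anchor texts more concrete, the following is a minimal sketch and not the indexing method or retrieval model proposed in the paper, whose details are not given in this record. It assumes hypothetical link records of the form (source URL, target URL, anchor text, capture date), builds a toy inverted index over anchor terms, and ranks target pages by how many distinct sources linked to them with matching anchor text inside a time window. All URLs, names, and the ranking rule are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical link records: (source URL, target URL, anchor text, capture date).
LINKS = [
    ("http://a.example/", "http://euro.example/", "Euro introduction", date(2002, 1, 5)),
    ("http://b.example/", "http://euro.example/", "the Euro", date(2002, 3, 1)),
    ("http://c.example/", "http://euro-news.example/", "Euro news", date(2002, 2, 10)),
    ("http://d.example/", "http://euro.example/", "Euro", date(2005, 6, 1)),
]


def build_index(links):
    """Map each lowercased anchor term to the ids of the links that use it."""
    index = defaultdict(set)
    for link_id, (_source, _target, anchor, _captured) in enumerate(links):
        for term in anchor.lower().split():
            index[term].add(link_id)
    return index


def search(links, index, query, start, end):
    """Rank targets by the number of distinct sources that linked to them,
    between start and end (inclusive), with anchors containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    # Links whose anchor text contains all query terms.
    matching = set.intersection(*(index.get(term, set()) for term in terms))
    sources_per_target = defaultdict(set)
    for link_id in matching:
        source, target, _anchor, captured = links[link_id]
        if start <= captured <= end:
            sources_per_target[target].add(source)
    # Sort by descending in-link count, then by URL for a stable order.
    return sorted(
        ((target, len(sources)) for target, sources in sources_per_target.items()),
        key=lambda item: (-item[1], item[0]),
    )


INDEX = build_index(LINKS)
RESULTS_2002 = search(LINKS, INDEX, "euro", date(2002, 1, 1), date(2002, 12, 31))
```

Counting distinct in-linking sources is only a stand-in for the notion of "most central webpages" in the abstract; a real system would likely weight links and time periods differently.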

Keywords

    Big data analysis, Temporal information retrieval, Web archives

Cite this

Exploring Web Archives Through Temporal Anchor Texts. / Holzmann, Helge; Nejdl, Wolfgang; Anand, Avishek.
WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference. 2017. p. 289-298.

Holzmann, H, Nejdl, W & Anand, A 2017, Exploring Web Archives Through Temporal Anchor Texts. in WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference. pp. 289-298, 9th ACM Web Science Conference, WebSci 2017, Troy, United States, 25 Jun 2017. https://doi.org/10.1145/3091478.3091500
Holzmann, H., Nejdl, W., & Anand, A. (2017). Exploring Web Archives Through Temporal Anchor Texts. In WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference (pp. 289-298) https://doi.org/10.1145/3091478.3091500
Holzmann H, Nejdl W, Anand A. Exploring Web Archives Through Temporal Anchor Texts. In WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference. 2017. p. 289-298 doi: 10.1145/3091478.3091500
Holzmann, Helge ; Nejdl, Wolfgang ; Anand, Avishek. / Exploring Web Archives Through Temporal Anchor Texts. WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference. 2017. pp. 289-298
@inproceedings{6dac31f3f32a42f9a2f49c3a37609b5e,
title = "Exploring Web Archives Through Temporal Anchor Texts",
abstract = "Web archives have been instrumental in digital preservation of the Web and provide great opportunity for the study of the societal past and evolution. These Web archives are massive collections, typically in the order of terabytes and petabytes. Due to this, search and exploration of archives has been limited as full-text indexing is both resource and computationally expensive. We identify that for typical access methods to archives, which are navigational and temporal in nature, we do not always require indexing full-text. Instead, meaningful text surrogates like anchor texts already go a long way in providing meaningful solutions and can act as reasonable entry points to exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, like finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated in a data processing workflow to scale up to thousands of pages. In this analysis we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.",
keywords = "Big data analysis, Temporal information retrieval, Web archives",
author = "Helge Holzmann and Wolfgang Nejdl and Avishek Anand",
year = "2017",
month = jun,
doi = "10.1145/3091478.3091500",
language = "English",
pages = "289--298",
booktitle = "WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference",
note = "9th ACM Web Science Conference, WebSci 2017 ; Conference date: 25-06-2017 Through 28-06-2017",

}

TY - GEN

T1 - Exploring Web Archives Through Temporal Anchor Texts

AU - Holzmann, Helge

AU - Nejdl, Wolfgang

AU - Anand, Avishek

PY - 2017/6

Y1 - 2017/6

AB - Web archives have been instrumental in digital preservation of the Web and provide great opportunity for the study of the societal past and evolution. These Web archives are massive collections, typically in the order of terabytes and petabytes. Due to this, search and exploration of archives has been limited as full-text indexing is both resource and computationally expensive. We identify that for typical access methods to archives, which are navigational and temporal in nature, we do not always require indexing full-text. Instead, meaningful text surrogates like anchor texts already go a long way in providing meaningful solutions and can act as reasonable entry points to exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, like finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated in a data processing workflow to scale up to thousands of pages. In this analysis we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.

KW - Big data analysis

KW - Temporal information retrieval

KW - Web archives

UR - http://www.scopus.com/inward/record.url?scp=85026767746&partnerID=8YFLogxK

U2 - 10.1145/3091478.3091500

DO - 10.1145/3091478.3091500

M3 - Conference contribution

AN - SCOPUS:85026767746

SP - 289

EP - 298

BT - WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference

T2 - 9th ACM Web Science Conference, WebSci 2017

Y2 - 25 June 2017 through 28 June 2017

ER -
