Exploring Web Archives Through Temporal Anchor Texts

Helge Holzmann; Wolfgang Nejdl; Avishek Anand

doi:10.1145/3091478.3091500

Details

Original language	English
Title of host publication	WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference
Pages	289-298
Number of pages	10
ISBN (electronic)	9781450348966
Publication status	Published - Jun 2017
Event	9th ACM Web Science Conference, WebSci 2017 - Troy, United States
Duration	25 Jun 2017 → 28 Jun 2017

Abstract

Web archives have been instrumental in digital preservation of the Web and provide great opportunity for the study of the societal past and evolution. These Web archives are massive collections, typically in the order of terabytes and petabytes. Due to this, search and exploration of archives has been limited as full-text indexing is both resource and computationally expensive. We identify that for typical access methods to archives, which are navigational and temporal in nature, we do not always require indexing full-text. Instead, meaningful text surrogates like anchor texts already go a long way in providing meaningful solutions and can act as reasonable entry points to exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, like finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated in a data processing workflow to scale up to thousands of pages. In this analysis we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.

Keywords

Big data analysis, Temporal information retrieval, Web archives

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications

Cite this

Exploring Web Archives Through Temporal Anchor Texts. / Holzmann, Helge; Nejdl, Wolfgang; Anand, Avishek.
WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference. 2017. p. 289-298.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Holzmann, H, Nejdl, W & Anand, A 2017, Exploring Web Archives Through Temporal Anchor Texts. in WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference. pp. 289-298, 9th ACM Web Science Conference, WebSci 2017, Troy, United States, 25 Jun 2017. https://doi.org/10.1145/3091478.3091500

Holzmann, H., Nejdl, W., & Anand, A. (2017). Exploring Web Archives Through Temporal Anchor Texts. In WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference (pp. 289-298) https://doi.org/10.1145/3091478.3091500

Holzmann H, Nejdl W, Anand A. Exploring Web Archives Through Temporal Anchor Texts. In WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference. 2017. p. 289-298 doi: 10.1145/3091478.3091500

Holzmann, Helge ; Nejdl, Wolfgang ; Anand, Avishek. / Exploring Web Archives Through Temporal Anchor Texts. WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference. 2017. pp. 289-298


@inproceedings{6dac31f3f32a42f9a2f49c3a37609b5e,

title = "Exploring Web Archives Through Temporal Anchor Texts",

abstract = "Web archives have been instrumental in digital preservation of the Web and provide great opportunity for the study of the societal past and evolution. These Web archives are massive collections, typically in the order of terabytes and petabytes. Due to this, search and exploration of archives has been limited as full-text indexing is both resource and computationally expensive. We identify that for typical access methods to archives, which are navigational and temporal in nature, we do not always require indexing full-text. Instead, meaningful text surrogates like anchor texts already go a long way in providing meaningful solutions and can act as reasonable entry points to exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, like finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated in a data processing workflow to scale up to thousands of pages. In this analysis we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.",

keywords = "Big data analysis, Temporal information retrieval, Web archives",

author = "Helge Holzmann and Wolfgang Nejdl and Avishek Anand",

year = "2017",

month = jun,

doi = "10.1145/3091478.3091500",

language = "English",

pages = "289--298",

booktitle = "WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference",

note = "9th ACM Web Science Conference, WebSci 2017 ; Conference date: 25-06-2017 Through 28-06-2017",

}


TY - GEN

T1 - Exploring Web Archives Through Temporal Anchor Texts

AU - Holzmann, Helge

AU - Nejdl, Wolfgang

AU - Anand, Avishek

PY - 2017/6

Y1 - 2017/6

N2 - Web archives have been instrumental in digital preservation of the Web and provide great opportunity for the study of the societal past and evolution. These Web archives are massive collections, typically in the order of terabytes and petabytes. Due to this, search and exploration of archives has been limited as full-text indexing is both resource and computationally expensive. We identify that for typical access methods to archives, which are navigational and temporal in nature, we do not always require indexing full-text. Instead, meaningful text surrogates like anchor texts already go a long way in providing meaningful solutions and can act as reasonable entry points to exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, like finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated in a data processing workflow to scale up to thousands of pages. In this analysis we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.

AB - Web archives have been instrumental in digital preservation of the Web and provide great opportunity for the study of the societal past and evolution. These Web archives are massive collections, typically in the order of terabytes and petabytes. Due to this, search and exploration of archives has been limited as full-text indexing is both resource and computationally expensive. We identify that for typical access methods to archives, which are navigational and temporal in nature, we do not always require indexing full-text. Instead, meaningful text surrogates like anchor texts already go a long way in providing meaningful solutions and can act as reasonable entry points to exploring Web archives. In this paper, we present a new approach to searching Web archives based on temporal link graphs and corresponding anchor texts. Departing from traditional informational intents, we show how temporal anchor texts can be effective in answering queries beyond purely navigational intents, like finding the most central webpages of an entity in a given time period. We propose indexing methods and a temporal retrieval model based on anchor texts. Further, we discuss several interesting search results as well as one experiment in which we demonstrate how such results can be integrated in a data processing workflow to scale up to thousands of pages. In this analysis we were able to replicate results reported by an offline study, showing that restaurant prices indeed increased in Germany when the Euro was introduced as Europe's currency.

KW - Big data analysis

KW - Temporal information retrieval

KW - Web archives

UR - http://www.scopus.com/inward/record.url?scp=85026767746&partnerID=8YFLogxK

U2 - 10.1145/3091478.3091500

DO - 10.1145/3091478.3091500

M3 - Conference contribution

AN - SCOPUS:85026767746

SP - 289

EP - 298

BT - WebSci 2017 - Proceedings of the 2017 ACM Web Science Conference

T2 - 9th ACM Web Science Conference, WebSci 2017

Y2 - 25 June 2017 through 28 June 2017

ER -
