Mining Relevant Time for Query Subtopics in Web Archives

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Details

Original language: English
Title of host publication: WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
Pages: 1357-1362
Number of pages: 6
ISBN (electronic): 9781450334730
Publication status: Published - 18 May 2015
Event: 24th International Conference on World Wide Web, WWW 2015 - Florence, Italy
Duration: 18 May 2015 to 22 May 2015

Abstract

With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence of such processes is of great benefit to expert users such as journalists, economists and historians. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled over and over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context: time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics by relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found to be effective in solving the problem.

Keywords

    Anchor Text Mining, Result Diversification, Temporal Ranking, Temporal Subtopic

Cite this

Mining Relevant Time for Query Subtopics in Web Archives. / Nguyen, Tu Ngoc; Kanhabua, Nattiya; Nejdl, Wolfgang et al.
WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. p. 1357-1362.

Nguyen, TN, Kanhabua, N, Nejdl, W & Niederée, C 2015, Mining Relevant Time for Query Subtopics in Web Archives. in WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. pp. 1357-1362, 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, 18 May 2015. https://doi.org/10.1145/2740908.2741702
Nguyen, T. N., Kanhabua, N., Nejdl, W., & Niederée, C. (2015). Mining Relevant Time for Query Subtopics in Web Archives. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web (pp. 1357-1362) https://doi.org/10.1145/2740908.2741702
Nguyen TN, Kanhabua N, Nejdl W, Niederée C. Mining Relevant Time for Query Subtopics in Web Archives. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. p. 1357-1362 doi: 10.1145/2740908.2741702
Nguyen, Tu Ngoc ; Kanhabua, Nattiya ; Nejdl, Wolfgang et al. / Mining Relevant Time for Query Subtopics in Web Archives. WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. pp. 1357-1362
@inproceedings{dd4b9c61a08547cd973ad4996e20af72,
title = "Mining Relevant Time for Query Subtopics in Web Archives",
abstract = "With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence of such processes is of great benefit to expert users such as journalists, economists and historians. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled over and over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context: time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics by relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found to be effective in solving the problem.",
keywords = "Anchor Text Mining, Result Diversification, Temporal Ranking, Temporal Subtopic",
author = "Nguyen, {Tu Ngoc} and Nattiya Kanhabua and Wolfgang Nejdl and Claudia Nieder{\'e}e",
note = "Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the FP7 project ForgetIT under grant No. 600826.; 24th International Conference on World Wide Web, WWW 2015 ; Conference date: 18-05-2015 Through 22-05-2015",
year = "2015",
month = may,
day = "18",
doi = "10.1145/2740908.2741702",
language = "English",
pages = "1357--1362",
booktitle = "WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web",

}

TY - GEN

T1 - Mining Relevant Time for Query Subtopics in Web Archives

AU - Nguyen, Tu Ngoc

AU - Kanhabua, Nattiya

AU - Nejdl, Wolfgang

AU - Niederée, Claudia

N1 - Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the FP7 project ForgetIT under grant No. 600826.

PY - 2015/5/18

Y1 - 2015/5/18

N2 - With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence of such processes is of great benefit to expert users such as journalists, economists and historians. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled over and over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context: time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics by relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found to be effective in solving the problem.

AB - With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence of such processes is of great benefit to expert users such as journalists, economists and historians. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled over and over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context: time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics by relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found to be effective in solving the problem.

KW - Anchor Text Mining

KW - Result Diversification

KW - Temporal Ranking

KW - Temporal Subtopic

UR - http://www.scopus.com/inward/record.url?scp=84968571752&partnerID=8YFLogxK

U2 - 10.1145/2740908.2741702

DO - 10.1145/2740908.2741702

M3 - Conference contribution

AN - SCOPUS:84968571752

SP - 1357

EP - 1362

BT - WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web

T2 - 24th International Conference on World Wide Web, WWW 2015

Y2 - 18 May 2015 through 22 May 2015

ER -
