Mining Relevant Time for Query Subtopics in Web Archives

Tu Ngoc Nguyen; Nattiya Kanhabua; Wolfgang Nejdl; Claudia Niederée

doi:10.1145/2740908.2741702

Details

Original language	English
Title of host publication	WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
Pages	1357-1362
Number of pages	6
ISBN (electronic)	9781450334730
Publication status	Published - 18 May 2015
Event	24th International Conference on World Wide Web, WWW 2015 - Florence, Italy
Duration	18 May 2015 → 22 May 2015

Abstract

With the reflection of nearly all types of social cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold-mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First hand evidences for such processes are of great benefit for expert users such as journalists, economists, historians, etc. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled all over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, the time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredicted behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage them for relevant time mining of subtopics. This is empirically found effective in solving the problem.

Keywords

Anchor Text Mining, Result Diversification, Temporal Ranking, Temporal Subtopic

ASJC Scopus subject areas

Computer Science(all)
Computer Networks and Communications
Computer Science(all)
Software

Cite this

Mining Relevant Time for Query Subtopics in Web Archives. / Nguyen, Tu Ngoc; Kanhabua, Nattiya; Nejdl, Wolfgang et al.
WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. p. 1357-1362.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Nguyen, TN, Kanhabua, N, Nejdl, W & Niederée, C 2015, Mining Relevant Time for Query Subtopics in Web Archives. in WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. pp. 1357-1362, 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, 18 May 2015. https://doi.org/10.1145/2740908.2741702

Nguyen, T. N., Kanhabua, N., Nejdl, W., & Niederée, C. (2015). Mining Relevant Time for Query Subtopics in Web Archives. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web (pp. 1357-1362). https://doi.org/10.1145/2740908.2741702

Nguyen TN, Kanhabua N, Nejdl W, Niederée C. Mining Relevant Time for Query Subtopics in Web Archives. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. p. 1357-1362 doi: 10.1145/2740908.2741702

Nguyen, Tu Ngoc ; Kanhabua, Nattiya ; Nejdl, Wolfgang et al. / Mining Relevant Time for Query Subtopics in Web Archives. WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. pp. 1357-1362

Download

@inproceedings{dd4b9c61a08547cd973ad4996e20af72,

title = "Mining Relevant Time for Query Subtopics in Web Archives",

abstract = "With the reflection of nearly all types of social cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold-mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First hand evidences for such processes are of great benefit for expert users such as journalists, economists, historians, etc. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled all over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, the time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredicted behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage them for relevant time mining of subtopics. This is empirically found effective in solving the problem.",

keywords = "Anchor Text Mining, Result Diversification, Temporal Ranking, Temporal Subtopic",

author = "Nguyen, {Tu Ngoc} and Nattiya Kanhabua and Wolfgang Nejdl and Claudia Nieder{\'e}e",

note = "Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the FP7 project ForgetIT under grant No. 600826.; 24th International Conference on World Wide Web, WWW 2015 ; Conference date: 18-05-2015 Through 22-05-2015",

year = "2015",

month = may,

day = "18",

doi = "10.1145/2740908.2741702",

language = "English",

pages = "1357--1362",

booktitle = "WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web",

}

Download

TY - GEN

T1 - Mining Relevant Time for Query Subtopics in Web Archives

AU - Nguyen, Tu Ngoc

AU - Kanhabua, Nattiya

AU - Nejdl, Wolfgang

AU - Niederée, Claudia

N1 - Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the FP7 project ForgetIT under grant No. 600826.

PY - 2015/5/18

Y1 - 2015/5/18

N2 - With the reflection of nearly all types of social cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold-mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First hand evidences for such processes are of great benefit for expert users such as journalists, economists, historians, etc. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled all over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, the time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredicted behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage them for relevant time mining of subtopics. This is empirically found effective in solving the problem.

AB - With the reflection of nearly all types of social cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold-mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First hand evidences for such processes are of great benefit for expert users such as journalists, economists, historians, etc. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled all over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, the time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredicted behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage them for relevant time mining of subtopics. This is empirically found effective in solving the problem.

KW - Anchor Text Mining

KW - Result Diversification

KW - Temporal Ranking

KW - Temporal Subtopic

UR - http://www.scopus.com/inward/record.url?scp=84968571752&partnerID=8YFLogxK

U2 - 10.1145/2740908.2741702

DO - 10.1145/2740908.2741702

M3 - Conference contribution

AN - SCOPUS:84968571752

SP - 1357

EP - 1362

BT - WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web

T2 - 24th International Conference on World Wide Web, WWW 2015

Y2 - 18 May 2015 through 22 May 2015

ER -

Research@Leibniz University


By the same author(s)

Harnessing Empathy and Ethics for Relevance Detection and Information Categorization in Climate and COVID-19 Tweets

Open benchmark for filtering techniques in entity resolution

Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions

Adaptive Dispatching of Mobile Charging Stations using Multi-Agent Graph Convolutional Cooperative-Competitive Reinforcement Learning

Robust Fusion of Time Series and Image Data for Improved Multimodal Clinical Prediction
