Mining Relevant Time for Query Subtopics in Web Archives

Tu Ngoc Nguyen; Nattiya Kanhabua; Wolfgang Nejdl; Claudia Niederée

doi:10.1145/2740908.2741702

Details

Original language	English
Title of host publication	WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
Pages	1357-1362
Number of pages	6
ISBN (electronic)	9781450334730
Publication status	Published - 18 May 2015
Event	24th International Conference on World Wide Web, WWW 2015 - Florence, Italy Duration: 18 May 2015 → 22 May 2015

Abstract

With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence for such processes is of great benefit for expert users such as journalists, economists and historians. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled over and over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found to be effective in solving the problem.

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications
Computer Science (all)
Software

Cite this

Mining Relevant Time for Query Subtopics in Web Archives. / Nguyen, Tu Ngoc; Kanhabua, Nattiya; Nejdl, Wolfgang et al.
WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. p. 1357-1362.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Nguyen, TN, Kanhabua, N, Nejdl, W & Niederée, C 2015, Mining Relevant Time for Query Subtopics in Web Archives. in WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. pp. 1357-1362, 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, 18 May 2015. https://doi.org/10.1145/2740908.2741702

Nguyen, T. N., Kanhabua, N., Nejdl, W., & Niederée, C. (2015). Mining Relevant Time for Query Subtopics in Web Archives. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web (pp. 1357-1362). https://doi.org/10.1145/2740908.2741702

Nguyen TN, Kanhabua N, Nejdl W, Niederée C. Mining Relevant Time for Query Subtopics in Web Archives. In WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. p. 1357-1362. doi: 10.1145/2740908.2741702

Nguyen, Tu Ngoc ; Kanhabua, Nattiya ; Nejdl, Wolfgang et al. / Mining Relevant Time for Query Subtopics in Web Archives. WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. pp. 1357-1362

Download

@inproceedings{dd4b9c61a08547cd973ad4996e20af72,

title = "Mining Relevant Time for Query Subtopics in Web Archives",

abstract = "With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence for such processes is of great benefit for expert users such as journalists, economists and historians. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled over and over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found to be effective in solving the problem.",

keywords = "Anchor Text Mining, Result Diversification, Temporal Ranking, Temporal Subtopic",

author = "Nguyen, {Tu Ngoc} and Nattiya Kanhabua and Wolfgang Nejdl and Claudia Nieder{\'e}e",

note = "Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the FP7 project ForgetIT under grant No. 600826.; 24th International Conference on World Wide Web, WWW 2015 ; Conference date: 18-05-2015 Through 22-05-2015",

year = "2015",

month = may,

day = "18",

doi = "10.1145/2740908.2741702",

language = "English",

pages = "1357--1362",

booktitle = "WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web",

}

Download

TY - GEN

T1 - Mining Relevant Time for Query Subtopics in Web Archives

AU - Nguyen, Tu Ngoc

AU - Kanhabua, Nattiya

AU - Nejdl, Wolfgang

AU - Niederée, Claudia

N1 - Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the FP7 project ForgetIT under grant No. 600826.

PY - 2015/5/18

Y1 - 2015/5/18

N2 - With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence for such processes is of great benefit for expert users such as journalists, economists and historians. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled over and over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found to be effective in solving the problem.

AB - With the reflection of nearly all types of social, cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First-hand evidence for such processes is of great benefit for expert users such as journalists, economists and historians. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled over and over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredictable behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage it for relevant time mining of subtopics. This is empirically found to be effective in solving the problem.

KW - Anchor Text Mining

KW - Result Diversification

KW - Temporal Ranking

KW - Temporal Subtopic

UR - http://www.scopus.com/inward/record.url?scp=84968571752&partnerID=8YFLogxK

U2 - 10.1145/2740908.2741702

DO - 10.1145/2740908.2741702

M3 - Conference contribution

AN - SCOPUS:84968571752

SP - 1357

EP - 1362

BT - WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web

T2 - 24th International Conference on World Wide Web, WWW 2015

Y2 - 18 May 2015 through 22 May 2015

ER -
