Details
Original language | English |
---|---|
Title of host publication | WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web |
Pages | 1357-1362 |
Number of pages | 6 |
ISBN (electronic) | 9781450334730 |
Publication status | Published - 18 May 2015 |
Event | 24th International Conference on World Wide Web, WWW 2015 - Florence, Italy. Duration: 18 May 2015 → 22 May 2015 |
Abstract
With the reflection of nearly all types of social cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold-mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First hand evidences for such processes are of great benefit for expert users such as journalists, economists, historians, etc. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled all over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, the time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredicted behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage them for relevant time mining of subtopics. This is empirically found effective in solving the problem.
Keywords
- Anchor Text Mining, Result Diversification, Temporal Ranking, Temporal Subtopic
ASJC Scopus subject areas
- Computer Science(all)
- Computer Networks and Communications
- Software
WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web. 2015. p. 1357-1362.
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Mining Relevant Time for Query Subtopics in Web Archives
AU - Nguyen, Tu Ngoc
AU - Kanhabua, Nattiya
AU - Nejdl, Wolfgang
AU - Niederée, Claudia
N1 - Funding information: The work was partially funded by the European Commission for the ERC Advanced Grant ALEXANDRIA under grant No. 339233 and the FP7 project ForgetIT under grant No. 600826.
PY - 2015/5/18
Y1 - 2015/5/18
N2 - With the reflection of nearly all types of social cultural, societal and everyday processes of our lives in the web, web archives from organizations such as the Internet Archive have the potential of becoming huge gold-mines for temporal content analytics of many kinds (e.g., on politics, social issues, economics or media). First hand evidences for such processes are of great benefit for expert users such as journalists, economists, historians, etc. However, searching in this unique longitudinal collection of huge redundancy (pages of near-identical content are crawled all over again) is completely different from searching over the web. In this work, we present our first study of mining the temporal dynamics of subtopics by leveraging the value of anchor text along the time dimension of the enormous web archives. This task is especially useful for one important ranking problem in the web archive context, the time-aware search result diversification. Due to the time uncertainty (the lagging nature and unpredicted behavior of the crawlers), identifying the trending periods for such temporal subtopics relying solely on the timestamp annotations of the web archive (i.e., crawling times) is extremely difficult. We introduce a brute-force approach to detect a time-reliable sub-collection and propose a method to leverage them for relevant time mining of subtopics. This is empirically found effective in solving the problem.
KW - Anchor Text Mining
KW - Result Diversification
KW - Temporal Ranking
KW - Temporal Subtopic
UR - http://www.scopus.com/inward/record.url?scp=84968571752&partnerID=8YFLogxK
U2 - 10.1145/2740908.2741702
DO - 10.1145/2740908.2741702
M3 - Conference contribution
AN - SCOPUS:84968571752
SP - 1357
EP - 1362
BT - WWW 2015 Companion - Proceedings of the 24th International Conference on World Wide Web
T2 - 24th International Conference on World Wide Web, WWW 2015
Y2 - 18 May 2015 through 22 May 2015
ER -