W2E: A Worldwide-Event Benchmark Dataset for Topic Detection and Tracking

Research output: Chapter in book/report/conference proceeding, Conference contribution, Research, peer review

Details

Original language: English
Title of host publication: CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management
Editors: Norman Paton, Selcuk Candan, Haixun Wang, James Allan, Rakesh Agrawal, Alexandros Labrinidis, Alfredo Cuzzocrea, Mohammed Zaki, Divesh Srivastava, Andrei Broder, Assaf Schuster
Publisher: Association for Computing Machinery (ACM)
Pages: 1847-1850
Number of pages: 4
ISBN (electronic): 9781450360142
Publication status: Published - Oct 2018
Event: 27th ACM International Conference on Information and Knowledge Management, CIKM 2018 - Torino, Italy
Duration: 22 Oct 2018 to 26 Oct 2018

Abstract

Topic detection and tracking in document streams is a critical task in many important applications and has therefore attracted research interest in recent decades. Given the large size of data streams, a number of works taking different approaches have proposed automatic methods for the task. However, only a few small benchmark datasets are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained ground truth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E, a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.

Keywords

    Benchmark dataset, Topic detection, Topic tracking

Cite this

W2E: A Worldwide-Event Benchmark Dataset for Topic Detection and Tracking. / Hoang, Tuan Anh; Duy Vo, Khoi; Nejdl, Wolfgang.
CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ed. / Norman Paton; Selcuk Candan; Haixun Wang; James Allan; Rakesh Agrawal; Alexandros Labrinidis; Alfredo Cuzzocrea; Mohammed Zaki; Divesh Srivastava; Andrei Broder; Assaf Schuster. Association for Computing Machinery (ACM), 2018. p. 1847-1850.

Hoang, TA, Duy Vo, K & Nejdl, W 2018, W2E: A Worldwide-Event Benchmark Dataset for Topic Detection and Tracking. in N Paton, S Candan, H Wang, J Allan, R Agrawal, A Labrinidis, A Cuzzocrea, M Zaki, D Srivastava, A Broder & A Schuster (eds), CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management. Association for Computing Machinery (ACM), pp. 1847-1850, 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, 22 Oct 2018. https://doi.org/10.1145/3269206.3269309
@inproceedings{52d04e4442cf45fd8e8956d60b54fc99,
title = "W2E: A Worldwide-Event Benchmark Dataset for Topic Detection and Tracking",
abstract = "Topic detection and tracking in document streams is a critical task in many important applications and has therefore attracted research interest in recent decades. Given the large size of data streams, a number of works taking different approaches have proposed automatic methods for the task. However, only a few small benchmark datasets are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained ground truth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E, a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.",
keywords = "Benchmark dataset, Topic detection, Topic tracking",
author = "Hoang, {Tuan Anh} and {Duy Vo}, Khoi and Wolfgang Nejdl",
note = "Funding information: This research is supported by the ERC Grant (339233) ALEXANDRIA.; 27th ACM International Conference on Information and Knowledge Management, CIKM 2018 ; Conference date: 22-10-2018 Through 26-10-2018",
year = "2018",
month = oct,
doi = "10.1145/3269206.3269309",
language = "English",
pages = "1847--1850",
editor = "Norman Paton and Selcuk Candan and Haixun Wang and James Allan and Rakesh Agrawal and Alexandros Labrinidis and Alfredo Cuzzocrea and Mohammed Zaki and Divesh Srivastava and Andrei Broder and Assaf Schuster",
booktitle = "CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management",
publisher = "Association for Computing Machinery (ACM)",
address = "United States",

}

TY - GEN

T1 - W2E: A Worldwide-Event Benchmark Dataset for Topic Detection and Tracking

AU - Hoang, Tuan Anh

AU - Duy Vo, Khoi

AU - Nejdl, Wolfgang

N1 - Funding information: This research is supported by the ERC Grant (339233) ALEXANDRIA.

PY - 2018/10

Y1 - 2018/10

N2 - Topic detection and tracking in document streams is a critical task in many important applications and has therefore attracted research interest in recent decades. Given the large size of data streams, a number of works taking different approaches have proposed automatic methods for the task. However, only a few small benchmark datasets are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained ground truth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E, a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.

AB - Topic detection and tracking in document streams is a critical task in many important applications and has therefore attracted research interest in recent decades. Given the large size of data streams, a number of works taking different approaches have proposed automatic methods for the task. However, only a few small benchmark datasets are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained ground truth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E, a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.

KW - Benchmark dataset

KW - Topic detection

KW - Topic tracking

UR - http://www.scopus.com/inward/record.url?scp=85058027671&partnerID=8YFLogxK

U2 - 10.1145/3269206.3269309

DO - 10.1145/3269206.3269309

M3 - Conference contribution

AN - SCOPUS:85058027671

SP - 1847

EP - 1850

BT - CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management

A2 - Paton, Norman

A2 - Candan, Selcuk

A2 - Wang, Haixun

A2 - Allan, James

A2 - Agrawal, Rakesh

A2 - Labrinidis, Alexandros

A2 - Cuzzocrea, Alfredo

A2 - Zaki, Mohammed

A2 - Srivastava, Divesh

A2 - Broder, Andrei

A2 - Schuster, Assaf

PB - Association for Computing Machinery (ACM)

T2 - 27th ACM International Conference on Information and Knowledge Management, CIKM 2018

Y2 - 22 October 2018 through 26 October 2018

ER -
