Details
Original language | English |
---|---|
Title of host publication | CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management |
Editors | Norman Paton, Selcuk Candan, Haixun Wang, James Allan, Rakesh Agrawal, Alexandros Labrinidis, Alfredo Cuzzocrea, Mohammed Zaki, Divesh Srivastava, Andrei Broder, Assaf Schuster |
Publisher | Association for Computing Machinery (ACM) |
Pages | 1847-1850 |
Number of pages | 4 |
ISBN (electronic) | 9781450360142 |
Publication status | Published - Oct 2018 |
Event | 27th ACM International Conference on Information and Knowledge Management, CIKM 2018 - Torino, Italy. Duration: 22 Oct 2018 → 26 Oct 2018 |
Abstract
Topic detection and tracking in document streams is a critical task in many important applications and has therefore attracted research interest in recent decades. Given the large size of data streams, a number of works have proposed automatic methods for the task from different approaches. However, there are only a few small benchmark datasets that are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained ground truth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E, a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.
ASJC Scopus subject areas
- Business, Management and Accounting (all)
- General Business, Management and Accounting
- Decision Sciences (all)
- General Decision Sciences
Cite this
CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ed. / Norman Paton; Selcuk Candan; Haixun Wang; James Allan; Rakesh Agrawal; Alexandros Labrinidis; Alfredo Cuzzocrea; Mohammed Zaki; Divesh Srivastava; Andrei Broder; Assaf Schuster. Association for Computing Machinery (ACM), 2018. p. 1847-1850.
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - W2E: A Worldwide-Event Benchmark Dataset for Topic Detection and Tracking
AU - Hoang, Tuan Anh
AU - Duy Vo, Khoi
AU - Nejdl, Wolfgang
N1 - Funding information: This research is supported by the ERC Grant (339233) ALEXANDRIA.
PY - 2018/10
Y1 - 2018/10
N2 - Topic detection and tracking in document streams is a critical task in many important applications and has therefore attracted research interest in recent decades. Given the large size of data streams, a number of works have proposed automatic methods for the task from different approaches. However, there are only a few small benchmark datasets that are publicly available for evaluating the proposed methods. The lack of large datasets with fine-grained ground truth implicitly restrains the development of more advanced methods. In this work, we address this issue by collecting and publishing W2E, a large dataset consisting of news articles from more than 50 prominent mass media channels worldwide. The articles cover a large set of popular events within a full year. W2E is more than 15 times larger than TREC's TDT2 dataset, which is widely used in prior work. We further conduct exploratory analysis to examine the dynamics and diversity of W2E and propose potential uses of the dataset in other research.
KW - Benchmark dataset
KW - Topic detection
KW - Topic tracking
UR - http://www.scopus.com/inward/record.url?scp=85058027671&partnerID=8YFLogxK
U2 - 10.1145/3269206.3269309
DO - 10.1145/3269206.3269309
M3 - Conference contribution
AN - SCOPUS:85058027671
SP - 1847
EP - 1850
BT - CIKM 2018 - Proceedings of the 27th ACM International Conference on Information and Knowledge Management
A2 - Paton, Norman
A2 - Candan, Selcuk
A2 - Wang, Haixun
A2 - Allan, James
A2 - Agrawal, Rakesh
A2 - Labrinidis, Alexandros
A2 - Cuzzocrea, Alfredo
A2 - Zaki, Mohammed
A2 - Srivastava, Divesh
A2 - Broder, Andrei
A2 - Schuster, Assaf
PB - Association for Computing Machinery (ACM)
T2 - 27th ACM International Conference on Information and Knowledge Management, CIKM 2018
Y2 - 22 October 2018 through 26 October 2018
ER -