Details
Original language | English |
---|---|
Title of host publication | Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 |
Editors | Carlotta Domeniconi, Francesco Gullo, Francesco Bonchi, Josep Domingo-Ferrer, Ricardo Baeza-Yates, Zhi-Hua Zhou, Xindong Wu |
Publisher | IEEE Computer Society |
Pages | 37-44 |
Number of pages | 8 |
ISBN (electronic) | 9781509054725 |
Publication status | Published - 2 Jul 2016 |
Event | 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 - Barcelona, Spain Duration: 12 Dec 2016 → 15 Dec 2016 |
Publication series
Name | IEEE International Conference on Data Mining Workshops, ICDMW |
---|---|
Volume | 0 |
ISSN (Print) | 2375-9232 |
ISSN (electronic) | 2375-9259 |
Abstract
Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach, while being more efficient.
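The online-offline split described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' Spark implementation: the names `MicroCluster`, `online_phase`, `offline_phase` and the fixed `max_radius` threshold are assumptions made for illustration, and the sketch leaves out timestamps, the micro-cluster budget with merging and deletion, and any distribution across workers. It only shows additive summaries being built online and then clustered offline.

```python
import math
import random


class MicroCluster:
    """Additive summary of absorbed points: count and per-dimension linear sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)

    def add(self, point):
        # Summaries are additive, so absorbing a point is a constant-time update.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x

    def centroid(self):
        return [s / self.n for s in self.ls]


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def online_phase(stream, max_radius=1.0):
    """Absorb each point into the nearest micro-cluster, or open a new one.

    Assumption: a fixed distance threshold stands in for a data-driven boundary.
    """
    mcs = []
    for p in stream:
        if mcs:
            nearest = min(mcs, key=lambda m: dist(m.centroid(), p))
            if dist(nearest.centroid(), p) <= max_radius:
                nearest.add(p)
                continue
        mcs.append(MicroCluster(p))
    return mcs


def offline_phase(mcs, k, iters=20, seed=0):
    """Weighted k-means over micro-cluster centroids (weights are point counts)."""
    rng = random.Random(seed)
    pts = [(m.centroid(), m.n) for m in mcs]
    centers = [c for c, _ in rng.sample(pts, k)]
    dim = len(centers[0])
    for _ in range(iters):
        sums = [[0.0] * dim for _ in range(k)]
        weights = [0] * k
        for c, w in pts:
            j = min(range(k), key=lambda idx: dist(centers[idx], c))
            weights[j] += w
            for i, x in enumerate(c):
                sums[j][i] += w * x
        # Keep the old center if no micro-cluster was assigned to it.
        centers = [
            [s / weights[j] for s in sums[j]] if weights[j] else centers[j]
            for j in range(k)
        ]
    return centers
```

Calling `online_phase` on a list of numeric tuples and passing the result to `offline_phase` with the desired `k` yields the final cluster centers. The offline step touches only the summaries, never the raw stream, which is what makes the two-phase design suited to unbounded data.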
Keywords
- Apache Spark, Big data streams, CluStream, Stream clustering, Stream mining
ASJC Scopus subject areas
- Computer Science(all)
- Computer Science Applications
- Software
Cite this
Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. ed. / Carlotta Domeniconi; Francesco Gullo; Francesco Bonchi; Josep Domingo-Ferrer; Ricardo Baeza-Yates; Zhi-Hua Zhou; Xindong Wu. IEEE Computer Society, 2016. p. 37-44 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Scalable Online-Offline Stream Clustering in Apache Spark
AU - Backhoff, Omar
AU - Ntoutsi, Eirini
PY - 2016/7/2
Y1 - 2016/7/2
N2 - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach, while being more efficient.
AB - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach, while being more efficient.
KW - Apache Spark
KW - Big data streams
KW - CluStream
KW - Stream clustering
KW - Stream mining
UR - http://www.scopus.com/inward/record.url?scp=85015208467&partnerID=8YFLogxK
U2 - 10.1109/ICDMW.2016.0014
DO - 10.1109/ICDMW.2016.0014
M3 - Conference contribution
AN - SCOPUS:85015208467
T3 - IEEE International Conference on Data Mining Workshops, ICDMW
SP - 37
EP - 44
BT - Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
A2 - Domeniconi, Carlotta
A2 - Gullo, Francesco
A2 - Bonchi, Francesco
A2 - Domingo-Ferrer, Josep
A2 - Baeza-Yates, Ricardo
A2 - Zhou, Zhi-Hua
A2 - Wu, Xindong
PB - IEEE Computer Society
T2 - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
Y2 - 12 December 2016 through 15 December 2016
ER -