Scalable Online-Offline Stream Clustering in Apache Spark

Publication: Contribution to book/report/anthology/conference proceedings; Conference paper; Research; Peer-reviewed

Authors

  • Omar Backhoff
  • Eirini Ntoutsi

Organisational units

External organisations

  • Technische Universität München (TUM)

Details

Original language: English
Title of host publication: Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
Editors: Carlotta Domeniconi, Francesco Gullo, Francesco Bonchi, Josep Domingo-Ferrer, Ricardo Baeza-Yates, Zhi-Hua Zhou, Xindong Wu
Publisher: IEEE Computer Society
Pages: 37-44
Number of pages: 8
ISBN (electronic): 9781509054725
Publication status: Published - 2 July 2016
Event: 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 - Barcelona, Spain
Duration: 12 Dec 2016 to 15 Dec 2016

Publication series

Name: IEEE International Conference on Data Mining Workshops, ICDMW
Volume: 0
ISSN (Print): 2375-9232
ISSN (electronic): 2375-9259

Abstract

Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we incorporate both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.
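
The online-offline split described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the authors' Spark implementation, and every name in it (`MicroCluster`, `online_update`, `offline_clusters`, `radius`, `max_summaries`) is a hypothetical choice made for illustration. It assumes each statistical summary stores a point count, a per-dimension linear sum, and a per-dimension squared sum. It also simplifies the online phase: when the summary budget is exhausted, a point is absorbed into the nearest summary, whereas CluStream deletes or merges old micro-clusters and additionally tracks timestamps.

```python
import random


class MicroCluster:
    """Additive summary of absorbed points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def absorb(self, point):
        # Summaries are additive, so absorbing a point is a constant-size update.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def online_update(summaries, point, max_summaries, radius):
    """Online phase: fold one stream point into the set of summaries."""
    if summaries:
        nearest = min(summaries, key=lambda mc: sq_dist(mc.centroid(), point))
        close_enough = sq_dist(nearest.centroid(), point) <= radius ** 2
        # Simplification (assumption): a full budget forces absorption.
        if close_enough or len(summaries) >= max_summaries:
            nearest.absorb(point)
            return
    summaries.append(MicroCluster(point))


def offline_clusters(summaries, k, iterations=10, seed=0):
    """Offline phase: weighted k-means over summary centroids (needs k <= len(summaries))."""
    rng = random.Random(seed)
    points = [(mc.centroid(), mc.n) for mc in summaries]
    centers = [c for c, _ in rng.sample(points, k)]
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        weights = [0] * k
        for c, w in points:
            j = min(range(k), key=lambda i: sq_dist(centers[i], c))
            weights[j] += w
            for d, x in enumerate(c):
                sums[j][d] += w * x
        # Keep the old center if no summary was assigned to it.
        centers = [
            [s / weights[j] for s in sums[j]] if weights[j] else centers[j]
            for j in range(k)
        ]
    return centers


# Usage: summarize a tiny stream online, then cluster the summaries offline.
stream = [(0.0, 0.0), (0.2, 0.1), (10.0, 10.0), (9.8, 10.1), (0.1, -0.1), (10.1, 9.9)]
summaries = []
for p in stream:
    online_update(summaries, p, max_summaries=10, radius=1.0)
print(len(summaries), offline_clusters(summaries, k=2))
```

The point of the split is that the offline step only ever sees the summaries, so its cost depends on the number of summaries rather than on the length of the stream.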


Cite

Scalable Online-Offline Stream Clustering in Apache Spark. / Backhoff, Omar; Ntoutsi, Eirini.
Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. ed. / Carlotta Domeniconi; Francesco Gullo; Francesco Bonchi; Josep Domingo-Ferrer; Ricardo Baeza-Yates; Zhi-Hua Zhou; Xindong Wu. IEEE Computer Society, 2016. pp. 37-44 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0).


Backhoff, O & Ntoutsi, E 2016, Scalable Online-Offline Stream Clustering in Apache Spark. in C Domeniconi, F Gullo, F Bonchi, J Domingo-Ferrer, R Baeza-Yates, Z-H Zhou & X Wu (eds), Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016., 7836645, IEEE International Conference on Data Mining Workshops, ICDMW, vol. 0, IEEE Computer Society, pp. 37-44, 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016, Barcelona, Spain, 12 Dec 2016. https://doi.org/10.1109/ICDMW.2016.0014
Backhoff, O., & Ntoutsi, E. (2016). Scalable Online-Offline Stream Clustering in Apache Spark. In C. Domeniconi, F. Gullo, F. Bonchi, J. Domingo-Ferrer, R. Baeza-Yates, Z.-H. Zhou, & X. Wu (Eds.), Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 (pp. 37-44). Article 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0). IEEE Computer Society. https://doi.org/10.1109/ICDMW.2016.0014
Backhoff O, Ntoutsi E. Scalable Online-Offline Stream Clustering in Apache Spark. in Domeniconi C, Gullo F, Bonchi F, Domingo-Ferrer J, Baeza-Yates R, Zhou ZH, Wu X, editors, Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. IEEE Computer Society. 2016. p. 37-44. 7836645. (IEEE International Conference on Data Mining Workshops, ICDMW). doi: 10.1109/ICDMW.2016.0014
Backhoff, Omar ; Ntoutsi, Eirini. / Scalable Online-Offline Stream Clustering in Apache Spark. Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. ed. / Carlotta Domeniconi ; Francesco Gullo ; Francesco Bonchi ; Josep Domingo-Ferrer ; Ricardo Baeza-Yates ; Zhi-Hua Zhou ; Xindong Wu. IEEE Computer Society, 2016. pp. 37-44 (IEEE International Conference on Data Mining Workshops, ICDMW).
@inproceedings{a0a83325c37d4bad9ca2887826fbdc08,
title = "Scalable Online-Offline Stream Clustering in Apache Spark",
abstract = "Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we incorporate both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.",
keywords = "Apache Spark, Big data streams, CluStream, Stream clustering, Stream mining",
author = "Omar Backhoff and Eirini Ntoutsi",
year = "2016",
month = jul,
day = "2",
doi = "10.1109/ICDMW.2016.0014",
language = "English",
series = "IEEE International Conference on Data Mining Workshops, ICDMW",
publisher = "IEEE Computer Society",
pages = "37--44",
editor = "Carlotta Domeniconi and Francesco Gullo and Francesco Bonchi and Josep Domingo-Ferrer and Ricardo Baeza-Yates and Zhi-Hua Zhou and Xindong Wu",
booktitle = "Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016",
address = "United States",
note = "16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 ; Conference date: 12-12-2016 Through 15-12-2016",

}


TY - GEN
T1 - Scalable Online-Offline Stream Clustering in Apache Spark
AU - Backhoff, Omar
AU - Ntoutsi, Eirini
PY - 2016/7/2
Y1 - 2016/7/2
N2 - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we incorporate both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.
AB - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we incorporate both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.
KW - Apache Spark
KW - Big data streams
KW - CluStream
KW - Stream clustering
KW - Stream mining
UR - http://www.scopus.com/inward/record.url?scp=85015208467&partnerID=8YFLogxK
U2 - 10.1109/ICDMW.2016.0014
DO - 10.1109/ICDMW.2016.0014
M3 - Conference contribution
AN - SCOPUS:85015208467
T3 - IEEE International Conference on Data Mining Workshops, ICDMW
SP - 37
EP - 44
BT - Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
A2 - Domeniconi, Carlotta
A2 - Gullo, Francesco
A2 - Bonchi, Francesco
A2 - Domingo-Ferrer, Josep
A2 - Baeza-Yates, Ricardo
A2 - Zhou, Zhi-Hua
A2 - Wu, Xindong
PB - IEEE Computer Society
T2 - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
Y2 - 12 December 2016 through 15 December 2016
ER -