Scalable Online-Offline Stream Clustering in Apache Spark

Omar Backhoff; Eirini Ntoutsi

doi:10.1109/ICDMW.2016.0014

Details

Original language	English
Title of host publication	Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
Editors	Carlotta Domeniconi, Francesco Gullo, Francesco Bonchi, Josep Domingo-Ferrer, Ricardo Baeza-Yates, Zhi-Hua Zhou, Xindong Wu
Publisher	IEEE Computer Society
Pages	37-44
Number of pages	8
ISBN (electronic)	9781509054725
Publication status	Published - 2 July 2016
Event	16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 - Barcelona, Spain. Duration: 12 Dec 2016 → 15 Dec 2016

Publication series

Name	IEEE International Conference on Data Mining Workshops, ICDMW
Volume	0
ISSN (Print)	2375-9232
ISSN (electronic)	2375-9259

Abstract

Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.
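
The online-offline split described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation. It assumes that each statistical summary is an additive tuple (count, linear sum, squared sum), that a point joins the nearest summary when it lies within a fixed radius (a simplification chosen for this sketch), and that the offline phase is a weighted k-means over the summary centroids. All names (MicroCluster, online_update, offline_clusters, radius) are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


@dataclass
class MicroCluster:
    """Additive summary of the points absorbed so far (hypothetical name)."""
    n: int
    linear_sum: List[float]
    squared_sum: List[float]

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "MicroCluster":
        return cls(1, list(p), [x * x for x in p])

    def absorb(self, p: Sequence[float]) -> None:
        # The summary is additive, so one point updates it in O(d).
        self.n += 1
        for i, x in enumerate(p):
            self.linear_sum[i] += x
            self.squared_sum[i] += x * x

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.linear_sum]


def online_update(summaries: List[MicroCluster], p: Sequence[float],
                  radius: float) -> None:
    """Online phase: absorb p into the nearest summary or open a new one.

    The fixed radius is an assumption of this sketch.
    """
    if summaries:
        nearest = min(summaries, key=lambda m: distance(m.centroid(), p))
        if distance(nearest.centroid(), p) <= radius:
            nearest.absorb(p)
            return
    summaries.append(MicroCluster.from_point(p))


def offline_clusters(summaries: List[MicroCluster], k: int,
                     iterations: int = 10) -> List[List[float]]:
    """Offline phase: weighted k-means over the summary centroids."""
    points = [(m.centroid(), m.n) for m in summaries]
    centers = [c for c, _ in points[:k]]  # deterministic seed for the sketch
    for _ in range(iterations):
        sums = [[0.0] * len(c) for c in centers]
        weights = [0] * len(centers)
        for c, w in points:
            j = min(range(len(centers)), key=lambda i: distance(centers[i], c))
            weights[j] += w
            for d, x in enumerate(c):
                sums[j][d] += w * x
        # Keep the old center when no summary was assigned to it.
        centers = [
            [s / weights[j] for s in sums[j]] if weights[j] else centers[j]
            for j in range(len(centers))
        ]
    return centers
```

In this sketch the stream is consumed by calling online_update once per arriving point, while offline_clusters can be run at any time on the current summaries without revisiting the raw data.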

ASJC Scopus subject areas

Computer Science (all)
Computer Science Applications
Computer Science (all)
Software

Cite this

Scalable Online-Offline Stream Clustering in Apache Spark. / Backhoff, Omar; Ntoutsi, Eirini.
Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. Eds. / Carlotta Domeniconi; Francesco Gullo; Francesco Bonchi; Josep Domingo-Ferrer; Ricardo Baeza-Yates; Zhi-Hua Zhou; Xindong Wu. IEEE Computer Society, 2016. pp. 37-44 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0).

Publication: Contribution to book/report/anthology/conference proceedings › Conference contribution › Research › Peer-reviewed

Backhoff, O & Ntoutsi, E 2016, Scalable Online-Offline Stream Clustering in Apache Spark. in C Domeniconi, F Gullo, F Bonchi, J Domingo-Ferrer, R Baeza-Yates, Z-H Zhou & X Wu (eds), Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016, 7836645, IEEE International Conference on Data Mining Workshops, ICDMW, vol. 0, IEEE Computer Society, pp. 37-44, 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016, Barcelona, Spain, 12 Dec 2016. https://doi.org/10.1109/ICDMW.2016.0014

Backhoff, O., & Ntoutsi, E. (2016). Scalable Online-Offline Stream Clustering in Apache Spark. In C. Domeniconi, F. Gullo, F. Bonchi, J. Domingo-Ferrer, R. Baeza-Yates, Z.-H. Zhou, & X. Wu (Eds.), Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 (pp. 37-44). Article 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0). IEEE Computer Society. https://doi.org/10.1109/ICDMW.2016.0014

Backhoff O, Ntoutsi E. Scalable Online-Offline Stream Clustering in Apache Spark. In Domeniconi C, Gullo F, Bonchi F, Domingo-Ferrer J, Baeza-Yates R, Zhou ZH, Wu X, editors, Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. IEEE Computer Society. 2016. p. 37-44. 7836645. (IEEE International Conference on Data Mining Workshops, ICDMW). doi: 10.1109/ICDMW.2016.0014

Backhoff, Omar ; Ntoutsi, Eirini. / Scalable Online-Offline Stream Clustering in Apache Spark. Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. Eds. / Carlotta Domeniconi ; Francesco Gullo ; Francesco Bonchi ; Josep Domingo-Ferrer ; Ricardo Baeza-Yates ; Zhi-Hua Zhou ; Xindong Wu. IEEE Computer Society, 2016. pp. 37-44 (IEEE International Conference on Data Mining Workshops, ICDMW).

Download

@inproceedings{a0a83325c37d4bad9ca2887826fbdc08,

title = "Scalable Online-Offline Stream Clustering in Apache Spark",

abstract = "Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.",

keywords = "Apache Spark, Big data streams, CluStream, Stream clustering, Stream mining",

author = "Omar Backhoff and Eirini Ntoutsi",

year = "2016",

month = jul,

day = "2",

doi = "10.1109/ICDMW.2016.0014",

language = "English",

series = "IEEE International Conference on Data Mining Workshops, ICDMW",

publisher = "IEEE Computer Society",

pages = "37--44",

editor = "Carlotta Domeniconi and Francesco Gullo and Francesco Bonchi and Josep Domingo-Ferrer and Ricardo Baeza-Yates and Zhi-Hua Zhou and Xindong Wu",

booktitle = "Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016",

address = "United States",

note = "16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 ; Conference date: 12-12-2016 Through 15-12-2016",

}

Download

TY - GEN

T1 - Scalable Online-Offline Stream Clustering in Apache Spark

AU - Backhoff, Omar

AU - Ntoutsi, Eirini

PY - 2016/7/2

Y1 - 2016/7/2

N2 - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.

AB - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.

KW - Apache Spark

KW - Big data streams

KW - CluStream

KW - Stream clustering

KW - Stream mining

UR - http://www.scopus.com/inward/record.url?scp=85015208467&partnerID=8YFLogxK

U2 - 10.1109/ICDMW.2016.0014

DO - 10.1109/ICDMW.2016.0014

M3 - Conference contribution

AN - SCOPUS:85015208467

T3 - IEEE International Conference on Data Mining Workshops, ICDMW

SP - 37

EP - 44

BT - Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016

A2 - Domeniconi, Carlotta

A2 - Gullo, Francesco

A2 - Bonchi, Francesco

A2 - Domingo-Ferrer, Josep

A2 - Baeza-Yates, Ricardo

A2 - Zhou, Zhi-Hua

A2 - Wu, Xindong

PB - IEEE Computer Society

T2 - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016

Y2 - 12 December 2016 through 15 December 2016

ER -
