Scalable Online-Offline Stream Clustering in Apache Spark

Omar Backhoff; Eirini Ntoutsi

doi:10.1109/ICDMW.2016.0014

Details

Original language	English
Title of host publication	Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
Editors	Carlotta Domeniconi, Francesco Gullo, Francesco Bonchi, Josep Domingo-Ferrer, Ricardo Baeza-Yates, Zhi-Hua Zhou, Xindong Wu
Publisher	IEEE Computer Society
Pages	37-44
Number of pages	8
ISBN (electronic)	9781509054725
Publication status	Published - 2 Jul 2016
Event	16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 - Barcelona, Spain
Duration	12 Dec 2016 → 15 Dec 2016

Publication series

Name	IEEE International Conference on Data Mining Workshops, ICDMW
Volume	0
ISSN (Print)	2375-9232
ISSN (electronic)	2375-9259

Abstract

Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.
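
The online-offline process described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' Spark implementation and it simplifies CluStream: the summaries here hold only a count, a linear sum and a squared sum per dimension (timestamps are left out), and the names `MicroCluster`, `online_phase`, `offline_phase`, `radius` and `max_micro_clusters` are hypothetical. When the micro-cluster budget is exhausted, this sketch simply absorbs the point into the nearest summary, which is an assumption made for brevity rather than the published rule.

```python
import random


class MicroCluster:
    """Additive summary of absorbed points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def online_phase(stream, max_micro_clusters, radius):
    """Summarize the stream one point at a time into micro-clusters."""
    micro = []
    for p in stream:
        if micro:
            nearest = min(micro, key=lambda m: sq_dist(m.centroid(), p))
            close = sq_dist(nearest.centroid(), p) <= radius ** 2
            # Assumption: once the budget is used up, absorb instead of opening a new summary.
            if close or len(micro) >= max_micro_clusters:
                nearest.absorb(p)
                continue
        micro.append(MicroCluster(p))
    return micro


def offline_phase(micro, k, iterations=20, seed=0):
    """Weighted k-means over micro-cluster centroids (weight = point count)."""
    rng = random.Random(seed)
    pts = [(m.centroid(), m.n) for m in micro]
    centers = [list(c) for c, _ in rng.sample(pts, k)]
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in range(k)]
        weights = [0] * k
        for c, w in pts:
            j = min(range(k), key=lambda i: sq_dist(centers[i], c))
            weights[j] += w
            for d, x in enumerate(c):
                sums[j][d] += w * x
        for j in range(k):
            if weights[j] > 0:  # keep the old center if nothing was assigned
                centers[j] = [s / weights[j] for s in sums[j]]
    return centers


stream = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0), (0.0, 0.1), (10.0, 10.1)]
micro = online_phase(stream, max_micro_clusters=10, radius=1.0)
centers = offline_phase(micro, k=2)
```

Because the summaries are additive, the offline step only ever reads the micro-clusters and never revisits the raw stream, which is the property that makes the two phases separable.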

Keywords

Apache Spark, Big data streams, CluStream, Stream clustering, Stream mining

ASJC Scopus subject areas

Computer Science(all)
Computer Science Applications
Software

Cite this

Scalable Online-Offline Stream Clustering in Apache Spark. / Backhoff, Omar; Ntoutsi, Eirini.
Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. ed. / Carlotta Domeniconi; Francesco Gullo; Francesco Bonchi; Josep Domingo-Ferrer; Ricardo Baeza-Yates; Zhi-Hua Zhou; Xindong Wu. IEEE Computer Society, 2016. p. 37-44 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Backhoff, O & Ntoutsi, E 2016, Scalable Online-Offline Stream Clustering in Apache Spark. in C Domeniconi, F Gullo, F Bonchi, J Domingo-Ferrer, R Baeza-Yates, Z-H Zhou & X Wu (eds), Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016, 7836645, IEEE International Conference on Data Mining Workshops, ICDMW, vol. 0, IEEE Computer Society, pp. 37-44, 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016, Barcelona, Spain, 12 Dec 2016. https://doi.org/10.1109/ICDMW.2016.0014

Backhoff, O., & Ntoutsi, E. (2016). Scalable Online-Offline Stream Clustering in Apache Spark. In C. Domeniconi, F. Gullo, F. Bonchi, J. Domingo-Ferrer, R. Baeza-Yates, Z.-H. Zhou, & X. Wu (Eds.), Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 (pp. 37-44). Article 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0). IEEE Computer Society. https://doi.org/10.1109/ICDMW.2016.0014

Backhoff O, Ntoutsi E. Scalable Online-Offline Stream Clustering in Apache Spark. In Domeniconi C, Gullo F, Bonchi F, Domingo-Ferrer J, Baeza-Yates R, Zhou ZH, Wu X, editors, Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. IEEE Computer Society. 2016. p. 37-44. 7836645. (IEEE International Conference on Data Mining Workshops, ICDMW). doi: 10.1109/ICDMW.2016.0014

Backhoff, Omar ; Ntoutsi, Eirini. / Scalable Online-Offline Stream Clustering in Apache Spark. Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. editor / Carlotta Domeniconi ; Francesco Gullo ; Francesco Bonchi ; Josep Domingo-Ferrer ; Ricardo Baeza-Yates ; Zhi-Hua Zhou ; Xindong Wu. IEEE Computer Society, 2016. pp. 37-44 (IEEE International Conference on Data Mining Workshops, ICDMW).

@inproceedings{a0a83325c37d4bad9ca2887826fbdc08,

title = "Scalable Online-Offline Stream Clustering in Apache Spark",

abstract = "Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.",

keywords = "Apache Spark, Big data streams, CluStream, Stream clustering, Stream mining",

author = "Omar Backhoff and Eirini Ntoutsi",

year = "2016",

month = jul,

day = "2",

doi = "10.1109/ICDMW.2016.0014",

language = "English",

series = "IEEE International Conference on Data Mining Workshops, ICDMW",

publisher = "IEEE Computer Society",

pages = "37--44",

editor = "Carlotta Domeniconi and Francesco Gullo and Francesco Bonchi and Josep Domingo-Ferrer and Ricardo Baeza-Yates and Zhi-Hua Zhou and Xindong Wu",

booktitle = "Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016",

address = "United States",

note = "16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 ; Conference date: 12-12-2016 Through 15-12-2016",

}

TY - GEN

T1 - Scalable Online-Offline Stream Clustering in Apache Spark

AU - Backhoff, Omar

AU - Ntoutsi, Eirini

PY - 2016/7/2

Y1 - 2016/7/2

N2 - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.

AB - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we combine both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely Apache Spark. CluStream is one of the most popular approaches for stream clustering and the one that introduced the online-offline mining process: the online phase summarizes the stream through statistical summaries, and the offline phase generates the final clusters from these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach while being more efficient.

KW - Apache Spark

KW - Big data streams

KW - CluStream

KW - Stream clustering

KW - Stream mining

UR - http://www.scopus.com/inward/record.url?scp=85015208467&partnerID=8YFLogxK

U2 - 10.1109/ICDMW.2016.0014

DO - 10.1109/ICDMW.2016.0014

M3 - Conference contribution

AN - SCOPUS:85015208467

T3 - IEEE International Conference on Data Mining Workshops, ICDMW

SP - 37

EP - 44

BT - Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016

A2 - Domeniconi, Carlotta

A2 - Gullo, Francesco

A2 - Bonchi, Francesco

A2 - Domingo-Ferrer, Josep

A2 - Baeza-Yates, Ricardo

A2 - Zhou, Zhi-Hua

A2 - Wu, Xindong

PB - IEEE Computer Society

T2 - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016

Y2 - 12 December 2016 through 15 December 2016

ER -
