Scalable Online-Offline Stream Clustering in Apache Spark

Research output: Chapter in book/report/conference proceeding; Conference contribution; Research; peer review

Authors

  • Omar Backhoff
  • Eirini Ntoutsi

Research Organisations

External Research Organisations

  • Technical University of Munich (TUM)

Details

Original language: English
Title of host publication: Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
Editors: Carlotta Domeniconi, Francesco Gullo, Francesco Bonchi, Josep Domingo-Ferrer, Ricardo Baeza-Yates, Zhi-Hua Zhou, Xindong Wu
Publisher: IEEE Computer Society
Pages: 37-44
Number of pages: 8
ISBN (electronic): 9781509054725
Publication status: Published - 2 Jul 2016
Event: 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 - Barcelona, Spain
Duration: 12 Dec 2016 to 15 Dec 2016

Publication series

Name: IEEE International Conference on Data Mining Workshops, ICDMW
Volume: 0
ISSN (Print): 2375-9232
ISSN (electronic): 2375-9259

Abstract

Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we incorporate both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely, Apache Spark. CluStream is one of the most popular clustering approaches for stream clustering and the one that introduced the online-offline mining process: The online phase summarizes the stream through statistical summaries and the offline phase generates the final clusters upon these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach, while it is more efficient.
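
As an illustration of the online-offline process described in the abstract, the following is a minimal single-machine sketch in Python. It is not the authors' Spark implementation. It assumes that each statistical summary stores a point count, a linear sum, and a squared sum, and that the offline phase is a weighted k-means over the summary centroids. The names MicroCluster, online_update, and offline_cluster, and the parameters radius and max_summaries, are hypothetical. Once the summary budget is reached, this sketch simply absorbs a point into the nearest summary rather than merging or deleting summaries.

```python
import math
import random


class MicroCluster:
    """Additive summary of absorbed points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]

    def add(self, point):
        # Summaries are additive, so absorbing a point is a constant-time update.
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def centroid(self):
        return [s / self.n for s in self.ls]


def dist(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def online_update(summaries, point, max_summaries, radius):
    """Online phase: absorb the point into the nearest summary or open a new one.

    Assumption: when the budget is exhausted, the point always goes to the
    nearest summary (a simplification made for this sketch).
    """
    if summaries:
        nearest = min(summaries, key=lambda mc: dist(mc.centroid(), point))
        close_enough = dist(nearest.centroid(), point) <= radius
        if close_enough or len(summaries) >= max_summaries:
            nearest.add(point)
            return
    summaries.append(MicroCluster(point))


def offline_cluster(summaries, k, iterations=10, seed=0):
    """Offline phase: weighted k-means over summary centroids (requires k <= len(summaries))."""
    rng = random.Random(seed)
    centroids = [mc.centroid() for mc in summaries]
    weights = [mc.n for mc in summaries]
    centers = rng.sample(centroids, k)
    dims = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dims for _ in range(k)]
        totals = [0] * k
        for c, w in zip(centroids, weights):
            # Assign each summary to its nearest center, weighted by its point count.
            j = min(range(k), key=lambda i: dist(centers[i], c))
            totals[j] += w
            for d, x in enumerate(c):
                sums[j][d] += w * x
        # Keep the previous center if no summary was assigned to it.
        centers = [
            [s / totals[j] for s in sums[j]] if totals[j] else centers[j]
            for j in range(k)
        ]
    return centers
```

In this sketch the raw points are never stored: the offline phase sees only the summaries, which is what lets the final clustering run on demand over a compact representation of the stream.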

Keywords

    Apache Spark, Big data streams, CluStream, Stream clustering, Stream mining

Cite this

Scalable Online-Offline Stream Clustering in Apache Spark. / Backhoff, Omar; Ntoutsi, Eirini.
Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. ed. / Carlotta Domeniconi; Francesco Gullo; Francesco Bonchi; Josep Domingo-Ferrer; Ricardo Baeza-Yates; Zhi-Hua Zhou; Xindong Wu. IEEE Computer Society, 2016. p. 37-44, 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0).

Backhoff, O & Ntoutsi, E 2016, Scalable Online-Offline Stream Clustering in Apache Spark. in C Domeniconi, F Gullo, F Bonchi, J Domingo-Ferrer, R Baeza-Yates, Z-H Zhou & X Wu (eds), Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016, 7836645, IEEE International Conference on Data Mining Workshops, ICDMW, vol. 0, IEEE Computer Society, pp. 37-44, 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016, Barcelona, Spain, 12 Dec 2016. https://doi.org/10.1109/ICDMW.2016.0014
Backhoff, O., & Ntoutsi, E. (2016). Scalable Online-Offline Stream Clustering in Apache Spark. In C. Domeniconi, F. Gullo, F. Bonchi, J. Domingo-Ferrer, R. Baeza-Yates, Z.-H. Zhou, & X. Wu (Eds.), Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 (pp. 37-44). Article 7836645 (IEEE International Conference on Data Mining Workshops, ICDMW; Vol. 0). IEEE Computer Society. https://doi.org/10.1109/ICDMW.2016.0014
Backhoff O, Ntoutsi E. Scalable Online-Offline Stream Clustering in Apache Spark. In Domeniconi C, Gullo F, Bonchi F, Domingo-Ferrer J, Baeza-Yates R, Zhou ZH, Wu X, editors, Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. IEEE Computer Society. 2016. p. 37-44. 7836645. (IEEE International Conference on Data Mining Workshops, ICDMW). doi: 10.1109/ICDMW.2016.0014
Backhoff, Omar ; Ntoutsi, Eirini. / Scalable Online-Offline Stream Clustering in Apache Spark. Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016. editor / Carlotta Domeniconi ; Francesco Gullo ; Francesco Bonchi ; Josep Domingo-Ferrer ; Ricardo Baeza-Yates ; Zhi-Hua Zhou ; Xindong Wu. IEEE Computer Society, 2016. pp. 37-44 (IEEE International Conference on Data Mining Workshops, ICDMW).
@inproceedings{a0a83325c37d4bad9ca2887826fbdc08,
title = "Scalable Online-Offline Stream Clustering in Apache Spark",
abstract = "Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we incorporate both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely, Apache Spark. CluStream is one of the most popular clustering approaches for stream clustering and the one that introduced the online-offline mining process: The online phase summarizes the stream through statistical summaries and the offline phase generates the final clusters upon these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach, while it is more efficient.",
keywords = "Apache Spark, Big data streams, CluStream, Stream clustering, Stream mining",
author = "Omar Backhoff and Eirini Ntoutsi",
year = "2016",
month = jul,
day = "2",
doi = "10.1109/ICDMW.2016.0014",
language = "English",
series = "IEEE International Conference on Data Mining Workshops, ICDMW",
publisher = "IEEE Computer Society",
pages = "37--44",
editor = "Carlotta Domeniconi and Francesco Gullo and Francesco Bonchi and Josep Domingo-Ferrer and Ricardo Baeza-Yates and Zhi-Hua Zhou and Xindong Wu",
booktitle = "Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016",
address = "United States",
note = "16th IEEE International Conference on Data Mining Workshops, ICDMW 2016 ; Conference date: 12-12-2016 Through 15-12-2016",

}

TY  - GEN
T1  - Scalable Online-Offline Stream Clustering in Apache Spark
AU  - Backhoff, Omar
AU  - Ntoutsi, Eirini
PY  - 2016/7/2
Y1  - 2016/7/2
AB  - Two of the most popular approaches for dealing with big data are distributed computing and stream mining. In this paper, we incorporate both approaches in order to bring a competitive stream clustering algorithm, namely CluStream, into a modern framework for distributed computing, namely, Apache Spark. CluStream is one of the most popular clustering approaches for stream clustering and the one that introduced the online-offline mining process: The online phase summarizes the stream through statistical summaries and the offline phase generates the final clusters upon these summaries. We obtain a scalable stream clustering method which is open source and can be used by the Apache Spark community. Our experiments show that our adaptation achieves similar quality to the original approach, while it is more efficient.
KW  - Apache Spark
KW  - Big data streams
KW  - CluStream
KW  - Stream clustering
KW  - Stream mining
UR  - http://www.scopus.com/inward/record.url?scp=85015208467&partnerID=8YFLogxK
U2  - 10.1109/ICDMW.2016.0014
DO  - 10.1109/ICDMW.2016.0014
M3  - Conference contribution
AN  - SCOPUS:85015208467
T3  - IEEE International Conference on Data Mining Workshops, ICDMW
SP  - 37
EP  - 44
BT  - Proceedings - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
A2  - Domeniconi, Carlotta
A2  - Gullo, Francesco
A2  - Bonchi, Francesco
A2  - Domingo-Ferrer, Josep
A2  - Baeza-Yates, Ricardo
A2  - Zhou, Zhi-Hua
A2  - Wu, Xindong
PB  - IEEE Computer Society
T2  - 16th IEEE International Conference on Data Mining Workshops, ICDMW 2016
Y2  - 12 December 2016 through 15 December 2016
ER  -