In-Memory Indexed Caching for Distributed Data Processing

Alexandru Uta; Bogdan Ghit; Ankur Dave; Jan Rellermeyer; Peter Boncz

doi:10.48550/arXiv.2112.06280

Details

Original language	English
Title of host publication	Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	104-114
Number of pages	11
ISBN (electronic)	9781665481069
Publication status	Published - 2022
Externally published	Yes
Event	36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France. Duration: 30 May 2022 → 3 June 2022

Publication series

Name	Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Abstract

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.

ASJC Scopus subject areas

Computer Science (all)
Computer Networks and Communications
Computer Science (all)
Hardware and Architecture
Computer Science (all)
Computer Science Applications

Cite

In-Memory Indexed Caching for Distributed Data Processing. / Uta, Alexandru; Ghit, Bogdan; Dave, Ankur et al.
Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 104-114 (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022).

Publication: Chapter in book/report/conference proceeding › Conference contribution › Research › Peer-reviewed

Uta, A, Ghit, B, Dave, A, Rellermeyer, J & Boncz, P 2022, In-Memory Indexed Caching for Distributed Data Processing. in Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022, Institute of Electrical and Electronics Engineers Inc., pp. 104-114, 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022, Virtual, Online, France, 30 May 2022. https://doi.org/10.48550/arXiv.2112.06280, https://doi.org/10.1109/IPDPS53621.2022.00019

Uta, A., Ghit, B., Dave, A., Rellermeyer, J., & Boncz, P. (2022). In-Memory Indexed Caching for Distributed Data Processing. In Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 (pp. 104-114). (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.48550/arXiv.2112.06280, https://doi.org/10.1109/IPDPS53621.2022.00019

Uta A, Ghit B, Dave A, Rellermeyer J, Boncz P. In-Memory Indexed Caching for Distributed Data Processing. In Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc. 2022. p. 104-114. (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022). doi: 10.48550/arXiv.2112.06280, 10.1109/IPDPS53621.2022.00019

Uta, Alexandru ; Ghit, Bogdan ; Dave, Ankur et al. / In-Memory Indexed Caching for Distributed Data Processing. Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 104-114 (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022).


@inproceedings{773e3387a89043588cccad8ef5cec87e,

title = "In-Memory Indexed Caching for Distributed Data Processing",

abstract = "Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.",

keywords = "cs.DC",

author = "Alexandru Uta and Bogdan Ghit and Ankur Dave and Jan Rellermeyer and Peter Boncz",

note = "Funding Information: Part of this work was conducted while the first author was an intern at Databricks. We would like to thank Herman van Hovell and Adrian Ionescu for their suggestions on the implementation of the project, as well as Matei Zaharia for his valuable comments on the manuscript of the paper. The work in this article was in part supported by The Dutch National Science Foundation NWO Veni grant VI.202.195.; 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022; Conference date: 30-05-2022 through 03-06-2022",

year = "2022",

doi = "10.48550/arXiv.2112.06280",

language = "English",

series = "Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "104--114",

booktitle = "Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022",

address = "United States",

}


TY - GEN

T1 - In-Memory Indexed Caching for Distributed Data Processing

AU - Uta, Alexandru

AU - Ghit, Bogdan

AU - Dave, Ankur

AU - Rellermeyer, Jan

AU - Boncz, Peter

N1 - Funding Information: Part of this work was conducted while the first author was an intern at Databricks. We would like to thank Herman van Hovell and Adrian Ionescu for their suggestions on the implementation of the project, as well as Matei Zaharia for his valuable comments on the manuscript of the paper. The work in this article was in part supported by The Dutch National Science Foundation NWO Veni grant VI.202.195.

PY - 2022

Y1 - 2022

AB - Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.

KW - cs.DC

UR - http://www.scopus.com/inward/record.url?scp=85136337448&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2112.06280

DO - 10.48550/arXiv.2112.06280

M3 - Conference contribution

T3 - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

SP - 104

EP - 114

BT - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022

Y2 - 30 May 2022 through 3 June 2022

ER -

