In-Memory Indexed Caching for Distributed Data Processing

Publication: Contribution to book/report/anthology/conference proceedings > Conference paper > Research > Peer-reviewed

External organisations

  • Leiden University
  • Databricks
  • University of California at Berkeley
  • Delft University of Technology
  • Centrum voor Wiskunde en Informatica

Details

Original language: English
Title of host publication: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 104-114
Number of pages: 11
ISBN (electronic): 9781665481069
Publication status: Published - 2022
Externally published: Yes
Event: 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France
Duration: 30 May 2022 to 3 June 2022

Publication series

Name: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Abstract

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimal effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.
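
As a rough illustration of the ideas named in the abstract (an in-memory cache with an index for lookups and joins, plus appends that readers see through versions), the following minimal single-process Python sketch may help. It is not the authors' implementation and does not use the Indexed DataFrame or Spark APIs; the class `IndexedTable`, its methods, and the watermark-based versioning are hypothetical names and simplifications chosen here, and the versioning is only a simplified stand-in for multi-version concurrency control.

```python
from collections import defaultdict


class IndexedTable:
    """Hypothetical append-only in-memory table with a hash index on one column."""

    def __init__(self, key_column):
        self.key_column = key_column
        self.rows = []                  # append-only row storage
        self.index = defaultdict(list)  # key -> row positions, in ascending order
        self.watermarks = [0]           # watermarks[v] = number of rows visible at version v

    def append(self, new_rows):
        """Append a batch of dict rows and return the new version number."""
        for row in new_rows:
            self.index[row[self.key_column]].append(len(self.rows))
            self.rows.append(row)
        self.watermarks.append(len(self.rows))
        return len(self.watermarks) - 1

    def lookup(self, key, version=None):
        """Return rows matching key as of the given version (latest by default)."""
        limit = self.watermarks[-1 if version is None else version]
        # Existing rows are never modified, so a reader pinned to an older
        # version simply ignores row positions at or beyond its watermark.
        return [self.rows[i] for i in self.index.get(key, []) if i < limit]

    def join(self, probe_rows, probe_column, version=None):
        """Index-based join: probe the index once per incoming row."""
        return [
            {**match, **probe}
            for probe in probe_rows
            for match in self.lookup(probe[probe_column], version)
        ]


table = IndexedTable("id")
v1 = table.append([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
v2 = table.append([{"id": 1, "name": "c"}])
old_view = table.lookup(1, version=v1)  # sees only the first batch
new_view = table.lookup(1)              # sees both batches
joined = table.join([{"id": 2, "score": 9}], "id")
```

In this sketch a lookup or join touches only the rows listed under the probed key instead of scanning the whole table, and a reader holding an older version number keeps a stable view while later batches are appended.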

Cite

In-Memory Indexed Caching for Distributed Data Processing. / Uta, Alexandru; Ghit, Bogdan; Dave, Ankur et al.
Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 104-114 (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022).

Uta, A, Ghit, B, Dave, A, Rellermeyer, J & Boncz, P 2022, In-Memory Indexed Caching for Distributed Data Processing. in Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022, Institute of Electrical and Electronics Engineers Inc., pp. 104-114, 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022, Virtual, Online, France, 30 May 2022. https://doi.org/10.48550/arXiv.2112.06280, https://doi.org/10.1109/IPDPS53621.2022.00019
Uta, A., Ghit, B., Dave, A., Rellermeyer, J., & Boncz, P. (2022). In-Memory Indexed Caching for Distributed Data Processing. In Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 (pp. 104-114). (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.48550/arXiv.2112.06280, https://doi.org/10.1109/IPDPS53621.2022.00019
Uta A, Ghit B, Dave A, Rellermeyer J, Boncz P. In-Memory Indexed Caching for Distributed Data Processing. in Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc. 2022. pp. 104-114. (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022). doi: 10.48550/arXiv.2112.06280, 10.1109/IPDPS53621.2022.00019
Uta, Alexandru ; Ghit, Bogdan ; Dave, Ankur et al. / In-Memory Indexed Caching for Distributed Data Processing. Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 104-114 (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022).
@inproceedings{773e3387a89043588cccad8ef5cec87e,
title = "In-Memory Indexed Caching for Distributed Data Processing",
abstract = "Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimal effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.",
keywords = "cs.DC",
author = "Alexandru Uta and Bogdan Ghit and Ankur Dave and Jan Rellermeyer and Peter Boncz",
note = "Funding Information: Part of this work was conducted while the first author was an intern at Databricks. We would like to thank Herman van Hovell and Adrian Ionescu for their suggestions on the implementation of the project, as well as Matei Zaharia for his valuable comments on the manuscript of the paper. The work in this article was in part supported by The Dutch National Science Foundation NWO Veni grant VI.202.195. ; 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 ; Conference date: 30-05-2022 Through 03-06-2022",
year = "2022",
doi = "10.48550/arXiv.2112.06280",
language = "English",
series = "Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "104--114",
booktitle = "Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022",
address = "United States",

}

TY - GEN

T1 - In-Memory Indexed Caching for Distributed Data Processing

AU - Uta, Alexandru

AU - Ghit, Bogdan

AU - Dave, Ankur

AU - Rellermeyer, Jan

AU - Boncz, Peter

N1 - Funding Information: Part of this work was conducted while the first author was an intern at Databricks. We would like to thank Herman van Hovell and Adrian Ionescu for their suggestions on the implementation of the project, as well as Matei Zaharia for his valuable comments on the manuscript of the paper. The work in this article was in part supported by The Dutch National Science Foundation NWO Veni grant VI.202.195.

PY - 2022

Y1 - 2022

AB - Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimal effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution when compared to a non-indexed dataframe, incurring modest memory overhead.

KW - cs.DC

UR - http://www.scopus.com/inward/record.url?scp=85136337448&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2112.06280

DO - 10.48550/arXiv.2112.06280

M3 - Conference contribution

T3 - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

SP - 104

EP - 114

BT - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022

Y2 - 30 May 2022 through 3 June 2022

ER -
