Details
Original language | English |
---|---|
Title of host publication | Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 104-114 |
Number of pages | 11 |
ISBN (electronic) | 9781665481069 |
Publication status | Published - 2022 |
Externally published | Yes |
Event | 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022 - Virtual, Online, France. Duration: 30 May 2022 → 3 June 2022 |
Publication series
Name | Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 |
---|---|
Abstract
ASJC Scopus subject areas
- Computer Science (all)
  - Computer Networks and Communications
  - Hardware and Architecture
  - Computer Science Applications
Cite
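In-Memory Indexed Caching for Distributed Data Processing. / Uta, Alexandru; Ghit, Bogdan; Dave, Ankur; Rellermeyer, Jan; Boncz, Peter. In: Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022. Institute of Electrical and Electronics Engineers Inc., 2022. pp. 104-114 (Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022).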
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - In-Memory Indexed Caching for Distributed Data Processing
AU - Uta, Alexandru
AU - Ghit, Bogdan
AU - Dave, Ankur
AU - Rellermeyer, Jan
AU - Boncz, Peter
N1 - Funding Information: Part of this work was conducted while the first author was an intern at Databricks. We would like to thank Herman van Hovell and Adrian Ionescu for their suggestions on the implementation of the project, as well as Matei Zaharia for his valuable comments on the manuscript of the paper. The work in this article was supported in part by the Dutch National Science Foundation NWO Veni grant VI.202.195.
PY - 2022
Y1 - 2022
AB - Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de facto distributed data processing framework, Apache Spark, is poorly suited to modern cloud-based data-science workloads because of its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction with indexing capabilities for fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library that can be integrated with minimal effort into existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds up query execution compared to a non-indexed dataframe, while incurring modest memory overhead.
KW - cs.DC
UR - http://www.scopus.com/inward/record.url?scp=85136337448&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2112.06280
DO - 10.48550/arXiv.2112.06280
M3 - Conference contribution
T3 - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
SP - 104
EP - 114
BT - Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 36th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2022
Y2 - 30 May 2022 through 3 June 2022
ER -