Details
| Original language | English |
| --- | --- |
| Title of host publication | DICG 2023 |
| Subtitle of host publication | Proceedings of the 2023 4th International Workshop on Distributed Infrastructure for Common Good, Part of MIDDLEWARE 2023 |
| Pages | 7-12 |
| Number of pages | 6 |
| ISBN (electronic) | 9798400704581 |
| Publication status | Published - 19 Jan 2024 |
| Event | 4th International Workshop on Distributed Infrastructure for Common Good, DICG 2023 - Bologna, Italy. Duration: 11 Dec 2023 → 15 Dec 2023 |
Abstract
Machine learning is becoming a key technology to make systems smarter and more powerful. Unfortunately, training large and capable ML models is resource-intensive and requires high operational skill. Serverless computing is an emerging paradigm for structuring applications to benefit from on-demand computing resources and achieve horizontal scalability while making resources easier to consume. As such, it is an ideal substrate for the resource-intensive and often ad hoc task of training deep learning models, and it has strong potential to democratize access to ML techniques. However, the design of serverless platforms makes deep learning training difficult to translate efficiently to this new world. Apart from the intrinsic communication overhead (serverless functions are stateless), serverless training is limited by reduced access to GPUs, which is especially problematic for deep learning workloads, which are notoriously demanding. To address these limitations, we present KubeML, a purpose-built deep learning system for serverless computing. KubeML fully embraces GPU acceleration while reducing the inherent communication overhead of deep learning workloads to match the limited capabilities of the serverless paradigm. In our experiments, we outperform TensorFlow for smaller local batches, reach a 3.98x faster time-to-accuracy in these cases, and maintain a 2.02x speedup for commonly benchmarked machine learning models such as ResNet34.
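The communication overhead the abstract refers to follows directly from statelessness: a serverless function cannot keep model parameters between invocations, so every training step must fetch the current model from external state and ship gradients back. The sketch below is a minimal, hypothetical illustration of that pattern (synchronous data-parallel SGD against a shared parameter store); it is not KubeML's code, and all names in it are invented for illustration.

```python
# Hypothetical sketch (not KubeML's implementation): why stateless serverless
# functions pay a per-step communication cost in data-parallel training.
import numpy as np

class ParamStore:
    """Stands in for the external state (e.g. a key-value store) that
    stateless functions must use: parameters in, gradients out."""
    def __init__(self, dim):
        self.params = np.zeros(dim)
        self.grads = []

def worker_step(store, x, y):
    # One stateless invocation: it must *download* the current parameters,
    # compute a local gradient, and *upload* it. Both transfers recur on
    # every step, which is the overhead the abstract refers to.
    w = store.params.copy()                  # download full model
    grad = 2 * x.T @ (x @ w - y) / len(y)    # least-squares gradient
    store.grads.append(grad)                 # upload full gradient

def coordinator_step(store, lr=0.1):
    # Average the workers' gradients and apply one update, as in
    # synchronous data-parallel SGD.
    store.params -= lr * np.mean(store.grads, axis=0)
    store.grads.clear()

rng = np.random.default_rng(0)
x, w_true = rng.normal(size=(256, 8)), rng.normal(size=8)
y = x @ w_true
store = ParamStore(dim=8)
for _ in range(200):
    for shard in np.split(np.arange(256), 4):   # 4 "function" invocations
        worker_step(store, x[shard], y[shard])
    coordinator_step(store)
print(np.allclose(store.params, w_true, atol=1e-2))  # True
```

Under this reading, shrinking what crosses the store boundary per step (and keeping GPUs in the loop) is exactly the lever a system like KubeML has for closing the gap with server-based training.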
Keywords
- GPU acceleration
- Machine Learning
- Serverless
ASJC Scopus subject areas
- Computer Science(all)
- Information Systems
- Software
Cite this
- Standard
- Harvard
- APA
- Vancouver
- BibTeX
- RIS
Petrescu, S., Martinez, D. A., & Rellermeyer, J. S. (2024). Toward Competitive Serverless Deep Learning. In DICG 2023: Proceedings of the 2023 4th International Workshop on Distributed Infrastructure for Common Good, Part of MIDDLEWARE 2023 (pp. 7-12). https://doi.org/10.1145/3631310.3633489
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Toward Competitive Serverless Deep Learning
AU - Petrescu, Stefan
AU - Martinez, Diego Albo
AU - Rellermeyer, Jan S.
PY - 2024/1/19
Y1 - 2024/1/19
N2 - Machine learning is becoming a key technology to make systems smarter and more powerful. Unfortunately, training large and capable ML models is resource-intensive and requires high operational skill. Serverless computing is an emerging paradigm for structuring applications to benefit from on-demand computing resources and achieve horizontal scalability while making resources easier to consume. As such, it is an ideal substrate for the resource-intensive and often ad hoc task of training deep learning models, and it has strong potential to democratize access to ML techniques. However, the design of serverless platforms makes deep learning training difficult to translate efficiently to this new world. Apart from the intrinsic communication overhead (serverless functions are stateless), serverless training is limited by reduced access to GPUs, which is especially problematic for deep learning workloads, which are notoriously demanding. To address these limitations, we present KubeML, a purpose-built deep learning system for serverless computing. KubeML fully embraces GPU acceleration while reducing the inherent communication overhead of deep learning workloads to match the limited capabilities of the serverless paradigm. In our experiments, we outperform TensorFlow for smaller local batches, reach a 3.98x faster time-to-accuracy in these cases, and maintain a 2.02x speedup for commonly benchmarked machine learning models such as ResNet34.
AB - Machine learning is becoming a key technology to make systems smarter and more powerful. Unfortunately, training large and capable ML models is resource-intensive and requires high operational skill. Serverless computing is an emerging paradigm for structuring applications to benefit from on-demand computing resources and achieve horizontal scalability while making resources easier to consume. As such, it is an ideal substrate for the resource-intensive and often ad hoc task of training deep learning models, and it has strong potential to democratize access to ML techniques. However, the design of serverless platforms makes deep learning training difficult to translate efficiently to this new world. Apart from the intrinsic communication overhead (serverless functions are stateless), serverless training is limited by reduced access to GPUs, which is especially problematic for deep learning workloads, which are notoriously demanding. To address these limitations, we present KubeML, a purpose-built deep learning system for serverless computing. KubeML fully embraces GPU acceleration while reducing the inherent communication overhead of deep learning workloads to match the limited capabilities of the serverless paradigm. In our experiments, we outperform TensorFlow for smaller local batches, reach a 3.98x faster time-to-accuracy in these cases, and maintain a 2.02x speedup for commonly benchmarked machine learning models such as ResNet34.
KW - GPU acceleration
KW - Machine Learning
KW - Serverless
UR - http://www.scopus.com/inward/record.url?scp=85185833424&partnerID=8YFLogxK
U2 - 10.1145/3631310.3633489
DO - 10.1145/3631310.3633489
M3 - Conference contribution
AN - SCOPUS:85185833424
SP - 7
EP - 12
BT - DICG 2023
T2 - 4th International Workshop on Distributed Infrastructure for Common Good, DICG 2023
Y2 - 11 December 2023 through 15 December 2023
ER -