Details
| Original language | English |
| --- | --- |
| Title of host publication | DICG 2023 |
| Subtitle of host publication | Proceedings of the 2023 4th International Workshop on Distributed Infrastructure for Common Good, Part of MIDDLEWARE 2023 |
| Pages | 7-12 |
| Number of pages | 6 |
| ISBN (electronic) | 9798400704581 |
| Publication status | Published - 19 Jan 2024 |
| Event | 4th International Workshop on Distributed Infrastructure for Common Good, DICG 2023 - Bologna, Italy. Duration: 11 Dec 2023 → 15 Dec 2023 |
Abstract
Machine learning is becoming a key technology to make systems smarter and more powerful. Unfortunately, training large and capable ML models is resource-intensive and requires high operational skill. Serverless computing is an emerging paradigm for structuring applications to benefit from on-demand computing resources and achieve horizontal scalability while making resources easier to consume. As such, it is an ideal substrate for the resource-intensive and often ad hoc task of training deep learning models, and it has strong potential to democratize access to ML techniques. However, the design of serverless platforms makes deep learning training difficult to translate efficiently to this new world. Apart from the intrinsic communication overhead (serverless functions are stateless), serverless training is limited by reduced access to GPUs, which is especially problematic for deep learning workloads, which are notoriously demanding. To address these limitations, we present KubeML, a purpose-built deep learning system for serverless computing. KubeML fully embraces GPU acceleration while reducing the inherent communication overhead of deep learning workloads to match the limited capabilities of the serverless paradigm. In our experiments, we outperform TensorFlow for smaller local batches, reach a 3.98x faster time-to-accuracy in these cases, and maintain a 2.02x speedup for commonly benchmarked machine learning models such as ResNet34.
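The communication overhead the abstract refers to follows directly from statelessness: a serverless function cannot keep model parameters between invocations, so every training step must fetch the current model from external state and ship gradients back. The sketch below is a minimal, hypothetical illustration of that pattern (synchronous data-parallel SGD against a shared parameter store); it is not KubeML's code, and all names in it are invented for illustration.

```python
# Hypothetical sketch (not KubeML's implementation): why stateless serverless
# functions pay a per-step communication cost in data-parallel training.
import numpy as np

class ParamStore:
    """Stands in for the external state (e.g. a key-value store) that
    stateless functions must use: parameters in, gradients out."""
    def __init__(self, dim):
        self.params = np.zeros(dim)
        self.grads = []

def worker_step(store, x, y):
    # One stateless invocation: it must *download* the current parameters,
    # compute a local gradient, and *upload* it. Both transfers recur on
    # every step, which is the overhead the abstract refers to.
    w = store.params.copy()                  # download full model
    grad = 2 * x.T @ (x @ w - y) / len(y)    # least-squares gradient
    store.grads.append(grad)                 # upload full gradient

def coordinator_step(store, lr=0.1):
    # Average the workers' gradients and apply one update, as in
    # synchronous data-parallel SGD.
    store.params -= lr * np.mean(store.grads, axis=0)
    store.grads.clear()

rng = np.random.default_rng(0)
x, w_true = rng.normal(size=(256, 8)), rng.normal(size=8)
y = x @ w_true
store = ParamStore(dim=8)
for _ in range(200):
    for shard in np.split(np.arange(256), 4):   # 4 "function" invocations
        worker_step(store, x[shard], y[shard])
    coordinator_step(store)
print(np.allclose(store.params, w_true, atol=1e-2))  # True
```

Under this reading, shrinking what crosses the store boundary per step (and keeping GPUs in the loop) is exactly the lever a system like KubeML has for closing the gap with server-based training.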
Keywords
- GPU acceleration
- Machine Learning
- Serverless
ASJC Scopus subject areas
- Computer Science(all)
- Information Systems
- Software
Cite this
- Standard
- Harvard
- APA
- Vancouver
- BibTeX
- RIS
Petrescu, S., Martinez, D. A., & Rellermeyer, J. S. (2024). Toward Competitive Serverless Deep Learning. In DICG 2023: Proceedings of the 2023 4th International Workshop on Distributed Infrastructure for Common Good, Part of MIDDLEWARE 2023 (pp. 7-12). https://doi.org/10.1145/3631310.3633489
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Toward Competitive Serverless Deep Learning
AU - Petrescu, Stefan
AU - Martinez, Diego Albo
AU - Rellermeyer, Jan S.
PY - 2024/1/19
Y1 - 2024/1/19
N2 - Machine learning is becoming a key technology to make systems smarter and more powerful. Unfortunately, training large and capable ML models is resource-intensive and requires high operational skill. Serverless computing is an emerging paradigm for structuring applications to benefit from on-demand computing resources and achieve horizontal scalability while making resources easier to consume. As such, it is an ideal substrate for the resource-intensive and often ad hoc task of training deep learning models, and it has strong potential to democratize access to ML techniques. However, the design of serverless platforms makes deep learning training difficult to translate efficiently to this new world. Apart from the intrinsic communication overhead (serverless functions are stateless), serverless training is limited by reduced access to GPUs, which is especially problematic for deep learning workloads, which are notoriously demanding. To address these limitations, we present KubeML, a purpose-built deep learning system for serverless computing. KubeML fully embraces GPU acceleration while reducing the inherent communication overhead of deep learning workloads to match the limited capabilities of the serverless paradigm. In our experiments, we outperform TensorFlow for smaller local batches, reach a 3.98x faster time-to-accuracy in these cases, and maintain a 2.02x speedup for commonly benchmarked machine learning models such as ResNet34.
AB - Machine learning is becoming a key technology to make systems smarter and more powerful. Unfortunately, training large and capable ML models is resource-intensive and requires high operational skill. Serverless computing is an emerging paradigm for structuring applications to benefit from on-demand computing resources and achieve horizontal scalability while making resources easier to consume. As such, it is an ideal substrate for the resource-intensive and often ad hoc task of training deep learning models, and it has strong potential to democratize access to ML techniques. However, the design of serverless platforms makes deep learning training difficult to translate efficiently to this new world. Apart from the intrinsic communication overhead (serverless functions are stateless), serverless training is limited by reduced access to GPUs, which is especially problematic for deep learning workloads, which are notoriously demanding. To address these limitations, we present KubeML, a purpose-built deep learning system for serverless computing. KubeML fully embraces GPU acceleration while reducing the inherent communication overhead of deep learning workloads to match the limited capabilities of the serverless paradigm. In our experiments, we outperform TensorFlow for smaller local batches, reach a 3.98x faster time-to-accuracy in these cases, and maintain a 2.02x speedup for commonly benchmarked machine learning models such as ResNet34.
KW - GPU acceleration
KW - Machine Learning
KW - Serverless
UR - http://www.scopus.com/inward/record.url?scp=85185833424&partnerID=8YFLogxK
U2 - 10.1145/3631310.3633489
DO - 10.1145/3631310.3633489
M3 - Conference contribution
AN - SCOPUS:85185833424
SP - 7
EP - 12
BT - DICG 2023
T2 - 4th International Workshop on Distributed Infrastructure for Common Good, DICG 2023
Y2 - 11 December 2023 through 15 December 2023
ER -