Optimizing Machine Learning Workloads in Collaborative Environments

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Behrouz Derakhshan
  • Alireza Mahdiraji
  • Ziawasch Abedjan
  • Tilmann Rabl
  • Volker Markl

External Research Organisations

  • German Research Centre for Artificial Intelligence (DFKI)
  • Technische Universität Berlin
  • University of Potsdam

Details

Original language: English
Title of host publication: Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020
Pages: 1701-1716
Number of pages: 16
ISBN (electronic): 9781450367356
Publication status: Published - Jun 2020
Externally published: Yes

Abstract

Effective collaboration among data scientists results in high-quality and efficient machine learning (ML) workloads. In a collaborative environment, such as Kaggle or Google Colaboratory, users typically re-execute or modify published scripts to recreate or improve the result. This introduces many redundant data processing and model training operations. Reusing the data generated by the redundant operations leads to more efficient execution of future workloads. However, existing collaborative environments lack a data management component for storing and reusing the results of previously executed operations. In this paper, we present a system to optimize the execution of ML workloads in collaborative environments by reusing previously performed operations and their results. We utilize a so-called Experiment Graph (EG) to store the artifacts, i.e., raw and intermediate data or ML models, as vertices and the operations of ML workloads as edges. In theory, the size of EG can become unnecessarily large, while the storage budget might be limited. At the same time, for some artifacts, the overall storage and retrieval cost might outweigh the recomputation cost. To address this issue, we propose two algorithms for materializing artifacts based on their likelihood of future reuse. Given the materialized artifacts inside EG, we devise a linear-time reuse algorithm to find the optimal execution plan for incoming ML workloads. Our reuse algorithm incurs only a negligible overhead and scales to the high number of incoming ML workloads in collaborative environments. Our experiments show that we improve the run-time by one order of magnitude for repeated execution of the workloads and by 50% for the execution of modified workloads in collaborative environments.
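
The load-versus-recompute trade-off described in the abstract can be illustrated with a small Python sketch. This is not the paper's algorithm: the Artifact class, the cost fields, the function name reuse_plan, and all numbers are hypothetical, chosen only to show the idea of an Experiment Graph whose vertices are artifacts and whose edges are operations. The sketch makes one forward pass over the artifacts in topological order to decide, per artifact, whether loading a materialized copy is cheaper than recomputing it, then walks back from the target to keep only the artifacts that are actually needed. Summing parent costs over-counts ancestors that are shared between branches, so the result is exact only for chain- or tree-shaped workloads.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """One vertex of a toy experiment graph (all fields are illustrative)."""
    name: str
    parents: list = field(default_factory=list)  # names of input artifacts
    compute_cost: float = 0.0   # cost of the operation (edge) producing this artifact
    load_cost: float = 0.0      # cost of reading a stored copy
    materialized: bool = False  # True if a stored copy exists


def reuse_plan(graph, order, target):
    """Return (plan, cost) for producing `target`.

    `order` must list artifact names in topological order.
    Both passes touch each vertex and edge once, so the work is linear
    in the size of the graph.
    """
    best, action = {}, {}
    # Forward pass: cheapest way to obtain each artifact.
    for name in order:
        a = graph[name]
        if not a.parents:  # a source (raw data) can only be loaded
            best[name], action[name] = a.load_cost, "load"
            continue
        recompute = a.compute_cost + sum(best[p] for p in a.parents)
        if a.materialized and a.load_cost <= recompute:
            best[name], action[name] = a.load_cost, "load"
        else:
            best[name], action[name] = recompute, "compute"
    # Backward pass: keep only artifacts the target really depends on.
    needed, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        if action[name] == "compute":  # loaded artifacts need no inputs
            stack.extend(graph[name].parents)
    return [(n, action[n]) for n in order if n in needed], best[target]


# Hypothetical workload: raw data -> features -> model -> score.
GRAPH = {
    "raw": Artifact("raw", load_cost=2.0, materialized=True),
    "features": Artifact("features", ["raw"], compute_cost=10.0,
                         load_cost=1.0, materialized=True),
    "model": Artifact("model", ["features"], compute_cost=30.0),
    "score": Artifact("score", ["model"], compute_cost=1.0),
}
ORDER = ["raw", "features", "model", "score"]

if __name__ == "__main__":
    plan, cost = reuse_plan(GRAPH, ORDER, "score")
    print(plan)  # [('features', 'load'), ('model', 'compute'), ('score', 'compute')]
    print(cost)  # 32.0
```

In this toy workload the stored features are loaded instead of being rebuilt from the raw data, so the plan costs 32.0 rather than the 43.0 of running every step from scratch. If loading the stored features were more expensive than recomputing them, the same pass would fall back to recomputation, which mirrors the abstract's point that storage and retrieval can outweigh recomputation for some artifacts.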

Keywords

    collaborative ML, machine learning, materialization and reuse

Cite this

Optimizing Machine Learning Workloads in Collaborative Environments. / Derakhshan, Behrouz; Mahdiraji, Alireza; Abedjan, Ziawasch et al.
Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. 2020. p. 1701-1716.

Derakhshan, B, Mahdiraji, A, Abedjan, Z, Rabl, T & Markl, V 2020, Optimizing Machine Learning Workloads in Collaborative Environments. in Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. pp. 1701-1716. https://doi.org/10.1145/3318464.3389715
Derakhshan, B., Mahdiraji, A., Abedjan, Z., Rabl, T., & Markl, V. (2020). Optimizing Machine Learning Workloads in Collaborative Environments. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020 (pp. 1701-1716). https://doi.org/10.1145/3318464.3389715
Derakhshan B, Mahdiraji A, Abedjan Z, Rabl T, Markl V. Optimizing Machine Learning Workloads in Collaborative Environments. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. 2020. p. 1701-1716 doi: 10.1145/3318464.3389715
Derakhshan, Behrouz ; Mahdiraji, Alireza ; Abedjan, Ziawasch et al. / Optimizing Machine Learning Workloads in Collaborative Environments. Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020. 2020. pp. 1701-1716
@inproceedings{b8363300fc254c298718efa74c4aed9d,
title = "Optimizing Machine Learning Workloads in Collaborative Environments",
keywords = "collaborative ML, machine learning, materialization and reuse",
author = "Behrouz Derakhshan and Alireza Mahdiraji and Ziawasch Abedjan and Tilmann Rabl and Volker Markl",
note = "Funding information: This work was funded by the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A) and German Federal Ministry for Economic Affairs and Energy, Project “ExDRa” (01MD19002B).",
year = "2020",
month = jun,
doi = "10.1145/3318464.3389715",
language = "English",
pages = "1701--1716",
booktitle = "Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020",

}

TY - GEN

T1 - Optimizing Machine Learning Workloads in Collaborative Environments

AU - Derakhshan, Behrouz

AU - Mahdiraji, Alireza

AU - Abedjan, Ziawasch

AU - Rabl, Tilmann

AU - Markl, Volker

N1 - Funding information: This work was funded by the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A) and German Federal Ministry for Economic Affairs and Energy, Project “ExDRa” (01MD19002B).

PY - 2020/6

Y1 - 2020/6

KW - collaborative ML

KW - machine learning

KW - materialization and reuse

UR - http://www.scopus.com/inward/record.url?scp=85086245280&partnerID=8YFLogxK

U2 - 10.1145/3318464.3389715

DO - 10.1145/3318464.3389715

M3 - Conference contribution

SP - 1701

EP - 1716

BT - Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020

ER -