From Cleaning before ML to Cleaning for ML.

Felix Neutatz; Binger Chen; Ziawasch Abedjan; Eugene Wu

Details

Original language	English
Article number	1
Pages (from-to)	24-41
Number of pages	18
Journal	IEEE Data Eng. Bull.
Volume	44
Issue number	1
Publication status	Published - 2021

Abstract

Data cleaning is widely regarded as a critical piece of machine learning (ML) applications, as data errors can corrupt models in ways that cause the application to operate incorrectly, unfairly, or dangerously. Traditional data cleaning focuses on quality issues of a dataset in isolation of the application using the data—Cleaning Before ML—which can be inefficient and, counterintuitively, degrade the application further. While recent cleaning approaches take into account signals from the ML model, such as the model accuracy, they are still local to a specific model, and do not take into account the entire application’s semantics and user goals. What is needed is an end-to-end application-driven approach towards Cleaning For ML, that can leverage signals throughout the entire ML application to optimize the cleaning for application goals and to reduce manual cleaning efforts. This paper briefly reviews recent progress in Cleaning For ML, presents our vision of a holistic cleaning framework, and outlines new challenges that arise when data cleaning meets ML applications.

Cite this

From Cleaning before ML to Cleaning for ML. / Neutatz, Felix; Chen, Binger; Abedjan, Ziawasch et al.
In: IEEE Data Eng. Bull., Vol. 44, No. 1, 1, 2021, p. 24-41.

Research output: Contribution to journal › Article › Research › peer review

Neutatz, F, Chen, B, Abedjan, Z & Wu, E 2021, 'From Cleaning before ML to Cleaning for ML.', IEEE Data Eng. Bull., vol. 44, no. 1, 1, pp. 24-41. <https://www.semanticscholar.org/paper/From-Cleaning-before-ML-to-Cleaning-for-ML-Neutatz-Chen/3797db0472220535fb16d0b6f213cce6d5f3da42>

Neutatz, F., Chen, B., Abedjan, Z., & Wu, E. (2021). From Cleaning before ML to Cleaning for ML. IEEE Data Eng. Bull., 44(1), 24-41. Article 1. https://www.semanticscholar.org/paper/From-Cleaning-before-ML-to-Cleaning-for-ML-Neutatz-Chen/3797db0472220535fb16d0b6f213cce6d5f3da42

Neutatz F, Chen B, Abedjan Z, Wu E. From Cleaning before ML to Cleaning for ML. IEEE Data Eng. Bull. 2021;44(1):24-41. 1.

Neutatz, Felix ; Chen, Binger ; Abedjan, Ziawasch et al. / From Cleaning before ML to Cleaning for ML. In: IEEE Data Eng. Bull. 2021 ; Vol. 44, No. 1. pp. 24-41.

@article{b0bceb36d49849d4977516f91c54837f,

title = "From Cleaning before ML to Cleaning for ML.",

abstract = "Data cleaning is widely regarded as a critical piece of machine learning (ML) applications, as data errors can corrupt models in ways that cause the application to operate incorrectly, unfairly, or dangerously. Traditional data cleaning focuses on quality issues of a dataset in isolation of the application using the data—Cleaning Before ML—which can be inefficient and, counterintuitively, degrade the application further. While recent cleaning approaches take into account signals from the ML model, such as the model accuracy, they are still local to a specific model, and do not take into account the entire application{\textquoteright}s semantics and user goals. What is needed is an end-to-end application-driven approach towards Cleaning For ML, that can leverage signals throughout the entire ML application to optimize the cleaning for application goals and to reduce manual cleaning efforts. This paper briefly reviews recent progress in Cleaning For ML, presents our vision of a holistic cleaning framework, and outlines new challenges that arise when data cleaning meets ML applications.",

author = "Felix Neutatz and Binger Chen and Ziawasch Abedjan and Eugene Wu",

note = "Funding Information: This work was funded by the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A); Eugene Wu was funded by National Science Foundation awards 1564049, 1845638, and 2008295, Amazon and Google research awards, and a Columbia SIRS award.",

year = "2021",

language = "English",

volume = "44",

pages = "24--41",

number = "1",

}

TY - JOUR

T1 - From Cleaning before ML to Cleaning for ML.

AU - Neutatz, Felix

AU - Chen, Binger

AU - Abedjan, Ziawasch

AU - Wu, Eugene

N1 - Funding Information: This work was funded by the German Ministry for Education and Research as BIFOLD - Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A); Eugene Wu was funded by National Science Foundation awards 1564049, 1845638, and 2008295, Amazon and Google research awards, and a Columbia SIRS award.

PY - 2021

Y1 - 2021

AB - Data cleaning is widely regarded as a critical piece of machine learning (ML) applications, as data errors can corrupt models in ways that cause the application to operate incorrectly, unfairly, or dangerously. Traditional data cleaning focuses on quality issues of a dataset in isolation of the application using the data—Cleaning Before ML—which can be inefficient and, counterintuitively, degrade the application further. While recent cleaning approaches take into account signals from the ML model, such as the model accuracy, they are still local to a specific model, and do not take into account the entire application’s semantics and user goals. What is needed is an end-to-end application-driven approach towards Cleaning For ML, that can leverage signals throughout the entire ML application to optimize the cleaning for application goals and to reduce manual cleaning efforts. This paper briefly reviews recent progress in Cleaning For ML, presents our vision of a holistic cleaning framework, and outlines new challenges that arise when data cleaning meets ML applications.

UR - http://sites.computer.org/debull/A21mar/p24.pdf

M3 - Article

VL - 44

SP - 24

EP - 41

JO - IEEE Data Eng. Bull.

JF - IEEE Data Eng. Bull.

IS - 1

M1 - 1

ER -
