Raha: A Configuration-Free Error Detection System

Ziawasch Abedjan; Mohammad Mahdavi; Raul Castro Fernandez; Samuel Ross Madden; Mourad Ouzzani; M.R. Stonebraker; Nan Tang

doi:10.1145/3299869.3324956

Details

Original language	English
Title of host publication	SIGMOD '19
Subtitle	Proceedings of the 2019 International Conference on Management of Data
Place of publication	New York
Publisher	Association for Computing Machinery (ACM)
Pages	865-882
Number of pages	18
ISBN (electronic)	9781450356435
Publication status	Published - 25 June 2019
Event	SIGMOD/PODS '19: International Conference on Management of Data - Amsterdam, Netherlands. Duration: 30 June 2019 → 5 July 2019

Abstract

Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.
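
The pipeline described in the abstract (run several detector configurations, build one feature vector per value, then label only a few representative values) can be illustrated with a minimal Python sketch. It is not Raha's implementation: the detectors, the thresholds, the exact-match grouping, and all function names below are assumptions chosen for illustration, and grouping identical vectors stands in for the paper's sampling and classification scheme.

```python
from collections import defaultdict
from statistics import mean, pstdev

def to_number(value):
    """Parse a string as a float, or return None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None

# Hypothetical detector configurations: each maps a column (list of strings)
# to one boolean flag per value. They are illustrative stand-ins only.
def flag_empty(values):
    return [v.strip() == "" for v in values]

def flag_non_numeric(values):
    return [to_number(v) is None for v in values]

def make_outlier_flagger(threshold):
    """Build a z-score detector; the threshold is the 'configuration'."""
    def flag(values):
        nums = [to_number(v) for v in values]
        known = [n for n in nums if n is not None]
        if len(known) < 2:
            return [False] * len(values)
        mu, sigma = mean(known), pstdev(known)
        if sigma == 0:
            return [False] * len(values)
        return [n is not None and abs(n - mu) / sigma > threshold for n in nums]
    return flag

# Several configurations of the same algorithm simply add more features.
DETECTORS = [flag_empty, flag_non_numeric,
             make_outlier_flagger(1.5), make_outlier_flagger(2.5)]

def feature_vectors(values):
    """One 0/1 feature per detector configuration, for every value."""
    flags = [detector(values) for detector in DETECTORS]
    return [tuple(int(col[i]) for col in flags) for i in range(len(values))]

def group_by_vector(vectors):
    """Group value indices that share an identical feature vector."""
    groups = defaultdict(list)
    for index, vector in enumerate(vectors):
        groups[vector].append(index)
    return dict(groups)

def propagate_labels(groups, ask_user):
    """Ask for one label per group and copy it to every member."""
    labels = {}
    for members in groups.values():
        verdict = ask_user(members[0])
        for index in members:
            labels[index] = verdict
    return labels

column = ["30", "31", "29", "", "abc", "3000", "30"]
vectors = feature_vectors(column)
groups = group_by_vector(vectors)
```

In this sketch a user would be asked about one value per group rather than about every value, which is the intuition behind needing only a small number of labels; how well that transfers to real datasets depends on the detectors and grouping actually used.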

ASJC Scopus subject areas

Computer Science (all)
Software
Computer Science (all)
Information systems

Cite this

Raha: A Configuration-Free Error Detection System. / Abedjan, Ziawasch; Mahdavi, Mohammad; Fernandez, Raul Castro et al.
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM), 2019. pp. 865-882.

Publication: Chapter in book/report/anthology/conference proceedings › Conference contribution › Research › Peer-reviewed

Abedjan, Z, Mahdavi, M, Fernandez, RC, Madden, SR, Ouzzani, M, Stonebraker, MR & Tang, N 2019, Raha: A Configuration-Free Error Detection System. in SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. Association for Computing Machinery (ACM), New York, pp. 865-882, SIGMOD/PODS '19, Netherlands, 30 June 2019. https://doi.org/10.1145/3299869.3324956

Abedjan, Z., Mahdavi, M., Fernandez, R. C., Madden, S. R., Ouzzani, M., Stonebraker, M. R., & Tang, N. (2019). Raha: A Configuration-Free Error Detection System. In SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data (pp. 865-882). Association for Computing Machinery (ACM). https://doi.org/10.1145/3299869.3324956

Abedjan Z, Mahdavi M, Fernandez RC, Madden SR, Ouzzani M, Stonebraker MR et al. Raha: A Configuration-Free Error Detection System. in SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM). 2019. p. 865-882 doi: 10.1145/3299869.3324956

Abedjan, Ziawasch ; Mahdavi, Mohammad ; Fernandez, Raul Castro et al. / Raha : A Configuration-Free Error Detection System. SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York : Association for Computing Machinery (ACM), 2019. pp. 865-882


@inproceedings{dc91f9ae39794094857f6ce0ffdeaa5d,

title = "Raha: A Configuration-Free Error Detection System",

abstract = "Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.",

author = "Ziawasch Abedjan and Mohammad Mahdavi and Fernandez, {Raul Castro} and Madden, {Samuel Ross} and Mourad Ouzzani and M.R. Stonebraker and Nan Tang",

note = "Funding information: This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.; SIGMOD/PODS '19 ; Conference date: 30-06-2019 Through 05-07-2019",

year = "2019",

month = jun,

day = "25",

doi = "10.1145/3299869.3324956",

language = "English",

pages = "865--882",

booktitle = "SIGMOD '19",

publisher = "Association for Computing Machinery (ACM)",

address = "United States",

}


TY - GEN

T1 - Raha: A Configuration-Free Error Detection System

T2 - SIGMOD/PODS '19

AU - Abedjan, Ziawasch

AU - Mahdavi, Mohammad

AU - Fernandez, Raul Castro

AU - Madden, Samuel Ross

AU - Ouzzani, Mourad

AU - Stonebraker, M.R.

AU - Tang, Nan

N1 - Funding information: This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.

PY - 2019/6/25

Y1 - 2019/6/25

N2 - Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.

AB - Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.

UR - http://www.scopus.com/inward/record.url?scp=85069437614&partnerID=8YFLogxK

U2 - 10.1145/3299869.3324956

DO - 10.1145/3299869.3324956

M3 - Conference contribution

SP - 865

EP - 882

BT - SIGMOD '19

PB - Association for Computing Machinery (ACM)

CY - New York

Y2 - 30 June 2019 through 5 July 2019

ER -
