Raha: A Configuration-Free Error Detection System

Publication: Contribution to book/report/anthology/conference proceedings; conference paper; research; peer-reviewed

Authors

  • Ziawasch Abedjan
  • Mohammad Mahdavi
  • Raul Castro Fernandez
  • Samuel Ross Madden
  • Mourad Ouzzani
  • M.R. Stonebraker
  • Nan Tang

External organizations

  • Technische Universität Berlin
  • Massachusetts Institute of Technology (MIT)

Details

Original language: English
Title of host publication: SIGMOD '19
Subtitle: Proceedings of the 2019 International Conference on Management of Data
Place of publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 865-882
Number of pages: 18
ISBN (electronic): 9781450356435
Publication status: Published - 25 June 2019
Event: SIGMOD/PODS '19: International Conference on Management of Data - Amsterdam, Netherlands
Duration: 30 June 2019 to 5 July 2019

Abstract

Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.
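The abstract's idea of turning the outputs of many detector configurations into one feature vector per value can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the four detector rules, the threshold of 120, the sample column, and the "one representative per distinct vector" selection are hypothetical stand-ins, not the detectors, sampling scheme, or classifier used by Raha.

```python
import re
from collections import defaultdict


def make_detectors():
    """Hypothetical detector configurations.

    Each detector maps a cell value to True when it flags the value.
    """
    return [
        lambda v: v.strip() == "",                  # missing value
        lambda v: re.fullmatch(r"\d+", v) is None,  # not purely digits
        lambda v: len(v) > 5,                       # unusually long value
        lambda v: v.isdigit() and int(v) > 120,     # outlier (assumed threshold)
    ]


def feature_vectors(column, detectors):
    """Build one binary feature vector per cell: one bit per detector."""
    return [tuple(int(d(v)) for d in detectors) for v in column]


def representatives(vectors):
    """Group cells with identical vectors; keep the first index of each group.

    This is a simplified stand-in for choosing representative values
    to show to a user for labeling.
    """
    groups = defaultdict(list)
    for idx, vec in enumerate(vectors):
        groups[vec].append(idx)
    return {vec: idxs[0] for vec, idxs in groups.items()}


# Hypothetical column of ages with a few dirty values.
ages = ["34", "29", "", "abc", "450", "41"]
vectors = feature_vectors(ages, make_detectors())
to_label = sorted(representatives(vectors).values())
```

In this toy setup, cells flagged by the same set of detectors share a vector, so a label given to one representative could plausibly be propagated to the rest of its group before training a classifier.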

Cite this

Raha: A Configuration-Free Error Detection System. / Abedjan, Ziawasch; Mahdavi, Mohammad; Fernandez, Raul Castro et al.
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM), 2019. pp. 865-882.

Abedjan, Z, Mahdavi, M, Fernandez, RC, Madden, SR, Ouzzani, M, Stonebraker, MR & Tang, N 2019, Raha: A Configuration-Free Error Detection System. in SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. Association for Computing Machinery (ACM), New York, pp. 865-882, SIGMOD/PODS '19, Netherlands, 30 June 2019. https://doi.org/10.1145/3299869.3324956
Abedjan, Z., Mahdavi, M., Fernandez, R. C., Madden, S. R., Ouzzani, M., Stonebraker, M. R., & Tang, N. (2019). Raha: A Configuration-Free Error Detection System. In SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data (pp. 865-882). Association for Computing Machinery (ACM). https://doi.org/10.1145/3299869.3324956
Abedjan Z, Mahdavi M, Fernandez RC, Madden SR, Ouzzani M, Stonebraker MR et al. Raha: A Configuration-Free Error Detection System. in SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM). 2019. pp. 865-882. doi: 10.1145/3299869.3324956
Abedjan, Ziawasch ; Mahdavi, Mohammad ; Fernandez, Raul Castro et al. / Raha : A Configuration-Free Error Detection System. SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York : Association for Computing Machinery (ACM), 2019. pp. 865-882
@inproceedings{dc91f9ae39794094857f6ce0ffdeaa5d,
title = "Raha: A Configuration-Free Error Detection System",
abstract = "Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.",
author = "Ziawasch Abedjan and Mohammad Mahdavi and Fernandez, {Raul Castro} and Madden, {Samuel Ross} and Mourad Ouzzani and M.R. Stonebraker and Nan Tang",
note = "Funding information: This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.; SIGMOD/PODS '19 ; Conference date: 30-06-2019 Through 05-07-2019",
year = "2019",
month = jun,
day = "25",
doi = "10.1145/3299869.3324956",
language = "English",
pages = "865--882",
booktitle = "SIGMOD '19",
publisher = "Association for Computing Machinery (ACM)",
address = "United States",

}

TY - GEN

T1 - Raha: A Configuration-Free Error Detection System

T2 - SIGMOD/PODS '19

AU - Abedjan, Ziawasch

AU - Mahdavi, Mohammad

AU - Fernandez, Raul Castro

AU - Madden, Samuel Ross

AU - Ouzzani, Mourad

AU - Stonebraker, M.R.

AU - Tang, Nan

N1 - Funding information: This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.

PY - 2019/6/25

Y1 - 2019/6/25

AB - Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.

UR - http://www.scopus.com/inward/record.url?scp=85069437614&partnerID=8YFLogxK

U2 - 10.1145/3299869.3324956

DO - 10.1145/3299869.3324956

M3 - Conference contribution

SP - 865

EP - 882

BT - SIGMOD '19

PB - Association for Computing Machinery (ACM)

CY - New York

Y2 - 30 June 2019 through 5 July 2019

ER -