Raha: A Configuration-Free Error Detection System

Research output: Chapter in book/report/conference proceeding; Conference contribution; Research; peer review

Authors

  • Ziawasch Abedjan
  • Mohammad Mahdavi
  • Raul Castro Fernandez
  • Samuel Ross Madden
  • Mourad Ouzzani
  • M.R. Stonebraker
  • Nan Tang

External Research Organisations

  • Technische Universität Berlin
  • Massachusetts Institute of Technology

Details

Original language: English
Title of host publication: SIGMOD '19
Subtitle of host publication: Proceedings of the 2019 International Conference on Management of Data
Place of Publication: New York
Publisher: Association for Computing Machinery (ACM)
Pages: 865-882
Number of pages: 18
ISBN (electronic): 9781450356435
Publication status: Published - 25 Jun 2019
Event: SIGMOD/PODS '19: International Conference on Management of Data - Amsterdam, Netherlands
Duration: 30 Jun 2019 - 5 Jul 2019

Abstract

Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.
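
The pipeline outlined in the abstract (run several detector configurations, collect their outputs into one feature vector per value, then label a few representative values) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the detector functions, their thresholds, the toy column, and the exact-match grouping that stands in for the paper's sampling and classification scheme, which the abstract does not specify. It also works on a single column, whereas the paper has the user label tuples.

```python
from collections import Counter, defaultdict

def frequency_detector(max_count):
    """Hypothetical detector: flag values occurring at most max_count times."""
    def detect(column):
        counts = Counter(column)
        return [counts[v] <= max_count for v in column]
    return detect

def charset_detector(predicate):
    """Hypothetical detector: flag values containing a character matching predicate."""
    def detect(column):
        return [any(predicate(ch) for ch in v) for v in column]
    return detect

def build_feature_vectors(column, detectors):
    """One binary feature per (algorithm, configuration) pair for every value."""
    outputs = [detect(column) for detect in detectors]
    return [tuple(int(out[i]) for out in outputs) for i in range(len(column))]

def group_by_vector(vectors):
    """Stand-in for clustering: values with identical vectors share a group."""
    groups = defaultdict(list)
    for idx, vec in enumerate(vectors):
        groups[vec].append(idx)
    return groups

def propagate_labels(groups, label_fn):
    """Ask for one label per group and copy it to every member of the group."""
    labels = {}
    for members in groups.values():
        is_error = label_fn(members[0])  # only the representative is labeled
        for idx in members:
            labels[idx] = is_error
    return labels

# Toy column with one typo ("Par1s") and one missing value ("").
column = ["Berlin", "Berlin", "Paris", "Paris", "Par1s", "", "Berlin"]
detectors = [
    frequency_detector(1),                # same algorithm, two configurations
    frequency_detector(2),
    charset_detector(str.isdigit),        # digits inside a city name
    lambda col: [v == "" for v in col],   # empty-value check
]

vectors = build_feature_vectors(column, detectors)
groups = group_by_vector(vectors)
ground_truth = {4, 5}  # simulated user answers for the toy column
labels = propagate_labels(groups, lambda i: i in ground_truth)
errors = sorted(i for i, flagged in labels.items() if flagged)
print(errors)
```

In this toy run the user is consulted once per group (four labels for seven values), and the labels are copied to the remaining values. A real system would need a more tolerant grouping and a trained classifier, since identical feature vectors do not guarantee identical correctness.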

Cite this

Raha: A Configuration-Free Error Detection System. / Abedjan, Ziawasch; Mahdavi, Mohammad; Fernandez, Raul Castro et al.
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM), 2019. p. 865-882.

Abedjan, Z, Mahdavi, M, Fernandez, RC, Madden, SR, Ouzzani, M, Stonebraker, MR & Tang, N 2019, Raha: A Configuration-Free Error Detection System. in SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. Association for Computing Machinery (ACM), New York, pp. 865-882, SIGMOD/PODS '19, Netherlands, 30 Jun 2019. https://doi.org/10.1145/3299869.3324956
Abedjan, Z., Mahdavi, M., Fernandez, R. C., Madden, S. R., Ouzzani, M., Stonebraker, M. R., & Tang, N. (2019). Raha: A Configuration-Free Error Detection System. In SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data (pp. 865-882). Association for Computing Machinery (ACM). https://doi.org/10.1145/3299869.3324956
Abedjan Z, Mahdavi M, Fernandez RC, Madden SR, Ouzzani M, Stonebraker MR et al. Raha: A Configuration-Free Error Detection System. In SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM). 2019. p. 865-882 doi: 10.1145/3299869.3324956
Abedjan, Ziawasch ; Mahdavi, Mohammad ; Fernandez, Raul Castro et al. / Raha : A Configuration-Free Error Detection System. SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York : Association for Computing Machinery (ACM), 2019. pp. 865-882
@inproceedings{dc91f9ae39794094857f6ce0ffdeaa5d,
title = "Raha: A Configuration-Free Error Detection System",
abstract = "Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.",
author = "Ziawasch Abedjan and Mohammad Mahdavi and Fernandez, {Raul Castro} and Madden, {Samuel Ross} and Mourad Ouzzani and M.R. Stonebraker and Nan Tang",
note = "Funding information: This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.; SIGMOD/PODS '19 ; Conference date: 30-06-2019 Through 05-07-2019",
year = "2019",
month = jun,
day = "25",
doi = "10.1145/3299869.3324956",
language = "English",
pages = "865--882",
booktitle = "SIGMOD '19",
publisher = "Association for Computing Machinery (ACM)",
address = "New York",

}

TY - GEN

T1 - Raha: A Configuration-Free Error Detection System

T2 - SIGMOD/PODS '19

AU - Abedjan, Ziawasch

AU - Mahdavi, Mohammad

AU - Fernandez, Raul Castro

AU - Madden, Samuel Ross

AU - Ouzzani, Mourad

AU - Stonebraker, M.R.

AU - Tang, Nan

N1 - Funding information: This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.

PY - 2019/6/25

Y1 - 2019/6/25

AB - Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.

UR - http://www.scopus.com/inward/record.url?scp=85069437614&partnerID=8YFLogxK

U2 - 10.1145/3299869.3324956

DO - 10.1145/3299869.3324956

M3 - Conference contribution

SP - 865

EP - 882

BT - SIGMOD '19

PB - Association for Computing Machinery (ACM)

CY - New York

Y2 - 30 June 2019 through 5 July 2019

ER -