Raha: A Configuration-Free Error Detection System

Ziawasch Abedjan; Mohammad Mahdavi; Raul Castro Fernandez; Samuel Ross Madden; Mourad Ouzzani; M.R. Stonebraker; Nan Tang

doi:10.1145/3299869.3324956

Details

Original language	English
Title of host publication	SIGMOD '19
Subtitle of host publication	Proceedings of the 2019 International Conference on Management of Data
Place of Publication	New York
Publisher	Association for Computing Machinery (ACM)
Pages	865-882
Number of pages	18
ISBN (electronic)	9781450356435
Publication status	Published - 25 Jun 2019
Event	SIGMOD/PODS '19: International Conference on Management of Data - Amsterdam, Netherlands
Duration	30 Jun 2019 → 5 Jul 2019

Abstract

Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.
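
The feature-vector idea in the abstract can be illustrated with a minimal sketch. It is not Raha's implementation: the two detector families, their parameter values, and the function names (`rare_value_detector`, `char_detector`, `feature_vectors`) are assumptions made for this example. Each detector configuration contributes one binary feature per cell of a column, so no single configuration has to be chosen by the user.

```python
from collections import Counter


def rare_value_detector(threshold):
    """Hypothetical detector: flag values whose relative frequency is below threshold."""
    def detect(column):
        counts = Counter(column)
        n = len(column)
        return [counts[v] / n < threshold for v in column]
    return detect


def char_detector(ch):
    """Hypothetical detector: flag values that contain the character ch."""
    def detect(column):
        return [ch in v for v in column]
    return detect


def feature_vectors(column, detectors):
    """Return one binary feature per detector configuration for every cell."""
    outputs = [detect(column) for detect in detectors]
    return [tuple(int(out[i]) for out in outputs) for i in range(len(column))]


# Configurations are enumerated up front instead of being tuned by a user
# (illustrative values only).
detectors = [
    rare_value_detector(0.2),
    rare_value_detector(0.5),
    char_detector("?"),
    char_detector(" "),
]

column = ["Paris", "Paris", "Paris", "Berlin", "Berlin", "P?ris"]
vectors = feature_vectors(column, detectors)

for value, vector in zip(column, vectors):
    print(value, vector)
```

In this toy column, cells with identical vectors behave alike under every configuration, which is the kind of grouping that a sampling and classification step could use to pick a few representative values for labeling.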

ASJC Scopus subject areas

Computer Science(all)
Software
Information Systems

Cite this

Raha: A Configuration-Free Error Detection System. / Abedjan, Ziawasch; Mahdavi, Mohammad; Fernandez, Raul Castro et al.
SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM), 2019. p. 865-882.

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Abedjan, Z, Mahdavi, M, Fernandez, RC, Madden, SR, Ouzzani, M, Stonebraker, MR & Tang, N 2019, Raha: A Configuration-Free Error Detection System. in SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. Association for Computing Machinery (ACM), New York, pp. 865-882, SIGMOD/PODS '19, Netherlands, 30 Jun 2019. https://doi.org/10.1145/3299869.3324956

Abedjan, Z., Mahdavi, M., Fernandez, R. C., Madden, S. R., Ouzzani, M., Stonebraker, M. R., & Tang, N. (2019). Raha: A Configuration-Free Error Detection System. In SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data (pp. 865-882). Association for Computing Machinery (ACM). https://doi.org/10.1145/3299869.3324956

Abedjan Z, Mahdavi M, Fernandez RC, Madden SR, Ouzzani M, Stonebraker MR et al. Raha: A Configuration-Free Error Detection System. In SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM). 2019. p. 865-882. doi: 10.1145/3299869.3324956

Abedjan, Ziawasch; Mahdavi, Mohammad; Fernandez, Raul Castro et al. / Raha: A Configuration-Free Error Detection System. SIGMOD '19: Proceedings of the 2019 International Conference on Management of Data. New York: Association for Computing Machinery (ACM), 2019. pp. 865-882

@inproceedings{dc91f9ae39794094857f6ce0ffdeaa5d,

title = "Raha: A Configuration-Free Error Detection System",

abstract = "Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.",

author = "Ziawasch Abedjan and Mohammad Mahdavi and Fernandez, {Raul Castro} and Madden, {Samuel Ross} and Mourad Ouzzani and M.R. Stonebraker and Nan Tang",

note = "Funding information: This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.; SIGMOD/PODS '19 ; Conference date: 30-06-2019 Through 05-07-2019",

year = "2019",

month = jun,

day = "25",

doi = "10.1145/3299869.3324956",

language = "English",

pages = "865--882",

booktitle = "SIGMOD '19",

publisher = "Association for Computing Machinery (ACM)",

address = "New York",

}

TY - GEN

T1 - Raha: A Configuration-Free Error Detection System

T2 - SIGMOD/PODS '19

AU - Abedjan, Ziawasch

AU - Mahdavi, Mohammad

AU - Fernandez, Raul Castro

AU - Madden, Samuel Ross

AU - Ouzzani, Mourad

AU - Stonebraker, M.R.

AU - Tang, Nan

N1 - Funding information: This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.

PY - 2019/6/25

Y1 - 2019/6/25

AB - Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.

UR - http://www.scopus.com/inward/record.url?scp=85069437614&partnerID=8YFLogxK

U2 - 10.1145/3299869.3324956

DO - 10.1145/3299869.3324956

M3 - Conference contribution

SP - 865

EP - 882

BT - SIGMOD '19

PB - Association for Computing Machinery (ACM)

CY - New York

Y2 - 30 June 2019 through 5 July 2019

ER -
