Details
Original language | English |
---|---|
Title of host publication | Datenbanksysteme für Business, Technologie und Web |
Subtitle of host publication | BTW 2023 |
Editors | Birgitta König-Ries, Stefanie Scherzinger, Wolfgang Lehner, Gottfried Vossen |
Publisher | Gesellschaft für Informatik (GI) |
Pages | 367-390 |
Number of pages | 24 |
ISBN (electronic) | 9783885797258 |
Publication status | Published - 2023 |
Event | 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 - Dresden, Germany. Duration: 6 Mar 2023 → 10 Mar 2023 |
Publication series
Name | Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI) |
---|---|
Volume | P-331 |
ISSN (Print) | 1617-5468 |
Abstract
Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a Bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
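The filter-then-verify idea in the abstract can be illustrated with a minimal sketch. It does not implement Xash: `cell_mask` is a hypothetical stand-in built on SHA-256, and the 128-bit width, the two bits per cell, and all function names are assumptions made for illustration only. Each row is reduced to a super key by OR-ing the masks of its cells, so the key does not depend on column order and can be probed like a Bloom filter. Tables whose multisets of super keys differ cannot be duplicates, so only the remaining candidates need the expensive exact comparison, which here is a brute-force search over column permutations suitable for small tables only.

```python
import hashlib
from collections import Counter
from itertools import permutations

BITS = 128  # assumed super key width; not taken from the paper


def cell_mask(value, bits_per_cell=2):
    """Hypothetical stand-in for Xash: map a cell value to a few set bits."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    mask = 0
    for i in range(bits_per_cell):
        position = int.from_bytes(digest[4 * i:4 * i + 4], "big") % BITS
        mask |= 1 << position
    return mask


def super_key(row):
    """OR the cell masks of a row; the result ignores column order."""
    key = 0
    for value in row:
        key |= cell_mask(value)
    return key


def may_contain(key, value):
    """Bloom-filter style probe: False means the value is definitely absent."""
    mask = cell_mask(value)
    return (key & mask) == mask


def signature(table):
    """Multiset of row super keys; ignores row order as well."""
    return Counter(super_key(row) for row in table)


def exact_duplicates(a, b):
    """Brute-force check: some column permutation of a equals b as a multiset."""
    target = Counter(tuple(str(v) for v in row) for row in b)
    for perm in permutations(range(len(a[0]))):
        candidate = Counter(tuple(str(row[i]) for i in perm) for row in a)
        if candidate == target:
            return True
    return False


def are_duplicates(a, b):
    """Cheap super key filter first, exact comparison only for candidates."""
    if len(a) != len(b):
        return False
    if not a:
        return True
    if len(a[0]) != len(b[0]):
        return False
    if signature(a) != signature(b):
        return False  # differing signatures rule out a duplicate
    return exact_duplicates(a, b)


people = [("1", "alice", "berlin"), ("2", "bob", "hannover")]
shuffled = [("hannover", "2", "bob"), ("berlin", "1", "alice")]
changed = [("1", "alice", "berlin"), ("2", "bob", "dresden")]

print(are_duplicates(people, shuffled))  # True
print(are_duplicates(people, changed))   # False
```

The filter can pass non-duplicates when masks collide (false positives), which is why the exact check remains; it never rejects true duplicates, because equal rows always produce equal super keys. The sketch compares rows as multisets, which is slightly stricter than a pure set-of-tuples definition.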
Keywords
- data discovery, data lakes, duplicate table detection
ASJC Scopus subject areas
- Computer Science (all)
- Computer Science Applications
Cite this
Duplicate Table Detection with Xash. / Koch, Maximilian; Esmailoghli, Mahdi; Auer, Sören; Abedjan, Ziawasch. Datenbanksysteme für Business, Technologie und Web: BTW 2023. ed. / Birgitta König-Ries; Stefanie Scherzinger; Wolfgang Lehner; Gottfried Vossen. Gesellschaft für Informatik (GI), 2023. p. 367-390 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI); Vol. P-331).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Duplicate Table Detection with Xash
AU - Koch, Maximilian
AU - Esmailoghli, Mahdi
AU - Auer, Sören
AU - Abedjan, Ziawasch
N1 - Funding Information: Future improvements for table de-duplication could be to consider hashing for the grouping phase of large table corpora and to devise algorithms that are independent of lake indexes. Furthermore, it would be interesting to research fuzzy table duplicates. Our current approaches consider tables to be duplicates only when two tables contain the same set of tuples regardless of row and column order. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.
PY - 2023
Y1 - 2023
N2 - Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a Bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
KW - data discovery
KW - data lakes
KW - duplicate table detection
UR - http://www.scopus.com/inward/record.url?scp=85149970671&partnerID=8YFLogxK
U2 - 10.18420/BTW2023-18
DO - 10.18420/BTW2023-18
M3 - Conference contribution
AN - SCOPUS:85149970671
T3 - Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)
SP - 367
EP - 390
BT - Datenbanksysteme für Business, Technologie und Web
A2 - König-Ries, Birgitta
A2 - Scherzinger, Stefanie
A2 - Lehner, Wolfgang
A2 - Vossen, Gottfried
PB - Gesellschaft für Informatik (GI)
T2 - 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023
Y2 - 6 March 2023 through 10 March 2023
ER -