Duplicate Table Detection with Xash

Publication: Contribution to book/report/anthology/conference proceedings › Conference contribution › Research › Peer-reviewed

Authors

  • Maximilian Koch
  • Mahdi Esmailoghli
  • Sören Auer
  • Ziawasch Abedjan

Organisational units

External organisations

  • Technische Informationsbibliothek (TIB) Leibniz-Informationszentrum Technik und Naturwissenschaften und Universitätsbibliothek

Details

Original language: English
Title of host publication: Datenbanksysteme für Business, Technologie und Web
Subtitle: BTW 2023
Editors: Birgitta König-Ries, Stefanie Scherzinger, Wolfgang Lehner, Gottfried Vossen
Publisher: Gesellschaft für Informatik (GI)
Pages: 367-390
Number of pages: 24
ISBN (electronic): 9783885797258
Publication status: Published - 2023
Event: 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 - Dresden, Germany
Duration: 6 March 2023 to 10 March 2023

Publication series

Name: Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)
Volume: P-331
ISSN (Print): 1617-5468

Abstract

Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive, as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, to the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which acts like a Bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
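
The filter-then-verify idea described in the abstract can be illustrated with a minimal sketch. This is not the actual Xash function and not the authors' algorithm: the names `cell_bits`, `super_key`, `table_signature` and `are_duplicates`, the 128-bit width, and the use of SHA-256 are all assumptions made for illustration. Each cell sets a few bits, a row's super key is the bitwise OR of its cells (so it does not depend on column order), and two tables are only verified exactly when their multisets of row super keys match.

```python
import hashlib
from collections import Counter
from itertools import permutations

BITS = 128  # assumed super-key width; chosen only for this sketch


def cell_bits(value, k=2):
    """Map a cell value to a bitmask with up to k bits set.

    Bloom-filter-style stand-in for a real hash such as Xash.
    """
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    mask = 0
    for i in range(k):
        pos = int.from_bytes(digest[4 * i:4 * i + 4], "big") % BITS
        mask |= 1 << pos
    return mask


def super_key(row):
    """OR all cell masks of a row; the result ignores column order."""
    key = 0
    for value in row:
        key |= cell_bits(value)
    return key


def table_signature(rows):
    """Multiset of row super keys; ignores both row and column order."""
    return Counter(super_key(row) for row in rows)


def verify(a, b):
    """Exact check: is there a column permutation of b that yields the
    same multiset of rows as a? Brute force, so small tables only."""
    width = len(a[0]) if a else 0
    if b and len(b[0]) != width:
        return False
    target = Counter(tuple(str(v) for v in row) for row in a)
    for perm in permutations(range(width)):
        if Counter(tuple(str(row[i]) for i in perm) for row in b) == target:
            return True
    return False


def are_duplicates(a, b):
    """Cheap signature filter first, exact verification second."""
    if len(a) != len(b):
        return False
    if table_signature(a) != table_signature(b):
        return False  # signatures differ, so the tables cannot be duplicates
    return verify(a, b)  # equal signatures may still be a false positive
```

Equal signatures only mark a pair as a candidate; hash collisions can produce false positives, which is why the exact verification step remains necessary. The number of such false candidates is the quantity on which the abstract compares Xash with SimHash.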

Cite

Duplicate Table Detection with Xash. / Koch, Maximilian; Esmailoghli, Mahdi; Auer, Sören et al.
Datenbanksysteme für Business, Technologie und Web: BTW 2023. ed. / Birgitta König-Ries; Stefanie Scherzinger; Wolfgang Lehner; Gottfried Vossen. Gesellschaft für Informatik (GI), 2023. pp. 367-390 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI); Vol. P-331).

Koch, M, Esmailoghli, M, Auer, S & Abedjan, Z 2023, Duplicate Table Detection with Xash. in B König-Ries, S Scherzinger, W Lehner & G Vossen (eds), Datenbanksysteme für Business, Technologie und Web: BTW 2023. Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI), vol. P-331, Gesellschaft für Informatik (GI), pp. 367-390, 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023, Dresden, Germany, 6 March 2023. https://doi.org/10.18420/BTW2023-18
Koch, M., Esmailoghli, M., Auer, S., & Abedjan, Z. (2023). Duplicate Table Detection with Xash. In B. König-Ries, S. Scherzinger, W. Lehner, & G. Vossen (Eds.), Datenbanksysteme für Business, Technologie und Web: BTW 2023 (pp. 367-390). (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI); Vol. P-331). Gesellschaft für Informatik (GI). https://doi.org/10.18420/BTW2023-18
Koch M, Esmailoghli M, Auer S, Abedjan Z. Duplicate Table Detection with Xash. In König-Ries B, Scherzinger S, Lehner W, Vossen G, editors, Datenbanksysteme für Business, Technologie und Web: BTW 2023. Gesellschaft für Informatik (GI). 2023. p. 367-390. (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)). doi: 10.18420/BTW2023-18
Koch, Maximilian ; Esmailoghli, Mahdi ; Auer, Sören et al. / Duplicate Table Detection with Xash. Datenbanksysteme für Business, Technologie und Web: BTW 2023. ed. / Birgitta König-Ries ; Stefanie Scherzinger ; Wolfgang Lehner ; Gottfried Vossen. Gesellschaft für Informatik (GI), 2023. pp. 367-390 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)).
Download
@inproceedings{140d25a9567249ada3e6de51f41217f2,
title = "Duplicate Table Detection with Xash",
abstract = "Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.",
keywords = "data discovery, data lakes, duplicate table detection",
author = "Maximilian Koch and Mahdi Esmailoghli and S{\"o}ren Auer and Ziawasch Abedjan",
note = "Funding Information: Future improvements for table de-duplication could be to consider hashing for the grouping phase of large table corpora and to devise algorithms that are independent of lake indexes. Furthermore, it would be interesting to research fuzzy table duplicates. Our current approaches consider tables to be duplicates only when two tables contain the same set of tuples regardless of row and column order. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445. ; 2023 Datenbanksysteme f{\"u}r Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 ; Conference date: 06-03-2023 Through 10-03-2023",
year = "2023",
doi = "10.18420/BTW2023-18",
language = "English",
series = "Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft f{\"u}r Informatik (GI)",
publisher = "Gesellschaft f{\"u}r Informatik (GI)",
pages = "367--390",
editor = "Birgitta K{\"o}nig-Ries and Stefanie Scherzinger and Wolfgang Lehner and Gottfried Vossen",
booktitle = "Datenbanksysteme f{\"u}r Business, Technologie und Web",
address = "Germany",

}

Download

TY - GEN

T1 - Duplicate Table Detection with Xash

AU - Koch, Maximilian

AU - Esmailoghli, Mahdi

AU - Auer, Sören

AU - Abedjan, Ziawasch

N1 - Funding Information: Future improvements for table de-duplication could be to consider hashing for the grouping phase of large table corpora and to devise algorithms that are independent of lake indexes. Furthermore, it would be interesting to research fuzzy table duplicates. Our current approaches consider tables to be duplicates only when two tables contain the same set of tuples regardless of row and column order. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.

PY - 2023

Y1 - 2023

N2 - Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.

AB - Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.

KW - data discovery

KW - data lakes

KW - duplicate table detection

UR - http://www.scopus.com/inward/record.url?scp=85149970671&partnerID=8YFLogxK

U2 - 10.18420/BTW2023-18

DO - 10.18420/BTW2023-18

M3 - Conference contribution

AN - SCOPUS:85149970671

T3 - Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)

SP - 367

EP - 390

BT - Datenbanksysteme für Business, Technologie und Web

A2 - König-Ries, Birgitta

A2 - Scherzinger, Stefanie

A2 - Lehner, Wolfgang

A2 - Vossen, Gottfried

PB - Gesellschaft für Informatik (GI)

T2 - 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023

Y2 - 6 March 2023 through 10 March 2023

ER -
