Duplicate Table Detection with Xash

Research output: Chapter in book/report/conference proceeding > Conference contribution > Research > peer review

Authors

  • Maximilian Koch
  • Mahdi Esmailoghli
  • Sören Auer
  • Ziawasch Abedjan

Research Organisations

External Research Organisations

  • German National Library of Science and Technology (TIB)

Details

Original language: English
Title of host publication: Datenbanksysteme fur Business, Technologie und Web
Subtitle of host publication: BTW 2023
Editors: Birgitta Konig-Ries, Stefanie Scherzinger, Wolfgang Lehner, Gottfried Vossen
Publisher: Gesellschaft fur Informatik (GI)
Pages: 367-390
Number of pages: 24
ISBN (electronic): 9783885797258
Publication status: Published - 2023
Event: 2023 Datenbanksysteme fur Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 - Dresden, Germany
Duration: 6 Mar 2023 to 10 Mar 2023

Publication series

Name: Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)
Volume: P-331
ISSN (Print): 1617-5468

Abstract

Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
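
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the real Xash hash function is not reproduced here, and SHA-256 is swapped in as a stand-in for the per-cell hash. All names (`cell_mask`, `super_key`, `may_contain`, `table_signature`, `is_duplicate`) and the 128-bit width are assumptions made for illustration. Each row's cell masks are OR-ed into a super key, so the key does not depend on column order, and the multiset of super keys does not depend on row order. Tables whose signatures differ cannot be duplicates; tables whose signatures match are only candidates and still need an exact check.

```python
import hashlib
from collections import Counter
from itertools import chain, permutations

BITS = 128  # hypothetical super-key width, chosen only for this sketch


def cell_mask(value, bits=BITS, ones=3):
    """Map one cell value to a sparse bit mask (SHA-256 stand-in, NOT Xash)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    mask = 0
    for i in range(ones):
        # Use two digest bytes per bit position.
        pos = int.from_bytes(digest[2 * i:2 * i + 2], "big") % bits
        mask |= 1 << pos
    return mask


def super_key(row):
    """OR the masks of all cells in a row; the result ignores column order."""
    key = 0
    for value in row:
        key |= cell_mask(value)
    return key


def may_contain(key, value):
    """Bloom-filter-style probe: False means the value is definitely absent."""
    mask = cell_mask(value)
    return (key & mask) == mask


def table_signature(rows):
    """Multiset of row super keys; ignores both row and column order."""
    return Counter(super_key(row) for row in rows)


def is_duplicate(a, b):
    """Exact check, guarded by the cheap signature filter.

    Two tables count as duplicates when one column permutation turns the
    multiset of rows of `a` into the multiset of rows of `b`.
    The brute-force permutation search is only suitable for narrow tables.
    """
    if len(a) != len(b):
        return False
    if not a:
        return True
    width = len(a[0])
    if any(len(row) != width for row in chain(a, b)):
        return False
    if table_signature(a) != table_signature(b):
        return False  # filter: cannot be duplicates
    target = Counter(tuple(str(v) for v in row) for row in b)
    for perm in permutations(range(width)):
        permuted = Counter(tuple(str(row[i]) for i in perm) for row in a)
        if permuted == target:
            return True
    return False
```

In this sketch the filter never rejects a true duplicate, because OR-ing is independent of cell order. It can still pass non-duplicates, for example `[[1, 2], [3, 4]]` and `[[2, 1], [3, 4]]`, where no single column permutation aligns the two tables; such false positive candidates are what the exact check has to eliminate.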

Keywords

    data discovery, data lakes, duplicate table detection

Cite this

Duplicate Table Detection with Xash. / Koch, Maximilian; Esmailoghli, Mahdi; Auer, Sören et al.
Datenbanksysteme fur Business, Technologie und Web: BTW 2023. ed. / Birgitta Konig-Ries; Stefanie Scherzinger; Wolfgang Lehner; Gottfried Vossen. Gesellschaft fur Informatik (GI), 2023. p. 367-390 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI); Vol. P-331).

Koch, M, Esmailoghli, M, Auer, S & Abedjan, Z 2023, Duplicate Table Detection with Xash. in B Konig-Ries, S Scherzinger, W Lehner & G Vossen (eds), Datenbanksysteme fur Business, Technologie und Web: BTW 2023. Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI), vol. P-331, Gesellschaft fur Informatik (GI), pp. 367-390, 2023 Datenbanksysteme fur Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023, Dresden, Germany, 6 Mar 2023. https://doi.org/10.18420/BTW2023-18
Koch, M., Esmailoghli, M., Auer, S., & Abedjan, Z. (2023). Duplicate Table Detection with Xash. In B. Konig-Ries, S. Scherzinger, W. Lehner, & G. Vossen (Eds.), Datenbanksysteme fur Business, Technologie und Web: BTW 2023 (pp. 367-390). (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI); Vol. P-331). Gesellschaft fur Informatik (GI). https://doi.org/10.18420/BTW2023-18
Koch M, Esmailoghli M, Auer S, Abedjan Z. Duplicate Table Detection with Xash. In Konig-Ries B, Scherzinger S, Lehner W, Vossen G, editors, Datenbanksysteme fur Business, Technologie und Web: BTW 2023. Gesellschaft fur Informatik (GI). 2023. p. 367-390. (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)). doi: 10.18420/BTW2023-18
Koch, Maximilian ; Esmailoghli, Mahdi ; Auer, Sören et al. / Duplicate Table Detection with Xash. Datenbanksysteme fur Business, Technologie und Web: BTW 2023. editor / Birgitta Konig-Ries ; Stefanie Scherzinger ; Wolfgang Lehner ; Gottfried Vossen. Gesellschaft fur Informatik (GI), 2023. pp. 367-390 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)).
@inproceedings{140d25a9567249ada3e6de51f41217f2,
title = "Duplicate Table Detection with Xash",
abstract = "Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.",
keywords = "data discovery, data lakes, duplicate table detection",
author = "Maximilian Koch and Mahdi Esmailoghli and S{\"o}ren Auer and Ziawasch Abedjan",
note = "Funding Information: Future improvements for table de-duplication could be to consider hashing for the grouping phase of large table corpora and to devise algorithms that are independent of lake indexes. Furthermore, it would be interesting to research fuzzy table duplicates. Our current approaches consider tables to be duplicates only when two tables contain the same set of tuples regardless of row and column order. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445. ; 2023 Datenbanksysteme fur Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 ; Conference date: 06-03-2023 Through 10-03-2023",
year = "2023",
doi = "10.18420/BTW2023-18",
language = "English",
series = "Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)",
publisher = "Gesellschaft fur Informatik (GI)",
pages = "367--390",
editor = "Birgitta Konig-Ries and Stefanie Scherzinger and Wolfgang Lehner and Gottfried Vossen",
booktitle = "Datenbanksysteme fur Business, Technologie und Web",
address = "Germany",

}

TY - GEN
T1 - Duplicate Table Detection with Xash
AU - Koch, Maximilian
AU - Esmailoghli, Mahdi
AU - Auer, Sören
AU - Abedjan, Ziawasch
N1 - Funding Information: Future improvements for table de-duplication could be to consider hashing for the grouping phase of large table corpora and to devise algorithms that are independent of lake indexes. Furthermore, it would be interesting to research fuzzy table duplicates. Our current approaches consider tables to be duplicates only when two tables contain the same set of tuples regardless of row and column order. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.
PY - 2023
Y1 - 2023
N2 - Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
AB - Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
KW - data discovery
KW - data lakes
KW - duplicate table detection
UR - http://www.scopus.com/inward/record.url?scp=85149970671&partnerID=8YFLogxK
U2 - 10.18420/BTW2023-18
DO - 10.18420/BTW2023-18
M3 - Conference contribution
AN - SCOPUS:85149970671
T3 - Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft fur Informatik (GI)
SP - 367
EP - 390
BT - Datenbanksysteme fur Business, Technologie und Web
A2 - Konig-Ries, Birgitta
A2 - Scherzinger, Stefanie
A2 - Lehner, Wolfgang
A2 - Vossen, Gottfried
PB - Gesellschaft fur Informatik (GI)
T2 - 2023 Datenbanksysteme fur Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023
Y2 - 6 March 2023 through 10 March 2023
ER -
