Duplicate Table Detection with Xash

Maximilian Koch; Mahdi Esmailoghli; Sören Auer; Ziawasch Abedjan

doi:10.18420/BTW2023-18

Details

Original language	English
Title of host publication	Datenbanksysteme für Business, Technologie und Web
Subtitle of host publication	BTW 2023
Editors	Birgitta König-Ries, Stefanie Scherzinger, Wolfgang Lehner, Gottfried Vossen
Publisher	Gesellschaft für Informatik (GI)
Pages	367-390
Number of pages	24
ISBN (electronic)	9783885797258
Publication status	Published - 2023
Event	2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 - Dresden, Germany
Duration	6 Mar 2023 → 10 Mar 2023

Publication series

Name	Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)
Volume	P-331
ISSN (Print)	1617-5468
ISSN (electronic)	2944-7682

Abstract

Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.
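The super-key idea described in the abstract can be illustrated with a minimal sketch. The sketch rests on several assumptions: the per-cell hash below is a generic SHA-256 based stand-in, not the actual Xash function, and the key width `BITS`, the parameter `k`, and all function names are hypothetical choices made for illustration. Each row's super key is the bitwise OR of its cell hashes, so it does not depend on column order, and comparing the multisets of row keys ignores row order.

```python
import hashlib
from collections import Counter

BITS = 128  # assumed super-key width; not taken from the paper


def cell_bits(value: str, k: int = 2) -> int:
    """Map a cell value to a BITS-wide mask with up to k bits set.

    Generic stand-in hash for illustration only; this is NOT Xash.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    mask = 0
    for i in range(k):
        pos = int.from_bytes(digest[4 * i:4 * i + 4], "big") % BITS
        mask |= 1 << pos
    return mask


def super_key(row) -> int:
    """OR the masks of all cells in a row; OR is commutative, so column order is irrelevant."""
    key = 0
    for cell in row:
        key |= cell_bits(str(cell))
    return key


def may_contain(key: int, value: str) -> bool:
    """Bloom-filter style membership test: no false negatives, possible false positives."""
    bits = cell_bits(value)
    return key & bits == bits


def table_signature(rows) -> Counter:
    """Multiset of row super keys; independent of row order."""
    return Counter(super_key(row) for row in rows)


def is_duplicate_candidate(a, b) -> bool:
    """Cheap filter only: a True result still requires exact verification."""
    return len(a) == len(b) and table_signature(a) == table_signature(b)


t1 = [["1", "alice", "berlin"], ["2", "bob", "paris"]]
t2 = [["paris", "2", "bob"], ["berlin", "1", "alice"]]  # same rows, shuffled rows and columns
print(is_duplicate_candidate(t1, t2))  # True
```

Because distinct rows can collide on the same super key, a filter like this can only produce candidates. An exact comparison of the surviving table pairs is still needed to rule out false positives.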

Keywords

data discovery, data lakes, duplicate table detection

ASJC Scopus subject areas

Computer Science(all)
Computer Science Applications

Cite this

Duplicate Table Detection with Xash. / Koch, Maximilian; Esmailoghli, Mahdi; Auer, Sören et al.
Datenbanksysteme für Business, Technologie und Web: BTW 2023. ed. / Birgitta König-Ries; Stefanie Scherzinger; Wolfgang Lehner; Gottfried Vossen. Gesellschaft für Informatik (GI), 2023. p. 367-390 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI); Vol. P-331).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Koch, M, Esmailoghli, M, Auer, S & Abedjan, Z 2023, Duplicate Table Detection with Xash. in B König-Ries, S Scherzinger, W Lehner & G Vossen (eds), Datenbanksysteme für Business, Technologie und Web: BTW 2023. Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI), vol. P-331, Gesellschaft für Informatik (GI), pp. 367-390, 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023, Dresden, Germany, 6 Mar 2023. https://doi.org/10.18420/BTW2023-18

Koch, M., Esmailoghli, M., Auer, S., & Abedjan, Z. (2023). Duplicate Table Detection with Xash. In B. König-Ries, S. Scherzinger, W. Lehner, & G. Vossen (Eds.), Datenbanksysteme für Business, Technologie und Web: BTW 2023 (pp. 367-390). (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI); Vol. P-331). Gesellschaft für Informatik (GI). https://doi.org/10.18420/BTW2023-18

Koch M, Esmailoghli M, Auer S, Abedjan Z. Duplicate Table Detection with Xash. In König-Ries B, Scherzinger S, Lehner W, Vossen G, editors, Datenbanksysteme für Business, Technologie und Web: BTW 2023. Gesellschaft für Informatik (GI). 2023. p. 367-390. (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)). doi: 10.18420/BTW2023-18

Koch, Maximilian ; Esmailoghli, Mahdi ; Auer, Sören et al. / Duplicate Table Detection with Xash. Datenbanksysteme für Business, Technologie und Web: BTW 2023. editor / Birgitta König-Ries ; Stefanie Scherzinger ; Wolfgang Lehner ; Gottfried Vossen. Gesellschaft für Informatik (GI), 2023. pp. 367-390 (Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)).


@inproceedings{140d25a9567249ada3e6de51f41217f2,

title = "Duplicate Table Detection with Xash",

abstract = "Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.",

keywords = "data discovery, data lakes, duplicate table detection",

author = "Maximilian Koch and Mahdi Esmailoghli and S{\"o}ren Auer and Ziawasch Abedjan",

note = "Funding Information: Future improvements for table de-duplication could be to consider hashing for the grouping phase of large table corpora and to devise algorithms that are independent of lake indexes. Furthermore, it would be interesting to research fuzzy table duplicates. Our current approaches consider tables to be duplicates only when two tables contain the same set of tuples regardless of row and column order. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445. ; 2023 Datenbanksysteme f{\"u}r Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023 ; Conference date: 06-03-2023 Through 10-03-2023",

year = "2023",

doi = "10.18420/BTW2023-18",

language = "English",

series = "Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft f{\"u}r Informatik (GI)",

publisher = "Gesellschaft f{\"u}r Informatik (GI)",

pages = "367--390",

editor = "Birgitta K{\"o}nig-Ries and Stefanie Scherzinger and Wolfgang Lehner and Gottfried Vossen",

booktitle = "Datenbanksysteme f{\"u}r Business, Technologie und Web",

address = "Germany",

}


TY - GEN

T1 - Duplicate Table Detection with Xash

AU - Koch, Maximilian

AU - Esmailoghli, Mahdi

AU - Auer, Sören

AU - Abedjan, Ziawasch

N1 - Funding Information: Future improvements for table de-duplication could be to consider hashing for the grouping phase of large table corpora and to devise algorithms that are independent of lake indexes. Furthermore, it would be interesting to research fuzzy table duplicates. Our current approaches consider tables to be duplicates only when two tables contain the same set of tuples regardless of row and column order. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445.

PY - 2023

Y1 - 2023

N2 - Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.

AB - Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly. In comparison to SimHash and other competing hash functions, Xash results in fewer false positive candidates.

KW - data discovery

KW - data lakes

KW - duplicate table detection

UR - http://www.scopus.com/inward/record.url?scp=85149970671&partnerID=8YFLogxK

U2 - 10.18420/BTW2023-18

DO - 10.18420/BTW2023-18

M3 - Conference contribution

AN - SCOPUS:85149970671

T3 - Lecture Notes in Informatics (LNI), Proceedings - Series of the Gesellschaft für Informatik (GI)

SP - 367

EP - 390

BT - Datenbanksysteme für Business, Technologie und Web

A2 - König-Ries, Birgitta

A2 - Scherzinger, Stefanie

A2 - Lehner, Wolfgang

A2 - Vossen, Gottfried

PB - Gesellschaft für Informatik (GI)

T2 - 2023 Datenbanksysteme für Business, Technologie und Web, BTW 2023 - 2023 Database Systems for Business, Technology and Web, BTW 2023

Y2 - 6 March 2023 through 10 March 2023

ER -


By the same author(s)

DataDesc: A framework for creating and sharing technical metadata for research software interfaces

Organizing Scientific Knowledge from Engineering Sciences Using the Open Research Knowledge Graph: The Tailored Forming Process Chain Use Case

A Neuro-Symbolic Approach for Faceted Search in Digital Libraries

Leveraging GPT Models For Semantic Table Annotation

Managing Comprehensive Research Instrument Descriptions Within a Scholarly Knowledge Graph
