Blocking music metadata from heterogenous data sources

Oliver Pabst; Udo W. Lipeck

Details

Original language	English
Title of host publication	Grundlagen von Datenbanken
Subtitle of host publication	Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken
Pages	53-58
Number of pages	6
Publication status	Published - 29 Jun 2018
Event	30th GI-Workshop on the Foundations of Databases, GvDB 2018, Wuppertal, Germany
Duration	22 May 2018 → 25 May 2018

Publication series

Name	CEUR Workshop Proceedings
Volume	2126
ISSN (Print)	1613-0073

Abstract

Entity resolution, also called object matching, is the task of identifying different objects that describe the same real-world object. It is used in a variety of technical systems, e.g. systems that fuse different data sources. In this context, blocking reduces the total number of comparisons by grouping similar objects in the same cluster and dissimilar objects in different clusters, so that only objects within the same cluster have to be compared with each other. To deal with noise, for instance spelling errors, that can arise in heterogeneous data sources, various blocking approaches exist that may add redundancy to the data or remove it. In this paper we propose a system that uses a derivative of the standard blocking technique to compute correspondences between objects as starting points for a graph matching process. The blocking technique, which usually relies on the identity of blocking keys derived from attributes, is modified to cope with heterogeneous source data with few attributes suitable for matching. A common criticism of standard blocking is low efficiency, since the block sizes are unbalanced with regard to the number of contained entities. We take precautions to keep the efficiency high by reducing the size and number of large partitions. Copyright is held by the author/owner(s).
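
As a rough illustration of the standard blocking idea summarized in the abstract, the following Python sketch groups records by an identical blocking key and compares only records within the same block. It is not the authors' system: the key function (the first three alphanumeric characters of the lower-cased title), the function names, and the toy records are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical key: first three alphanumeric characters of the title."""
    normalized = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return normalized[:3]


def build_blocks(records):
    """Standard blocking: records with identical keys share a block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield only pairs of records that fall into the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)


# Toy records, invented for illustration only.
records = [
    {"id": 1, "title": "Yesterday"},
    {"id": 2, "title": "yesterday (Remastered)"},
    {"id": 3, "title": "Yellow Submarine"},
    {"id": 4, "title": "Hey Jude"},
    {"id": 5, "title": "Hey  Jude"},
    {"id": 6, "title": "Help!"},
]

blocks = build_blocks(records)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
all_pairs = len(records) * (len(records) - 1) // 2

print(f"comparisons without blocking: {all_pairs}")
print(f"comparisons with blocking:    {len(pairs)} {pairs}")
print("block sizes:", {key: len(members) for key, members in blocks.items()})
```

Because this key relies on exact identity, a spelling error within the first three characters would place a record in a different block and the match would be missed. A very common key would produce one oversized block that dominates the comparison cost, which is the imbalance the abstract refers to.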

Keywords

Blocking, Entity resolution, Matching

ASJC Scopus subject areas

Computer Science (all)
General Computer Science

Cite this

Blocking music metadata from heterogenous data sources. / Pabst, Oliver; Lipeck, Udo W.
Grundlagen von Datenbanken: Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken. 2018. p. 53-58 (CEUR Workshop Proceedings; Vol. 2126).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Pabst, O & Lipeck, UW 2018, Blocking music metadata from heterogenous data sources. in Grundlagen von Datenbanken: Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken. CEUR Workshop Proceedings, vol. 2126, pp. 53-58, 30th GI-Workshop on the Foundations of Databases, GvDB 2018, Wuppertal, Germany, 22 May 2018. <https://ceur-ws.org/Vol-2126/>

Pabst, O., & Lipeck, U. W. (2018). Blocking music metadata from heterogenous data sources. In Grundlagen von Datenbanken: Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken (pp. 53-58). (CEUR Workshop Proceedings; Vol. 2126). https://ceur-ws.org/Vol-2126/

Pabst O, Lipeck UW. Blocking music metadata from heterogenous data sources. In Grundlagen von Datenbanken: Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken. 2018. p. 53-58. (CEUR Workshop Proceedings).

Pabst, Oliver ; Lipeck, Udo W. / Blocking music metadata from heterogenous data sources. Grundlagen von Datenbanken: Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken. 2018. pp. 53-58 (CEUR Workshop Proceedings).

@inproceedings{4ef83fa7af9f49c3b6018dd700fbd012,

title = "Blocking music metadata from heterogenous data sources",

abstract = "Entity resolution, also called object matching, is the task of identifying different objects that describe the same real-world object. It is used in a variety of technical systems, e.g. systems that fuse different data sources. In this context, blocking reduces the total number of comparisons by grouping similar objects in the same cluster and dissimilar objects in different clusters, so that only objects within the same cluster have to be compared with each other. To deal with noise, for instance spelling errors, that can arise in heterogeneous data sources, various blocking approaches exist that may add redundancy to the data or remove it. In this paper we propose a system that uses a derivative of the standard blocking technique to compute correspondences between objects as starting points for a graph matching process. The blocking technique, which usually relies on the identity of blocking keys derived from attributes, is modified to cope with heterogeneous source data with few attributes suitable for matching. A common criticism of standard blocking is low efficiency, since the block sizes are unbalanced with regard to the number of contained entities. We take precautions to keep the efficiency high by reducing the size and number of large partitions. Copyright is held by the author/owner(s).",

keywords = "Blocking, Entity resolution, Matching",

author = "Oliver Pabst and Lipeck, {Udo W.}",

year = "2018",

month = jun,

day = "29",

language = "English",

series = "CEUR Workshop Proceedings",

pages = "53--58",

booktitle = "Grundlagen von Datenbanken",

}

TY - GEN

T1 - Blocking music metadata from heterogenous data sources

AU - Pabst, Oliver

AU - Lipeck, Udo W.

PY - 2018/6/29

Y1 - 2018/6/29

AB - Entity resolution, also called object matching, is the task of identifying different objects that describe the same real-world object. It is used in a variety of technical systems, e.g. systems that fuse different data sources. In this context, blocking reduces the total number of comparisons by grouping similar objects in the same cluster and dissimilar objects in different clusters, so that only objects within the same cluster have to be compared with each other. To deal with noise, for instance spelling errors, that can arise in heterogeneous data sources, various blocking approaches exist that may add redundancy to the data or remove it. In this paper we propose a system that uses a derivative of the standard blocking technique to compute correspondences between objects as starting points for a graph matching process. The blocking technique, which usually relies on the identity of blocking keys derived from attributes, is modified to cope with heterogeneous source data with few attributes suitable for matching. A common criticism of standard blocking is low efficiency, since the block sizes are unbalanced with regard to the number of contained entities. We take precautions to keep the efficiency high by reducing the size and number of large partitions. Copyright is held by the author/owner(s).

KW - Blocking

KW - Entity resolution

KW - Matching

UR - http://www.scopus.com/inward/record.url?scp=85049799546&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85049799546

T3 - CEUR Workshop Proceedings

SP - 53

EP - 58

BT - Grundlagen von Datenbanken

T2 - 30th GI-Workshop on the Foundations of Databases, GvDB 2018

Y2 - 22 May 2018 through 25 May 2018

ER -
