Details
Original language | English |
---|---|
Title of host publication | Grundlagen von Datenbanken |
Subtitle of host publication | Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken |
Pages | 53-58 |
Number of pages | 6 |
Publication status | Published - 29 Jun 2018 |
Event | 30th GI-Workshop on the Foundations of Databases, GvDB 2018, Wuppertal, Germany (22 May 2018 to 25 May 2018) |
Publication series
Name | CEUR Workshop Proceedings |
---|---|
Volume | 2126 |
ISSN (Print) | 1613-0073 |
Abstract
Entity resolution, or object matching, is the task of assigning to each other different objects that describe the same real-world object. It is used in a variety of technical systems, e.g. systems that fuse different data sources. In this context, blocking is an approach that reduces the total number of comparisons by grouping similar objects in the same cluster and dissimilar objects in different clusters. As a result, only the objects within the same cluster have to be compared to each other. To deal with noise, for instance spelling errors, that can result from heterogeneous data sources, various blocking approaches exist that may add redundancy to the data or remove it. In this paper we propose a system that utilizes a derivative of the standard blocking technique to compute correspondences between objects as starting points for a graph matching process. The blocking technique, which usually relies on the identity of blocking keys derived from attributes, is modified to cope with heterogeneous source data that has few attributes suitable for matching. A common criticism of standard blocking is low efficiency, since the block sizes are unbalanced with regard to the number of contained entities. We take precautions to keep the efficiency high by reducing the size and number of large partitions. Copyright is held by the author/owner(s).
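As a rough illustration of the standard blocking idea summarized in the abstract, the following Python sketch groups records by an exact blocking key and compares only records that share a key. It is not the system proposed in the paper: the sample records, the `title` attribute, and the key function (first three characters of the lower-cased title) are assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; the "title" attribute is an assumption.
records = [
    {"id": 1, "title": "Yesterday"},
    {"id": 2, "title": "yesterday (Remastered)"},
    {"id": 3, "title": "Yellow Submarine"},
    {"id": 4, "title": "Let It Be"},
    {"id": 5, "title": "Let it be"},
    {"id": 6, "title": "Help!"},
]


def blocking_key(record):
    """Illustrative key: first three characters of the lower-cased title."""
    return record["title"].strip().lower()[:3]


def build_blocks(records, key_fn=blocking_key):
    """Group records whose blocking keys are identical (standard blocking)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield only pairs of records that fall into the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)


def count_all_pairs(n):
    """Number of comparisons without blocking: n * (n - 1) / 2."""
    return n * (n - 1) // 2


blocks = build_blocks(records)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
print("pairs without blocking:", count_all_pairs(len(records)))
print("pairs with blocking:", len(pairs), pairs)
print("block sizes:", {key: len(members) for key, members in blocks.items()})
```

Because the key must match exactly, a spelling error inside the first three characters would place two matching records in different blocks, and a very common key would produce one oversized block. These are the noise and balance problems the abstract refers to.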
Keywords
- Blocking, Entity resolution, Matching
ASJC Scopus subject areas
- Computer Science(all)
- General Computer Science
Cite this
Blocking music metadata from heterogenous data sources. / Pabst, Oliver; Lipeck, Udo W. Grundlagen von Datenbanken: Proceedings of the 30th GI-Workshop Grundlagen von Datenbanken. 2018. p. 53-58 (CEUR Workshop Proceedings; Vol. 2126).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Blocking music metadata from heterogenous data sources
AU - Pabst, Oliver
AU - Lipeck, Udo W.
N1 - Publisher Copyright: © 2018 CEUR-WS. All rights reserved.
PY - 2018/6/29
Y1 - 2018/6/29
KW - Blocking
KW - Entity resolution
KW - Matching
UR - http://www.scopus.com/inward/record.url?scp=85049799546&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85049799546
T3 - CEUR Workshop Proceedings
SP - 53
EP - 58
BT - Grundlagen von Datenbanken
T2 - 30th GI-Workshop on the Foundations of Databases, GvDB 2018
Y2 - 22 May 2018 through 25 May 2018
ER -