Details
Original language | English |
---|---|
Article number | 102557 |
Number of pages | 23 |
Journal | Information Fusion |
Volume | 112 |
Early online date | 4 Jul 2024 |
Publication status | E-pub ahead of print - 4 Jul 2024 |
Abstract
In many processes, ranging from medical treatments to supply chains and employee management, there is a growing need to gather information with the objective of enhancing the efficiency of the process in question. Often, the information gathered from different stages of a process resides in disparate storage systems, necessitating an information fusion process. Post-fusion, it is common to encounter data inconsistencies that hinder an accurate analysis. Unfortunately, existing data validation languages lack the capability to model constraints across stages, making it challenging to identify inconsistencies without introducing artificial elements. This paper introduces PALADIN, a language which has been specifically designed to allow the formulation of constraints in the realm of process-based data, i.e., data points that evolve through various stages of a process with constraints that change according to the stage at which a data point is. PALADIN is data model-agnostic, which means it is not specific to any particular data model or format. This paper provides a formalization, together with implementation details of PALADIN validators, and their validation through a use case. Furthermore, PALADIN is subjected to an empirical evaluation across 20 datasets, including 18 synthetically generated ones that are openly shared with the scientific community. The experimentation involves 53 testbeds, and shows that PALADIN reduces the data validation time compared with other languages that are not tailored for process-based data—achieving a speed-up of up to five times. The results also highlight the impact of parameters such as the type of data integration system, the number of integrity constraints, and the dataset size on the validation time of PALADIN shape schemas.
Keywords
- Information fusion, Process-based data, Shape-based validation languages
ASJC Scopus subject areas
- Computer Science(all)
- Software
- Signal Processing
- Information Systems
- Hardware and Architecture
Cite this
In: Information Fusion, Vol. 112, 102557, 12.2024.
Research output: Contribution to journal › Article › Research › peer review
TY - JOUR
T1 - PALADIN
T2 - A process-based constraint language for data validation
AU - Diaz-Honrubia, Antonio Jesus
AU - Rohde, Philipp D.
AU - Niazmand, Emetis
AU - Menasalvas, Ernestina
AU - Vidal, Maria Esther
N1 - Publisher Copyright: © 2024 The Author(s)
PY - 2024/7/4
Y1 - 2024/7/4
KW - Information fusion
KW - Process-based data
KW - Shape-based validation languages
UR - http://www.scopus.com/inward/record.url?scp=85198036386&partnerID=8YFLogxK
U2 - 10.1016/j.inffus.2024.102557
DO - 10.1016/j.inffus.2024.102557
M3 - Article
AN - SCOPUS:85198036386
VL - 112
JO - Information Fusion
JF - Information Fusion
SN - 1566-2535
M1 - 102557
ER -