Enabling data-centric AI through data quality management and data literacy

Ziawasch Abedjan

doi:10.1515/itit-2021-0048

Details

Original language	English
Pages (from-to)	67-70
Number of pages	4
Journal	IT - Information Technology
Volume	64
Issue number	1-2
Publication status	Published - 1 Apr 2022

Abstract

Data is being produced at an intractable pace. At the same time, there is an insatiable interest in using such data for use cases that span all imaginable domains, including health, climate, business, and gaming. Beyond the novel socio-technical challenges that surround data-driven innovations, there are still open data processing challenges that impede the usability of data-driven techniques. It is commonly acknowledged that overcoming heterogeneity of data with regard to syntax and semantics to combine various sources for a common goal is a major bottleneck. Furthermore, the quality of such data is always under question as the data science pipelines today are highly ad-hoc and without the necessary care for provenance. Finally, quality criteria that go beyond the syntactical and semantic correctness of individual values but also incorporate population-level constraints, such as equal parity and opportunity with regard to protected groups, play a more and more important role in this process. Traditional research on data integration was focused on post-merger integration of companies, where customer or product databases had to be integrated. While this is often hard enough, today the challenges aggravate because of the fact that more stakeholders are using data analytics tools to derive domain-specific insights. I call this phenomenon the democratization of data science, a process, which is both challenging and necessary. Novel systems need to be user-friendly in a way that not only trained database admins can handle them but also less computer science savvy stakeholders. Thus, our research focuses on scalable example-driven techniques for data preparation and curation. Furthermore, we believe that it is important to educate the breadth of society on implications of a data-driven world and actively promote the concept of data literacy as a fundamental competence.

Keywords

Data cleaning, Data discovery, Data literacy, Data preparation, Data profiling, Feature engineering

ASJC Scopus subject areas

Computer Science(all)
General Computer Science

Sustainable Development Goals

SDG 16 - Peace, Justice and Strong Institutions

Cite this

Enabling data-centric AI through data quality management and data literacy. / Abedjan, Ziawasch.
In: IT - Information Technology, Vol. 64, No. 1-2, 01.04.2022, p. 67-70.

Research output: Contribution to journal › Article › Research › peer review

Abedjan, Z 2022, 'Enabling data-centric AI through data quality management and data literacy', IT - Information Technology, vol. 64, no. 1-2, pp. 67-70. https://doi.org/10.1515/itit-2021-0048

Abedjan, Z. (2022). Enabling data-centric AI through data quality management and data literacy. IT - Information Technology, 64(1-2), 67-70. https://doi.org/10.1515/itit-2021-0048

Abedjan Z. Enabling data-centric AI through data quality management and data literacy. IT - Information Technology. 2022 Apr 1;64(1-2):67-70. doi: 10.1515/itit-2021-0048

Abedjan, Ziawasch. / Enabling data-centric AI through data quality management and data literacy. In: IT - Information Technology. 2022 ; Vol. 64, No. 1-2. pp. 67-70.

@article{d7e623872cd243dd8e762bc8ce6a77a6,

title = "Enabling data-centric AI through data quality management and data literacy",

abstract = "Data is being produced at an intractable pace. At the same time, there is an insatiable interest in using such data for use cases that span all imaginable domains, including health, climate, business, and gaming. Beyond the novel socio-technical challenges that surround data-driven innovations, there are still open data processing challenges that impede the usability of data-driven techniques. It is commonly acknowledged that overcoming heterogeneity of data with regard to syntax and semantics to combine various sources for a common goal is a major bottleneck. Furthermore, the quality of such data is always under question as the data science pipelines today are highly ad-hoc and without the necessary care for provenance. Finally, quality criteria that go beyond the syntactical and semantic correctness of individual values but also incorporate population-level constraints, such as equal parity and opportunity with regard to protected groups, play a more and more important role in this process. Traditional research on data integration was focused on post-merger integration of companies, where customer or product databases had to be integrated. While this is often hard enough, today the challenges aggravate because of the fact that more stakeholders are using data analytics tools to derive domain-specific insights. I call this phenomenon the democratization of data science, a process, which is both challenging and necessary. Novel systems need to be user-friendly in a way that not only trained database admins can handle them but also less computer science savvy stakeholders. Thus, our research focuses on scalable example-driven techniques for data preparation and curation. Furthermore, we believe that it is important to educate the breadth of society on implications of a data-driven world and actively promote the concept of data literacy as a fundamental competence.",

keywords = "Data cleaning, Data discovery, Data literacy, Data preparation, Data profiling, Feature engineering",

author = "Ziawasch Abedjan",

journal = "IT - Information Technology",

issn = "1611-2776",

year = "2022",

month = apr,

day = "1",

doi = "10.1515/itit-2021-0048",

language = "English",

volume = "64",

pages = "67--70",

number = "1-2",

}

TY - JOUR

T1 - Enabling data-centric AI through data quality management and data literacy

AU - Abedjan, Ziawasch

PY - 2022/4/1

Y1 - 2022/4/1

AB - Data is being produced at an intractable pace. At the same time, there is an insatiable interest in using such data for use cases that span all imaginable domains, including health, climate, business, and gaming. Beyond the novel socio-technical challenges that surround data-driven innovations, there are still open data processing challenges that impede the usability of data-driven techniques. It is commonly acknowledged that overcoming heterogeneity of data with regard to syntax and semantics to combine various sources for a common goal is a major bottleneck. Furthermore, the quality of such data is always under question as the data science pipelines today are highly ad-hoc and without the necessary care for provenance. Finally, quality criteria that go beyond the syntactical and semantic correctness of individual values but also incorporate population-level constraints, such as equal parity and opportunity with regard to protected groups, play a more and more important role in this process. Traditional research on data integration was focused on post-merger integration of companies, where customer or product databases had to be integrated. While this is often hard enough, today the challenges aggravate because of the fact that more stakeholders are using data analytics tools to derive domain-specific insights. I call this phenomenon the democratization of data science, a process, which is both challenging and necessary. Novel systems need to be user-friendly in a way that not only trained database admins can handle them but also less computer science savvy stakeholders. Thus, our research focuses on scalable example-driven techniques for data preparation and curation. Furthermore, we believe that it is important to educate the breadth of society on implications of a data-driven world and actively promote the concept of data literacy as a fundamental competence.

KW - Data cleaning

KW - Data discovery

KW - Data literacy

KW - Data preparation

KW - Data profiling

KW - Feature engineering

UR - http://www.scopus.com/inward/record.url?scp=85126048406&partnerID=8YFLogxK

U2 - 10.1515/itit-2021-0048

DO - 10.1515/itit-2021-0048

M3 - Article

AN - SCOPUS:85126048406

VL - 64

SP - 67

EP - 70

JO - IT - Information Technology

JF - IT - Information Technology

SN - 1611-2776

IS - 1-2

ER -

Research@Leibniz University