Details
Original language | English |
---|---|
Article number | e1452 |
Number of pages | 59 |
Journal | Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery |
Volume | 12 |
Issue number | 3 |
Early online date | 3 Mar 2022 |
Publication status | Published - 13 May 2022 |
Abstract
As decision-making increasingly relies on machine learning (ML) and (big) data, the issue of fairness in data-driven artificial intelligence systems is receiving increasing attention from both research and industry. A large variety of fairness-aware ML solutions have been proposed which involve fairness-related interventions in the data, learning algorithms, and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware ML. We focus on tabular data as the most common data representation for fairness-aware ML. We start our analysis by identifying relationships between the different attributes, particularly with respect to protected attributes and class attribute, using a Bayesian network. For a deeper understanding of bias in the datasets, we investigate interesting relationships using exploratory analysis. This article is categorized under: Commercial, Legal, and Ethical Issues > Fairness in Data Mining; Fundamental Concepts of Data and Knowledge > Data Concepts; Technologies > Data Preprocessing.
Keywords
- benchmark datasets, bias, datasets for fairness, discrimination, fairness-aware machine learning
ASJC Scopus subject areas
- Computer Science (all)
- General Computer Science
Cite this
In: Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Vol. 12, No. 3, e1452, 13.05.2022.
Research output: Contribution to journal › Review article › Research › peer review
TY - JOUR
T1 - A survey on datasets for fairness-aware machine learning
AU - Le Quy, Tai
AU - Roy, Arjun
AU - Iosifidis, Vasileios
AU - Zhang, Wenbin
AU - Ntoutsi, Eirini
N1 - Funding Information: The work of the first author is supported by the Ministry of Science and Culture of Lower Saxony, Germany, within the PhD program “LernMINT: Data-assisted teaching in the MINT subjects.” The work of the second author is supported by the Volkswagen Foundation under the call “Artificial Intelligence and the Society of the Future” (the BIAS project).
PY - 2022/5/13
Y1 - 2022/5/13
AB - As decision-making increasingly relies on machine learning (ML) and (big) data, the issue of fairness in data-driven artificial intelligence systems is receiving increasing attention from both research and industry. A large variety of fairness-aware ML solutions have been proposed which involve fairness-related interventions in the data, learning algorithms, and/or model outputs. However, a vital part of proposing new approaches is evaluating them empirically on benchmark datasets that represent realistic and diverse settings. Therefore, in this paper, we overview real-world datasets used for fairness-aware ML. We focus on tabular data as the most common data representation for fairness-aware ML. We start our analysis by identifying relationships between the different attributes, particularly with respect to protected attributes and class attribute, using a Bayesian network. For a deeper understanding of bias in the datasets, we investigate interesting relationships using exploratory analysis. This article is categorized under: Commercial, Legal, and Ethical Issues > Fairness in Data Mining; Fundamental Concepts of Data and Knowledge > Data Concepts; Technologies > Data Preprocessing.
KW - benchmark datasets
KW - bias
KW - datasets for fairness
KW - discrimination
KW - fairness-aware machine learning
UR - http://www.scopus.com/inward/record.url?scp=85125031530&partnerID=8YFLogxK
U2 - 10.1002/widm.1452
DO - 10.1002/widm.1452
M3 - Review article
AN - SCOPUS:85125031530
VL - 12
JO - Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
JF - Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
SN - 1942-4787
IS - 3
M1 - e1452
ER -