Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Siamak Ghodsi
  • Eirini Ntoutsi

Research Organisations

External Research Organisations

  • Freie Universität Berlin (FU Berlin)
  • Universität der Bundeswehr München

Details

Original language: English
Title of host publication: EWAF`23
Subtitle of host publication: European Workshop on Algorithmic Fairness
Publication status: Published - 16 Jul 2023
Event: 2nd European Workshop on Algorithmic Fairness, EWAF 2023 - Winterthur, Switzerland
Duration: 7 Jun 2023 - 9 Jun 2023

Publication series

Name: CEUR Workshop Proceedings
Publisher: CEUR Workshop Proceedings
Volume: 3442
ISSN (Print): 1613-0073

Abstract

Group imbalance, usually caused by insufficient or unrepresentative data collection procedures, is among the main reasons for the emergence of representation bias in datasets. Representation bias can exist with respect to different groups of one or more protected attributes and can lead to prejudicial and discriminatory outcomes toward certain groups of individuals if a learning model is trained on such biased data. In this paper, we propose MASC, a data augmentation approach based on affinity clustering of existing data from similar datasets. A target dataset borrows protected-group instances from neighboring datasets that lie in the same cluster, in order to balance the cardinalities of its non-protected and protected groups. To form clusters in which datasets can share instances for protected-group augmentation, an affinity clustering pipeline is developed around an affinity matrix. The affinity matrix is constructed by computing the distribution discrepancy between each pair of datasets and translating these discrepancies into a symmetric pairwise similarity matrix. A non-parametric spectral clustering is then applied to the affinity matrix, and the datasets are automatically grouped into an optimal number of clusters. We present a step-by-step experiment that demonstrates the proposed augmentation procedure and evaluates and discusses its performance. In addition, we compare our method against other data augmentation methods before and after augmentation, and analyze the model performance of each competitor relative to ours. In our experiments, bias is measured in a non-binary protected-attribute setup with respect to the distribution of racial groups, comparing two separate minority groups with the majority group before and after debiasing. Empirical results indicate that augmenting biased datasets with real (genuine) data from similar contexts can effectively debias the target datasets, comparably to existing data augmentation strategies.
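
The pipeline described in the abstract (pairwise distribution discrepancy, symmetric affinity matrix, non-parametric spectral clustering, cross-dataset protected-group augmentation) can be pictured with the minimal Python sketch below. This is not the authors' implementation: the RBF-kernel MMD estimator, the Gaussian mapping from discrepancies to affinities, the eigengap heuristic for choosing the number of clusters, and every function name and parameter are illustrative assumptions made for this sketch.

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

def mmd_rbf(X, Y, gamma=1.0):
    # Biased empirical Maximum Mean Discrepancy between samples X and Y (RBF kernel).
    return (rbf_kernel(X, X, gamma=gamma).mean()
            + rbf_kernel(Y, Y, gamma=gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma=gamma).mean())

def build_affinity(datasets, gamma=1.0):
    # Pairwise discrepancies, then a Gaussian transform into a symmetric similarity
    # matrix (high similarity = low discrepancy); one plausible reading of the abstract.
    n = len(datasets)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = mmd_rbf(datasets[i], datasets[j], gamma)
    sigma = np.median(D[D > 0]) if np.any(D > 0) else 1.0
    return np.exp(-(D ** 2) / (2.0 * sigma ** 2))

def eigengap_n_clusters(A, max_k=8):
    # Pick the number of clusters from the largest eigengap of the normalized Laplacian;
    # a common "non-parametric" way to choose k automatically (an assumption here).
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))
    w = np.linalg.eigvalsh(L)[:max_k]
    return int(np.argmax(np.diff(w))) + 1

# Toy usage: five synthetic "datasets", cluster them, then find neighbours of a target.
rng = np.random.default_rng(0)
datasets = [rng.normal(loc=i % 2, scale=1.0, size=(200, 4)) for i in range(5)]
A = build_affinity(datasets)
k = max(2, eigengap_n_clusters(A))
labels = SpectralClustering(n_clusters=k, affinity="precomputed",
                            random_state=0).fit_predict(A)

target = 0
neighbours = [i for i, lab in enumerate(labels) if lab == labels[target] and i != target]
print("clusters:", labels, "-> cluster neighbours of the target dataset:", neighbours)

In the paper's setting, the identified cluster neighbours would supply real protected-group instances to balance the target dataset; this sketch stops at identifying those neighbours.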

Keywords

    Affinity Clustering, Bias & Fairness, Data augmentation, Data Debiasing, Distribution Shift, Maximum Mean Discrepancy

Cite this

Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy. / Ghodsi, Siamak; Ntoutsi, Eirini.
EWAF`23: European Workshop on Algorithmic Fairness. 2023. (CEUR Workshop Proceedings; Vol. 3442).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Ghodsi, S & Ntoutsi, E 2023, Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy. in EWAF`23: European Workshop on Algorithmic Fairness. CEUR Workshop Proceedings, vol. 3442, 2nd European Workshop on Algorithmic Fairness, EWAF 2023, Winterthur, Switzerland, 7 Jun 2023. <http://CEUR-WS.org/Vol-3442/paper-10.pdf>
Ghodsi, S., & Ntoutsi, E. (2023). Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy. In EWAF`23: European Workshop on Algorithmic Fairness (CEUR Workshop Proceedings; Vol. 3442). http://CEUR-WS.org/Vol-3442/paper-10.pdf
Ghodsi S, Ntoutsi E. Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy. In EWAF`23: European Workshop on Algorithmic Fairness. 2023. (CEUR Workshop Proceedings).
Ghodsi, Siamak ; Ntoutsi, Eirini. / Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy. EWAF`23: European Workshop on Algorithmic Fairness. 2023. (CEUR Workshop Proceedings).
BibTeX
@inproceedings{1bd7c7b2b6bb4ed9aa35a1444466d2e2,
title = "Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy",
abstract = "Group imbalance usually caused by insufficient or unrepresentative data collection procedures, is among the main reasons for the emergence of representation bias in datasets. Representation bias can exist with respect to different groups of one or more protected attributes and might lead to prejudicial and discriminatory outcomes toward certain groups of individuals; in case if a learning model is trained on such biased data. In this paper, we propose MASC a data augmentation approach based on affinity clustering of existing data in similar datasets. An arbitrary target dataset utilizes protected group instances of other neighboring datasets that locate in the same cluster, in order to balance out the cardinality of its non-protected and protected groups. To form clusters where datasets can share instances for protected-group augmentation, an affinity clustering pipeline is developed based on an affinity matrix. The formation of the affinity matrix relies on computing the discrepancy of distributions between each pair of datasets and translating these discrepancies into a symmetric pairwise similarity matrix. Furthermore, a non-parametric spectral clustering is applied to the affinity matrix and the corresponding datasets are categorized into an optimal number of clusters automatically. We perform a step-by-step experiment as a demo of our method to both show the procedure of the proposed data augmentation method and also to evaluate and discuss its performance. In addition, a comparison to other data augmentation methods before and after the augmentations are provided as well as model evaluation performance analysis of each of the competitors compared to our method. In our experiments, bias is measured in a non-binary protected attribute setup w.r.t. racial groups distribution for two separate minority groups in comparison with the majority group before and after debiasing. Empirical results imply that our method of augmenting dataset biases using real (genuine) data from similar contexts can effectively debias the target datasets comparably to existing data augmentation strategies.",
keywords = "Affinity Clustering, Bias & Fairness, Data augmentation, Data Debiasing, Distribution Shift, Maximum Mean Discrepancy",
author = "Siamak Ghodsi and Eirini Ntoutsi",
note = "Funding Information: This work has received funding from the European Union{\textquoteright}s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie Actions (grant agreement number 860630) for the project {\textquoteleft}{\textquoteright}NoBIAS - Artificial Intelligence without Bias{\textquoteright}{\textquoteright}. This work reflects only the authors{\textquoteright} views and the European Research Executive Agency (REA) is not responsible for any use that may be made of the information it contains. ; 2nd European Workshop on Algorithmic Fairness, EWAF 2023 ; Conference date: 07-06-2023 Through 09-06-2023",
year = "2023",
month = jul,
day = "16",
language = "English",
series = "CEUR Workshop Proceedings",
publisher = "CEUR Workshop Proceedings",
booktitle = "EWAF`23",

}

RIS

TY - GEN

T1 - Affinity Clustering Framework for Data Debiasing Using Pairwise Distribution Discrepancy

AU - Ghodsi, Siamak

AU - Ntoutsi, Eirini

N1 - Funding Information: This work has received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Sklodowska-Curie Actions (grant agreement number 860630) for the project ‘’NoBIAS - Artificial Intelligence without Bias’’. This work reflects only the authors’ views and the European Research Executive Agency (REA) is not responsible for any use that may be made of the information it contains.

PY - 2023/7/16

Y1 - 2023/7/16

N2 - Group imbalance, usually caused by insufficient or unrepresentative data collection procedures, is among the main reasons for the emergence of representation bias in datasets. Representation bias can exist with respect to different groups of one or more protected attributes and can lead to prejudicial and discriminatory outcomes toward certain groups of individuals if a learning model is trained on such biased data. In this paper, we propose MASC, a data augmentation approach based on affinity clustering of existing data from similar datasets. A target dataset borrows protected-group instances from neighboring datasets that lie in the same cluster, in order to balance the cardinalities of its non-protected and protected groups. To form clusters in which datasets can share instances for protected-group augmentation, an affinity clustering pipeline is developed around an affinity matrix. The affinity matrix is constructed by computing the distribution discrepancy between each pair of datasets and translating these discrepancies into a symmetric pairwise similarity matrix. A non-parametric spectral clustering is then applied to the affinity matrix, and the datasets are automatically grouped into an optimal number of clusters. We present a step-by-step experiment that demonstrates the proposed augmentation procedure and evaluates and discusses its performance. In addition, we compare our method against other data augmentation methods before and after augmentation, and analyze the model performance of each competitor relative to ours. In our experiments, bias is measured in a non-binary protected-attribute setup with respect to the distribution of racial groups, comparing two separate minority groups with the majority group before and after debiasing. Empirical results indicate that augmenting biased datasets with real (genuine) data from similar contexts can effectively debias the target datasets, comparably to existing data augmentation strategies.

AB - Group imbalance, usually caused by insufficient or unrepresentative data collection procedures, is among the main reasons for the emergence of representation bias in datasets. Representation bias can exist with respect to different groups of one or more protected attributes and can lead to prejudicial and discriminatory outcomes toward certain groups of individuals if a learning model is trained on such biased data. In this paper, we propose MASC, a data augmentation approach based on affinity clustering of existing data from similar datasets. A target dataset borrows protected-group instances from neighboring datasets that lie in the same cluster, in order to balance the cardinalities of its non-protected and protected groups. To form clusters in which datasets can share instances for protected-group augmentation, an affinity clustering pipeline is developed around an affinity matrix. The affinity matrix is constructed by computing the distribution discrepancy between each pair of datasets and translating these discrepancies into a symmetric pairwise similarity matrix. A non-parametric spectral clustering is then applied to the affinity matrix, and the datasets are automatically grouped into an optimal number of clusters. We present a step-by-step experiment that demonstrates the proposed augmentation procedure and evaluates and discusses its performance. In addition, we compare our method against other data augmentation methods before and after augmentation, and analyze the model performance of each competitor relative to ours. In our experiments, bias is measured in a non-binary protected-attribute setup with respect to the distribution of racial groups, comparing two separate minority groups with the majority group before and after debiasing. Empirical results indicate that augmenting biased datasets with real (genuine) data from similar contexts can effectively debias the target datasets, comparably to existing data augmentation strategies.

KW - Affinity Clustering

KW - Bias & Fairness

KW - Data augmentation

KW - Data Debiasing

KW - Distribution Shift

KW - Maximum Mean Discrepancy

UR - http://www.scopus.com/inward/record.url?scp=85168309574&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85168309574

T3 - CEUR Workshop Proceedings

BT - EWAF`23

T2 - 2nd European Workshop on Algorithmic Fairness, EWAF 2023

Y2 - 7 June 2023 through 9 June 2023

ER -