COCOA: COrrelation COefficient-Aware Data Augmentation

Mahdi Esmailoghli; Jorge-Arnulfo Quiané-Ruiz; Ziawasch Abedjan

doi:10.5441/002/EDBT.2021.30

Details

Original language	English
Title of host publication	Proceedings of the 24th International Conference on Extending Database Technology (EDBT)
Editors	Yannis Velegrakis, Demetris Zeinalipour, Panos K. Chrysanthis, Francesco Guerra
Pages	331-336
Number of pages	6
ISBN (electronic)	978-3-89318-084-4
Publication status	Published - 2021

Publication series

Name	Advances in database technology
ISSN (electronic)	2367-2005

Abstract

Calculating correlation coefficients is one of the most used measures in data science. Although linear correlations are fast and easy to calculate, they lack robustness and effectiveness in the existence of non-linear associations. Rank-based coefficients such as Spearman's are more suitable. However, rank-based measures first require to sort the values and obtain the ranks, making their calculation super-linear. One of the use-cases that is affected by this is data enrichment for Machine Learning (ML) through feature extraction from large databases. Finding the most promising features from millions of candidates to increase the ML accuracy requires billions of correlation calculations. In this paper, we introduce an index structure that ensures rank-based correlation calculation in a linear time. Our solution accelerates the correlation calculation up to 500 times in the data enrichment setting.
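
The cost the abstract refers to can be illustrated with a minimal Python sketch of the textbook Spearman computation: rank both columns, then take the Pearson correlation of the ranks. The sort inside `average_ranks` is the super-linear step. This is not the index structure introduced in the paper, and the function names (`average_ranks`, `pearson`, `spearman`) are chosen here for illustration only.

```python
import math
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    # Sorting the positions by value is the O(n log n) step.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the run [start, end] over all entries tied with order[start].
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain linear (Pearson) correlation, computed in O(n)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's coefficient: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [v * v for v in x]  # monotone but non-linear in x
    print(pearson(x, y))    # below 1: the relationship is not linear
    print(spearman(x, y))   # the ranks agree perfectly
```

Once the ranks are available, the remaining arithmetic is a single linear pass, which is why avoiding the repeated sort matters when the same computation is run over very many candidate columns.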

ASJC Scopus subject areas

Computer Science (all): Software
Computer Science (all): Information Systems
Computer Science (all): Computer Science Applications

Cite this

COCOA: COrrelation COefficient-Aware Data Augmentation. / Esmailoghli, Mahdi; Quiané-Ruiz, Jorge-Arnulfo; Abedjan, Ziawasch.
Proceedings of the 24th International Conference on Extending Database Technology (EDBT). Ed. / Yannis Velegrakis; Demetris Zeinalipour; Panos K. Chrysanthis; Francesco Guerra. 2021. pp. 331-336 (Advances in database technology).

Publication: Chapter in book/report/conference proceedings › Conference contribution › Research › Peer-reviewed

Esmailoghli, M, Quiané-Ruiz, J-A & Abedjan, Z 2021, COCOA: COrrelation COefficient-Aware Data Augmentation. in Y Velegrakis, D Zeinalipour, PK Chrysanthis & F Guerra (eds), Proceedings of the 24th International Conference on Extending Database Technology (EDBT). Advances in database technology, pp. 331-336. https://doi.org/10.5441/002/EDBT.2021.30

Esmailoghli, M., Quiané-Ruiz, J.-A., & Abedjan, Z. (2021). COCOA: COrrelation COefficient-Aware Data Augmentation. In Y. Velegrakis, D. Zeinalipour, P. K. Chrysanthis, & F. Guerra (Eds.), Proceedings of the 24th International Conference on Extending Database Technology (EDBT) (pp. 331-336). (Advances in database technology). https://doi.org/10.5441/002/EDBT.2021.30

Esmailoghli M, Quiané-Ruiz JA, Abedjan Z. COCOA: COrrelation COefficient-Aware Data Augmentation. In Velegrakis Y, Zeinalipour D, Chrysanthis PK, Guerra F, editors, Proceedings of the 24th International Conference on Extending Database Technology (EDBT). 2021. p. 331-336. (Advances in database technology). doi: 10.5441/002/EDBT.2021.30

Esmailoghli, Mahdi ; Quiané-Ruiz, Jorge-Arnulfo ; Abedjan, Ziawasch. / COCOA: COrrelation COefficient-Aware Data Augmentation. Proceedings of the 24th International Conference on Extending Database Technology (EDBT). Ed. / Yannis Velegrakis ; Demetris Zeinalipour ; Panos K. Chrysanthis ; Francesco Guerra. 2021. pp. 331-336 (Advances in database technology).

@inproceedings{16e40dd7f195492ca51413f928912b24,

title = "COCOA: COrrelation COefficient-Aware Data Augmentation",

abstract = "Calculating correlation coefficients is one of the most used measures in data science. Although linear correlations are fast and easy to calculate, they lack robustness and effectiveness in the existence of non-linear associations. Rank-based coefficients such as Spearman's are more suitable. However, rank-based measures first require to sort the values and obtain the ranks, making their calculation super-linear. One of the use-cases that is affected by this is data enrichment for Machine Learning (ML) through feature extraction from large databases. Finding the most promising features from millions of candidates to increase the ML accuracy requires billions of correlation calculations. In this paper, we introduce an index structure that ensures rank-based correlation calculation in a linear time. Our solution accelerates the correlation calculation up to 500 times in the data enrichment setting.",

author = "Mahdi Esmailoghli and Jorge-Arnulfo Quian{\'e}-Ruiz and Ziawasch Abedjan",

note = "Funding Information: We presented Cocoa, a new data enrichment system. It enables the efficient calculation of non-linear correlation coefficients to select the most correlating features for a user-defined ML task. In particular, we introduced an index structure that allows to calculate non-linear correlation coefficients in linear time complexity. Cocoa is designed to be general and hence it can be complemented with other table-based filters or used for any analytic task that depends on value rankings and rank-based scores. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445 and the German Ministry for Education and Research as BIFOLD — “Berlin Institute for the Foundations of Learning and Data” (01IS18025A and 01IS18037A).",

year = "2021",

doi = "10.5441/002/EDBT.2021.30",

language = "English",

series = "Advances in database technology",

pages = "331--336",

editor = "Yannis Velegrakis and Demetris Zeinalipour and Chrysanthis, {Panos K.} and Francesco Guerra",

booktitle = "Proceedings of the 24th International Conference on Extending Database Technology (EDBT)",

}

TY - GEN

T1 - COCOA: COrrelation COefficient-Aware Data Augmentation

AU - Esmailoghli, Mahdi

AU - Quiané-Ruiz, Jorge-Arnulfo

AU - Abedjan, Ziawasch

N1 - Funding Information: We presented Cocoa, a new data enrichment system. It enables the efficient calculation of non-linear correlation coefficients to select the most correlating features for a user-defined ML task. In particular, we introduced an index structure that allows to calculate non-linear correlation coefficients in linear time complexity. Cocoa is designed to be general and hence it can be complemented with other table-based filters or used for any analytic task that depends on value rankings and rank-based scores. Acknowledgements. This project has been supported by the German Research Foundation (DFG) under grant agreement 387872445 and the German Ministry for Education and Research as BIFOLD — “Berlin Institute for the Foundations of Learning and Data” (01IS18025A and 01IS18037A).

PY - 2021

Y1 - 2021

AB - Calculating correlation coefficients is one of the most used measures in data science. Although linear correlations are fast and easy to calculate, they lack robustness and effectiveness in the existence of non-linear associations. Rank-based coefficients such as Spearman's are more suitable. However, rank-based measures first require to sort the values and obtain the ranks, making their calculation super-linear. One of the use-cases that is affected by this is data enrichment for Machine Learning (ML) through feature extraction from large databases. Finding the most promising features from millions of candidates to increase the ML accuracy requires billions of correlation calculations. In this paper, we introduce an index structure that ensures rank-based correlation calculation in a linear time. Our solution accelerates the correlation calculation up to 500 times in the data enrichment setting.

UR - http://www.scopus.com/inward/record.url?scp=85108943340&partnerID=8YFLogxK

U2 - 10.5441/002/EDBT.2021.30

DO - 10.5441/002/EDBT.2021.30

M3 - Conference contribution

T3 - Advances in database technology

SP - 331

EP - 336

BT - Proceedings of the 24th International Conference on Extending Database Technology (EDBT)

A2 - Velegrakis, Yannis

A2 - Zeinalipour, Demetris

A2 - Chrysanthis, Panos K.

A2 - Guerra, Francesco

ER -
