Transfer learning for molecular property predictions from small datasets

Publication: Contribution to journal › Article › Research › Peer-reviewed

Authors

  • Thorren Kirschbaum
  • Annika Bande

Organisational units

External organisations

  • Helmholtz-Zentrum Berlin für Materialien und Energie GmbH

Details

Original language: English
Article number: 105119
Number of pages: 9
Journal: AIP Advances
Volume: 14
Issue number: 10
Early online date: 14 Oct 2024
Publication status: Published - Oct 2024

Abstract

Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small datasets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small datasets, for which the best results are obtained with the message passing neural network PaiNN as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large datasets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original datasets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both datasets are normalized to mean zero and standard deviation one to align the labels’ distributions. This study covers two small chemistry datasets, the Harvard Organic Photovoltaics dataset (HOPV, HOMO-LUMO-gaps), for which excellent results are obtained, and the FreeSolv dataset (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV dataset, the final training results do not improve monotonically with the size of the pre-training dataset, but pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.
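
The transfer-learning recipe summarised in the abstract (standardise both the cheap pre-training labels and the target labels to mean zero and standard deviation one, pre-train on the large dataset, then fine-tune on the small one) can be illustrated with a short sketch. The code below is not the authors' implementation: a plain PyTorch MLP stands in for PaiNN, and all dataset sizes, feature shapes, epoch counts, and learning rates are assumptions chosen only for illustration.

```python
# Minimal sketch of label-normalised pre-training followed by fine-tuning.
# The MLP below is a placeholder for PaiNN; data shapes and hyperparameters
# are hypothetical and not taken from the paper.
import torch
import torch.nn as nn

def standardize(y: torch.Tensor) -> tuple[torch.Tensor, float, float]:
    """Normalise labels to mean zero and standard deviation one."""
    mu, sigma = y.mean().item(), y.std().item()
    return (y - mu) / sigma, mu, sigma

def train(model: nn.Module, x: torch.Tensor, y: torch.Tensor,
          epochs: int, lr: float) -> nn.Module:
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x).squeeze(-1), y)
        loss.backward()
        opt.step()
    return model

# Placeholder regressor (PaiNN would operate on molecular graphs instead).
model = nn.Sequential(nn.Linear(64, 128), nn.SiLU(), nn.Linear(128, 1))

# Hypothetical data: a large set with cheap (ab initio / semi-empirical)
# labels for pre-training and a small target set for fine-tuning.
x_pre, y_pre = torch.randn(5000, 64), torch.randn(5000)
x_fine, y_fine = torch.randn(300, 64), torch.randn(300)

# 1) Pre-train on the large dataset with standardised labels.
y_pre_std, _, _ = standardize(y_pre)
model = train(model, x_pre, y_pre_std, epochs=100, lr=1e-3)

# 2) Fine-tune on the small dataset, again with standardised labels,
#    typically at a lower learning rate.
y_fine_std, mu, sigma = standardize(y_fine)
model = train(model, x_fine, y_fine_std, epochs=50, lr=1e-4)

# Predictions are de-normalised with the fine-tuning label statistics.
with torch.no_grad():
    y_pred = model(x_fine).squeeze(-1) * sigma + mu
```

Standardising both label sets is what allows the pre-trained weights to transfer even when the cheap and expensive labels live on different absolute scales; only the relative ordering learned during pre-training needs to carry over.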

ASJC Scopus subject areas

Cite

Transfer learning for molecular property predictions from small datasets. / Kirschbaum, Thorren; Bande, Annika.
In: AIP Advances, Vol. 14, No. 10, 105119, 10.2024.


Kirschbaum T, Bande A. Transfer learning for molecular property predictions from small datasets. AIP Advances. 2024 Oct;14(10):105119. Epub 2024 Oct 14. doi: 10.48550/arXiv.2404.13393, 10.1063/5.0214754
Kirschbaum, Thorren ; Bande, Annika. / Transfer learning for molecular property predictions from small datasets. In: AIP Advances. 2024 ; Vol. 14, No. 10.
BibTeX
@article{1221151c67ac44be989f65b361f14611,
title = "Transfer learning for molecular property predictions from small datasets",
abstract = "Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small datasets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small datasets, for which the best results are obtained with the message passing neural network PaiNN as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large datasets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original datasets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both datasets are normalized to mean zero and standard deviation one to align the labels{\textquoteright} distributions. This study covers two small chemistry datasets, the Harvard Organic Photovoltaics dataset (HOPV, HOMO-LUMO-gaps), for which excellent results are obtained, and the FreeSolv dataset (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV dataset, the final training results do not improve monotonically with the size of the pre-training dataset, but pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.",
author = "Thorren Kirschbaum and Annika Bande",
note = "Publisher Copyright: {\textcopyright} 2024 Author(s).",
year = "2024",
month = oct,
doi = "10.48550/arXiv.2404.13393",
language = "English",
volume = "14",
journal = "AIP Advances",
issn = "2158-3226",
publisher = "American Institute of Physics",
number = "10",

}

RIS

TY - JOUR

T1 - Transfer learning for molecular property predictions from small datasets

AU - Kirschbaum, Thorren

AU - Bande, Annika

N1 - Publisher Copyright: © 2024 Author(s).

PY - 2024/10

Y1 - 2024/10

N2 - Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small datasets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small datasets, for which the best results are obtained with the message passing neural network PaiNN as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large datasets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original datasets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both datasets are normalized to mean zero and standard deviation one to align the labels’ distributions. This study covers two small chemistry datasets, the Harvard Organic Photovoltaics dataset (HOPV, HOMO-LUMO-gaps), for which excellent results are obtained, and the FreeSolv dataset (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV dataset, the final training results do not improve monotonically with the size of the pre-training dataset, but pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.

AB - Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small datasets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small datasets, for which the best results are obtained with the message passing neural network PaiNN as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large datasets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original datasets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both datasets are normalized to mean zero and standard deviation one to align the labels’ distributions. This study covers two small chemistry datasets, the Harvard Organic Photovoltaics dataset (HOPV, HOMO-LUMO-gaps), for which excellent results are obtained, and the FreeSolv dataset (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV dataset, the final training results do not improve monotonically with the size of the pre-training dataset, but pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.

UR - http://www.scopus.com/inward/record.url?scp=85206460904&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2404.13393

DO - 10.48550/arXiv.2404.13393

M3 - Article

AN - SCOPUS:85206460904

VL - 14

JO - AIP Advances

JF - AIP Advances

SN - 2158-3226

IS - 10

M1 - 105119

ER -