GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

Publication: Contribution to book/report/anthology/conference proceedings › Conference paper › Research › Peer-reviewed

Authors

  • Soumyadeep Roy
  • Jonas Wallat
  • Sowmya S. Sundaram
  • Wolfgang Nejdl
  • Niloy Ganguly

External organisations

  • Indian Institute of Technology Kharagpur (IITKGP)

Details

Original language: English
Title of host publication: ECAI 2023
Subtitle: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023
Editors: Kobi Gal, Ann Nowe, Grzegorz J. Nalepa, Roy Fairstein, Roxana Radulescu
Pages: 2002-2009
Number of pages: 8
ISBN (electronic): 9781643684369
Publication status: Published - 2023
Event: 26th European Conference on Artificial Intelligence, ECAI 2023 - Krakow, Poland
Duration: 30 Sept. 2023 - 4 Oct. 2023

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Volume: 372
ISSN (print): 0922-6389
ISSN (electronic): 1879-8314

Abstract

Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
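For intuition, the span-selection step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the k-mer size (6, as in DNABert), the span length, and the exact span-scoring rule are simplifications, and all helper names (kmers, pair_npmi_table, select_mask_span) are hypothetical; the released code at https://github.com/roysoumya/GeneMask is authoritative. Here the NPMI of an adjacent token pair (x, y) is taken as ln(p(x,y) / (p(x) p(y))) / (-ln p(x,y)).

import math
import random
from collections import Counter

def kmers(seq, k=6):
    # Overlapping k-mer tokenization, as used by DNABert-style models.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def npmi(p_xy, p_x, p_y):
    # Normalized Pointwise Mutual Information, in [-1, 1].
    if p_xy <= 0.0:
        return -1.0
    if p_xy >= 1.0:
        return 1.0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def pair_npmi_table(corpus_tokens):
    # Estimate NPMI for every adjacent token pair from corpus counts.
    uni = Counter(corpus_tokens)
    bi = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    return {pair: npmi(c / n_bi, uni[pair[0]] / n_uni, uni[pair[1]] / n_uni)
            for pair, c in bi.items()}

def select_mask_span(tokens, table, span_len=5):
    # Draw a random mask center, then pick the span_len-token window
    # containing it whose summed adjacent-pair NPMI is highest.
    assert len(tokens) >= span_len
    center = random.randrange(len(tokens))
    starts = range(max(0, center - span_len + 1),
                   min(center, len(tokens) - span_len) + 1)
    def window_score(s):
        w = tokens[s:s + span_len]
        return sum(table.get(p, -1.0) for p in zip(w, w[1:]))
    start = max(starts, key=window_score)
    return start, start + span_len  # token index range [start, end) to mask

# Usage: in practice the NPMI table is built once over the pretraining corpus.
tokens = kmers("ACGTACGTAGCTAGCTACGATCGATCGTACGTTTAGGC")
table = pair_npmi_table(tokens)
start, end = select_mask_span(tokens, table)

Masking the whole highest-NPMI window removes the strongly co-occurring context that would otherwise make the masked tokens trivially predictable, which is the inefficiency the abstract attributes to uniform sliding-window masking.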


Cite

GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. / Roy, Soumyadeep; Wallat, Jonas; Sundaram, Sowmya S. et al.
ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. Ed. / Kobi Gal; Ann Nowe; Grzegorz J. Nalepa; Roy Fairstein; Roxana Radulescu. 2023. pp. 2002-2009 (Frontiers in Artificial Intelligence and Applications; Vol. 372).


Roy, S, Wallat, J, Sundaram, SS, Nejdl, W & Ganguly, N 2023, GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. in K Gal, A Nowe, GJ Nalepa, R Fairstein & R Radulescu (eds), ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. Frontiers in Artificial Intelligence and Applications, vol. 372, pp. 2002-2009, 26th European Conference on Artificial Intelligence, ECAI 2023, Krakow, Poland, 30 Sept. 2023. https://doi.org/10.48550/arXiv.2307.15933, https://doi.org/10.3233/FAIA230492
Roy, S., Wallat, J., Sundaram, S. S., Nejdl, W., & Ganguly, N. (2023). GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. In K. Gal, A. Nowe, G. J. Nalepa, R. Fairstein, & R. Radulescu (Eds.), ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023 (pp. 2002-2009). (Frontiers in Artificial Intelligence and Applications; Vol. 372). https://doi.org/10.48550/arXiv.2307.15933, https://doi.org/10.3233/FAIA230492
Roy S, Wallat J, Sundaram SS, Nejdl W, Ganguly N. GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. In Gal K, Nowe A, Nalepa GJ, Fairstein R, Radulescu R, editors, ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. 2023. p. 2002-2009. (Frontiers in Artificial Intelligence and Applications). doi: 10.48550/arXiv.2307.15933, 10.3233/FAIA230492
Roy, Soumyadeep ; Wallat, Jonas ; Sundaram, Sowmya S. et al. / GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. Ed. / Kobi Gal ; Ann Nowe ; Grzegorz J. Nalepa ; Roy Fairstein ; Roxana Radulescu. 2023. pp. 2002-2009 (Frontiers in Artificial Intelligence and Applications).
BibTeX
@inproceedings{8e00f339c30644bfae85f941d3384ce8,
title = "GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning",
abstract = "Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.",
author = "Soumyadeep Roy and Jonas Wallat and Sundaram, {Sowmya S.} and Wolfgang Nejdl and Niloy Ganguly",
note = "Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003. ; 26th European Conference on Artificial Intelligence, ECAI 2023 ; Conference date: 30-09-2023 Through 04-10-2023",
year = "2023",
doi = "10.48550/arXiv.2307.15933",
language = "English",
series = "Frontiers in Artificial Intelligence and Applications",
pages = "2002--2009",
editor = "Kobi Gal and Kobi Gal and Ann Nowe and Nalepa, {Grzegorz J.} and Roy Fairstein and Roxana Radulescu",
booktitle = "ECAI 2023",

}

RIS

TY - GEN

T1 - GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

T2 - 26th European Conference on Artificial Intelligence, ECAI 2023

AU - Roy, Soumyadeep

AU - Wallat, Jonas

AU - Sundaram, Sowmya S.

AU - Nejdl, Wolfgang

AU - Ganguly, Niloy

N1 - Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003.

PY - 2023

Y1 - 2023

N2 - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.

AB - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.

UR - http://www.scopus.com/inward/record.url?scp=85175804493&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2307.15933

DO - 10.48550/arXiv.2307.15933

M3 - Conference contribution

AN - SCOPUS:85175804493

T3 - Frontiers in Artificial Intelligence and Applications

SP - 2002

EP - 2009

BT - ECAI 2023

A2 - Gal, Kobi

A2 - Nowe, Ann

A2 - Nalepa, Grzegorz J.

A2 - Fairstein, Roy

A2 - Radulescu, Roxana

Y2 - 30 September 2023 through 4 October 2023

ER -
