GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Soumyadeep Roy
  • Jonas Wallat
  • Sowmya S. Sundaram
  • Wolfgang Nejdl
  • Niloy Ganguly

External Research Organisations

  • Indian Institute of Technology Kharagpur (IITKGP)

Details

Original language: English
Title of host publication: ECAI 2023
Subtitle of host publication: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023
Editors: Kobi Gal, Ann Nowe, Grzegorz J. Nalepa, Roy Fairstein, Roxana Radulescu
Pages: 2002-2009
Number of pages: 8
ISBN (electronic): 9781643684369
Publication status: Published - 2023
Event: 26th European Conference on Artificial Intelligence, ECAI 2023 - Krakow, Poland
Duration: 30 Sept 2023 to 4 Oct 2023

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Volume: 372
ISSN (Print): 0922-6389
ISSN (electronic): 1879-8314

Abstract

Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens, such as k-mers, that fails to leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences and, consequently, inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around each mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The code (including trained models) and datasets are publicly available at https://github.com/roysoumya/GeneMask.
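To make the masking procedure described above concrete, here is a minimal Python sketch of NPMI-guided span masking. Everything in it (the function names, the fixed span length, the candidate window around each mask center, and scoring a span by the average NPMI of its adjacent token pairs) is an illustrative assumption rather than the authors' implementation; the official code lives in the repository linked above. For an adjacent token pair (a, b), NPMI(a, b) = ln(p(a, b) / (p(a) p(b))) / (-ln p(a, b)), which ranges from -1 to 1.

# Minimal sketch of NPMI-guided span masking in the spirit of GENEMASK.
# All names and hyperparameters are illustrative assumptions, not the
# authors' implementation (see https://github.com/roysoumya/GeneMask).
import math
import random

def npmi(pair_count, count_a, count_b, total):
    # NPMI of an adjacent token pair: ln(p(a,b) / (p(a) p(b))) / (-ln p(a,b))
    if pair_count == 0:
        return -1.0  # tokens never co-occur: minimum NPMI
    p_ab = pair_count / total
    p_a, p_b = count_a / total, count_b / total
    return math.log(p_ab / (p_a * p_b)) / (-math.log(p_ab))

def span_score(tokens, start, end, unigram, bigram, total):
    # Score a candidate span [start, end) by the mean NPMI of its adjacent pairs.
    pairs = [(tokens[i], tokens[i + 1]) for i in range(start, end - 1)]
    return sum(npmi(bigram.get(p, 0), unigram.get(p[0], 1), unigram.get(p[1], 1), total)
               for p in pairs) / len(pairs)

def genemask_positions(tokens, unigram, bigram, total,
                       mask_rate=0.15, span_len=5, window=3, rng=random):
    # Randomly draw mask centers, then mask the span_len-token span near each
    # center whose corpus-level NPMI score is highest.
    n_centers = max(1, int(len(tokens) * mask_rate / span_len))
    masked = set()
    for _ in range(n_centers):
        center = rng.randrange(window + span_len, len(tokens) - window - span_len)
        candidates = [(center + off, center + off + span_len)
                      for off in range(-window, window + 1)]
        start, end = max(candidates,
                         key=lambda se: span_score(tokens, *se, unigram, bigram, total))
        masked.update(range(start, end))
    return sorted(masked)  # token indices to replace with [MASK] during MLM

if __name__ == "__main__":
    # Toy usage: overlapping 3-mers of a short DNA string; in practice the
    # counts would come from the whole pretraining corpus.
    seq = "ATGCGTACGTTAGCATGCGTACGTTAGCATGCGTACG"
    toks = [seq[i:i + 3] for i in range(len(seq) - 2)]
    uni, bi = {}, {}
    for t in toks:
        uni[t] = uni.get(t, 0) + 1
    for p in zip(toks, toks[1:]):
        bi[p] = bi.get(p, 0) + 1
    print(genemask_positions(toks, uni, bi, len(toks)))

With span_len = 1 and window = 0 this degenerates to ordinary random token masking; the NPMI criterion instead steers masking toward statistically cohesive spans, so the model cannot trivially reconstruct a masked token from its overlapping, highly correlated neighbours.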

Cite this

GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. / Roy, Soumyadeep; Wallat, Jonas; Sundaram, Sowmya S. et al.
ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. ed. / Kobi Gal; Ann Nowe; Grzegorz J. Nalepa; Roy Fairstein; Roxana Radulescu. 2023. p. 2002-2009 (Frontiers in Artificial Intelligence and Applications; Vol. 372).

Roy, S, Wallat, J, Sundaram, SS, Nejdl, W & Ganguly, N 2023, GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. in K Gal, A Nowe, GJ Nalepa, R Fairstein & R Radulescu (eds), ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. Frontiers in Artificial Intelligence and Applications, vol. 372, pp. 2002-2009, 26th European Conference on Artificial Intelligence, ECAI 2023, Krakow, Poland, 30 Sept 2023. https://doi.org/10.48550/arXiv.2307.15933, https://doi.org/10.3233/FAIA230492
Roy, S., Wallat, J., Sundaram, S. S., Nejdl, W., & Ganguly, N. (2023). GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. In K. Gal, A. Nowe, G. J. Nalepa, R. Fairstein, & R. Radulescu (Eds.), ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023 (pp. 2002-2009). (Frontiers in Artificial Intelligence and Applications; Vol. 372). https://doi.org/10.48550/arXiv.2307.15933, https://doi.org/10.3233/FAIA230492
Roy S, Wallat J, Sundaram SS, Nejdl W, Ganguly N. GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. In Gal K, Nowe A, Nalepa GJ, Fairstein R, Radulescu R, editors, ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. 2023. p. 2002-2009. (Frontiers in Artificial Intelligence and Applications). doi: 10.48550/arXiv.2307.15933, 10.3233/FAIA230492
Roy, Soumyadeep ; Wallat, Jonas ; Sundaram, Sowmya S. et al. / GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. editor / Kobi Gal ; Ann Nowe ; Grzegorz J. Nalepa ; Roy Fairstein ; Roxana Radulescu. 2023. pp. 2002-2009 (Frontiers in Artificial Intelligence and Applications).
@inproceedings{8e00f339c30644bfae85f941d3384ce8,
title = "GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning",
abstract = "Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens, such as k-mers, that fails to leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences and, consequently, inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around each mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The code (including trained models) and datasets are publicly available at https://github.com/roysoumya/GeneMask.",
author = "Soumyadeep Roy and Jonas Wallat and Sundaram, {Sowmya S.} and Wolfgang Nejdl and Niloy Ganguly",
note = "Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany, while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003.; 26th European Conference on Artificial Intelligence, ECAI 2023; Conference date: 30-09-2023 through 04-10-2023",
year = "2023",
doi = "10.48550/arXiv.2307.15933",
language = "English",
series = "Frontiers in Artificial Intelligence and Applications",
pages = "2002--2009",
editor = "Kobi Gal and Ann Nowe and Nalepa, {Grzegorz J.} and Roy Fairstein and Roxana Radulescu",
booktitle = "ECAI 2023",

}

TY - GEN

T1 - GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

T2 - 26th European Conference on Artificial Intelligence, ECAI 2023

AU - Roy, Soumyadeep

AU - Wallat, Jonas

AU - Sundaram, Sowmya S.

AU - Nejdl, Wolfgang

AU - Ganguly, Niloy

N1 - Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany, while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003.

PY - 2023

Y1 - 2023

N2 - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens, such as k-mers, that fails to leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences and, consequently, inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around each mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The code (including trained models) and datasets are publicly available at https://github.com/roysoumya/GeneMask.

AB - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens, such as k-mers, that fails to leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences and, consequently, inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around each mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The code (including trained models) and datasets are publicly available at https://github.com/roysoumya/GeneMask.

UR - http://www.scopus.com/inward/record.url?scp=85175804493&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2307.15933

DO - 10.48550/arXiv.2307.15933

M3 - Conference contribution

AN - SCOPUS:85175804493

T3 - Frontiers in Artificial Intelligence and Applications

SP - 2002

EP - 2009

BT - ECAI 2023

A2 - Gal, Kobi

A2 - Nowe, Ann

A2 - Nalepa, Grzegorz J.

A2 - Fairstein, Roy

A2 - Radulescu, Roxana

Y2 - 30 September 2023 through 4 October 2023

ER -
