GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Soumyadeep Roy
  • Jonas Wallat
  • Sowmya S. Sundaram
  • Wolfgang Nejdl
  • Niloy Ganguly

Research Organisations

  • L3S Research Center

External Research Organisations

  • Indian Institute of Technology Kharagpur (IITKGP)

Details

Original language: English
Title of host publication: ECAI 2023
Subtitle of host publication: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023
Editors: Kobi Gal, Ann Nowe, Grzegorz J. Nalepa, Roy Fairstein, Roxana Radulescu
Pages: 2002-2009
Number of pages: 8
ISBN (electronic): 9781643684369
Publication status: Published - 2023
Event: 26th European Conference on Artificial Intelligence, ECAI 2023 - Krakow, Poland
Duration: 30 Sept 2023 to 4 Oct 2023

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Volume: 372
ISSN (Print): 0922-6389
ISSN (electronic): 1879-8314

Abstract

Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
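
To make the masking rule above concrete, here is a minimal, hypothetical Python sketch of NPMI-guided span masking. The function names (kmer_tokenize, npmi_scores, genemask_spans), the adjacent-pair NPMI estimate, and the window-sum span scoring are illustrative assumptions, not the authors' implementation; the actual GENEMASK code is in the linked GitHub repository.

import math
import random
from collections import Counter

def kmer_tokenize(seq, k=6):
    # Overlapping k-mer tokens from a DNA string (DNABert-style sliding window).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def npmi_scores(corpus_tokens):
    # Normalized PMI of adjacent token pairs, estimated from simple corpus counts.
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi = math.log(p_ab / (p_a * p_b))
        scores[(a, b)] = pmi / max(-math.log(p_ab), 1e-9)  # NPMI, bounded in [-1, 1]
    return scores

def genemask_spans(tokens, scores, mask_rate=0.15, span=5, seed=0):
    # Randomly pick mask centers, then mask the span-length window around each
    # center whose summed adjacent-pair NPMI is highest (illustrative scoring rule).
    rng = random.Random(seed)
    n_centers = max(1, int(len(tokens) * mask_rate / span))
    masked = set()
    for center in rng.sample(range(len(tokens)), n_centers):
        best_start, best_score = None, float("-inf")
        for start in range(max(0, center - span + 1), min(center, len(tokens) - span) + 1):
            window = tokens[start:start + span]
            score = sum(scores.get(pair, 0.0) for pair in zip(window, window[1:]))
            if score > best_score:
                best_start, best_score = start, score
        if best_start is not None:
            masked.update(range(best_start, best_start + span))
    return sorted(masked)

# Toy usage: estimate NPMI on a tiny "corpus", then choose token positions of one
# sequence to replace with [MASK] for MLM pretraining.
corpus_tokens = kmer_tokenize("ATGCGTACGTTAGCATGCGTACGTTAGCATGCGTACG")
tokens = kmer_tokenize("ATGCGTACGTTAGCATGCGT")
print(genemask_spans(tokens, npmi_scores(corpus_tokens)))

In the full GENEMASK pipeline, the selected positions would feed a DNABert-style masked language modeling objective; the sketch only illustrates the center-then-highest-NPMI-span selection step described in the abstract.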

Cite this

GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. / Roy, Soumyadeep; Wallat, Jonas; Sundaram, Sowmya S. et al.
ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. ed. / Kobi Gal; Ann Nowe; Grzegorz J. Nalepa; Roy Fairstein; Roxana Radulescu. 2023. p. 2002-2009 (Frontiers in Artificial Intelligence and Applications; Vol. 372).

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Roy, S, Wallat, J, Sundaram, SS, Nejdl, W & Ganguly, N 2023, GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. in K Gal, A Nowe, GJ Nalepa, R Fairstein & R Radulescu (eds), ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. Frontiers in Artificial Intelligence and Applications, vol. 372, pp. 2002-2009, 26th European Conference on Artificial Intelligence, ECAI 2023, Krakow, Poland, 30 Sept 2023. https://doi.org/10.48550/arXiv.2307.15933, https://doi.org/10.3233/FAIA230492
Roy, S., Wallat, J., Sundaram, S. S., Nejdl, W., & Ganguly, N. (2023). GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. In K. Gal, A. Nowe, G. J. Nalepa, R. Fairstein, & R. Radulescu (Eds.), ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023 (pp. 2002-2009). (Frontiers in Artificial Intelligence and Applications; Vol. 372). https://doi.org/10.48550/arXiv.2307.15933, https://doi.org/10.3233/FAIA230492
Roy S, Wallat J, Sundaram SS, Nejdl W, Ganguly N. GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. In Gal K, Nowe A, Nalepa GJ, Fairstein R, Radulescu R, editors, ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. 2023. p. 2002-2009. (Frontiers in Artificial Intelligence and Applications). doi: 10.48550/arXiv.2307.15933, 10.3233/FAIA230492
Roy, Soumyadeep ; Wallat, Jonas ; Sundaram, Sowmya S. et al. / GENEMASK : Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. ECAI 2023 : 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. editor / Kobi Gal ; Ann Nowe ; Grzegorz J. Nalepa ; Roy Fairstein ; Roxana Radulescu. 2023. pp. 2002-2009 (Frontiers in Artificial Intelligence and Applications).
BibTeX
@inproceedings{8e00f339c30644bfae85f941d3384ce8,
title = "GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning",
abstract = "Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.",
author = "Soumyadeep Roy and Jonas Wallat and Sundaram, {Sowmya S.} and Wolfgang Nejdl and Niloy Ganguly",
note = "Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003. ; 26th European Conference on Artificial Intelligence, ECAI 2023 ; Conference date: 30-09-2023 Through 04-10-2023",
year = "2023",
doi = "10.48550/arXiv.2307.15933",
language = "English",
series = "Frontiers in Artificial Intelligence and Applications",
pages = "2002--2009",
editor = "Kobi Gal and Kobi Gal and Ann Nowe and Nalepa, {Grzegorz J.} and Roy Fairstein and Roxana Radulescu",
booktitle = "ECAI 2023",

}

RIS

TY - GEN

T1 - GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

T2 - 26th European Conference on Artificial Intelligence, ECAI 2023

AU - Roy, Soumyadeep

AU - Wallat, Jonas

AU - Sundaram, Sowmya S.

AU - Nejdl, Wolfgang

AU - Ganguly, Niloy

N1 - Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003.

PY - 2023

Y1 - 2023

N2 - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.

AB - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.

UR - http://www.scopus.com/inward/record.url?scp=85175804493&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2307.15933

DO - 10.48550/arXiv.2307.15933

M3 - Conference contribution

AN - SCOPUS:85175804493

T3 - Frontiers in Artificial Intelligence and Applications

SP - 2002

EP - 2009

BT - ECAI 2023

A2 - Gal, Kobi

A2 - Nowe, Ann

A2 - Nalepa, Grzegorz J.

A2 - Fairstein, Roy

A2 - Radulescu, Roxana

Y2 - 30 September 2023 through 4 October 2023

ER -
