GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

Publication: Contribution to book/report/anthology/conference proceedings › Conference paper › Research › Peer-reviewed

Authors

  • Soumyadeep Roy
  • Jonas Wallat
  • Sowmya S. Sundaram
  • Wolfgang Nejdl
  • Niloy Ganguly

External organisations

  • Indian Institute of Technology Kharagpur (IITKGP)

Details

Original language: English
Title of host publication: ECAI 2023
Subtitle: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023
Editors: Kobi Gal, Ann Nowe, Grzegorz J. Nalepa, Roy Fairstein, Roxana Radulescu
Pages: 2002-2009
Number of pages: 8
ISBN (electronic): 9781643684369
Publication status: Published - 2023
Event: 26th European Conference on Artificial Intelligence, ECAI 2023 - Krakow, Poland
Duration: 30 Sept. 2023 - 4 Oct. 2023

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Volume: 372
ISSN (print): 0922-6389
ISSN (electronic): 1879-8314

Abstract

Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
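For intuition, the span-selection step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the k-mer size (6, as in DNABert), the span length, and the exact span-scoring rule are simplifications, and all helper names (kmers, pair_npmi_table, select_mask_span) are hypothetical; the released code at https://github.com/roysoumya/GeneMask is authoritative. Here the NPMI of an adjacent token pair (x, y) is taken as ln(p(x,y) / (p(x) p(y))) / (-ln p(x,y)).

import math
import random
from collections import Counter

def kmers(seq, k=6):
    # Overlapping k-mer tokenization, as used by DNABert-style models.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def npmi(p_xy, p_x, p_y):
    # Normalized Pointwise Mutual Information, in [-1, 1].
    if p_xy <= 0.0:
        return -1.0
    if p_xy >= 1.0:
        return 1.0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

def pair_npmi_table(corpus_tokens):
    # Estimate NPMI for every adjacent token pair from corpus counts.
    uni = Counter(corpus_tokens)
    bi = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    return {pair: npmi(c / n_bi, uni[pair[0]] / n_uni, uni[pair[1]] / n_uni)
            for pair, c in bi.items()}

def select_mask_span(tokens, table, span_len=5):
    # Draw a random mask center, then pick the span_len-token window
    # containing it whose summed adjacent-pair NPMI is highest.
    assert len(tokens) >= span_len
    center = random.randrange(len(tokens))
    starts = range(max(0, center - span_len + 1),
                   min(center, len(tokens) - span_len) + 1)
    def window_score(s):
        w = tokens[s:s + span_len]
        return sum(table.get(p, -1.0) for p in zip(w, w[1:]))
    start = max(starts, key=window_score)
    return start, start + span_len  # token index range [start, end) to mask

# Usage: in practice the NPMI table is built once over the pretraining corpus.
tokens = kmers("ACGTACGTAGCTAGCTACGATCGATCGTACGTTTAGGC")
table = pair_npmi_table(tokens)
start, end = select_mask_span(tokens, table)

Masking the whole highest-NPMI window removes the strongly co-occurring context that would otherwise make the masked tokens trivially predictable, which is the inefficiency the abstract attributes to uniform sliding-window masking.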


Cite

GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. / Roy, Soumyadeep; Wallat, Jonas; Sundaram, Sowmya S. et al.
ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. Ed. / Kobi Gal; Ann Nowe; Grzegorz J. Nalepa; Roy Fairstein; Roxana Radulescu. 2023. pp. 2002-2009 (Frontiers in Artificial Intelligence and Applications; Vol. 372).


Roy, S, Wallat, J, Sundaram, SS, Nejdl, W & Ganguly, N 2023, GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. in K Gal, A Nowe, GJ Nalepa, R Fairstein & R Radulescu (eds), ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. Frontiers in Artificial Intelligence and Applications, vol. 372, pp. 2002-2009, 26th European Conference on Artificial Intelligence, ECAI 2023, Krakow, Poland, 30 Sept. 2023. https://doi.org/10.48550/arXiv.2307.15933, https://doi.org/10.3233/FAIA230492
Roy, S., Wallat, J., Sundaram, S. S., Nejdl, W., & Ganguly, N. (2023). GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. In K. Gal, A. Nowe, G. J. Nalepa, R. Fairstein, & R. Radulescu (Eds.), ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023 (pp. 2002-2009). (Frontiers in Artificial Intelligence and Applications; Vol. 372). https://doi.org/10.48550/arXiv.2307.15933, https://doi.org/10.3233/FAIA230492
Roy S, Wallat J, Sundaram SS, Nejdl W, Ganguly N. GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. In Gal K, Nowe A, Nalepa GJ, Fairstein R, Radulescu R, editors, ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. 2023. p. 2002-2009. (Frontiers in Artificial Intelligence and Applications). doi: 10.48550/arXiv.2307.15933, 10.3233/FAIA230492
Roy, Soumyadeep ; Wallat, Jonas ; Sundaram, Sowmya S. et al. / GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning. ECAI 2023: 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. Ed. / Kobi Gal ; Ann Nowe ; Grzegorz J. Nalepa ; Roy Fairstein ; Roxana Radulescu. 2023. pp. 2002-2009 (Frontiers in Artificial Intelligence and Applications).
BibTeX
@inproceedings{8e00f339c30644bfae85f941d3384ce8,
title = "GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning",
abstract = "Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.",
author = "Soumyadeep Roy and Jonas Wallat and Sundaram, {Sowmya S.} and Wolfgang Nejdl and Niloy Ganguly",
note = "Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003. ; 26th European Conference on Artificial Intelligence, ECAI 2023 ; Conference date: 30-09-2023 Through 04-10-2023",
year = "2023",
doi = "10.48550/arXiv.2307.15933",
language = "English",
series = "Frontiers in Artificial Intelligence and Applications",
pages = "2002--2009",
editor = "Kobi Gal and Kobi Gal and Ann Nowe and Nalepa, {Grzegorz J.} and Roy Fairstein and Roxana Radulescu",
booktitle = "ECAI 2023",

}

RIS

TY - GEN

T1 - GENEMASK: Fast Pretraining of Gene Sequences to Enable Few-Shot Learning

T2 - 26th European Conference on Artificial Intelligence, ECAI 2023

AU - Roy, Soumyadeep

AU - Wallat, Jonas

AU - Sundaram, Sowmya S.

AU - Nejdl, Wolfgang

AU - Ganguly, Niloy

N1 - Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003.

PY - 2023

Y1 - 2023

N2 - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.

AB - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.

UR - http://www.scopus.com/inward/record.url?scp=85175804493&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2307.15933

DO - 10.48550/arXiv.2307.15933

M3 - Conference contribution

AN - SCOPUS:85175804493

T3 - Frontiers in Artificial Intelligence and Applications

SP - 2002

EP - 2009

BT - ECAI 2023

A2 - Gal, Kobi

A2 - Nowe, Ann

A2 - Nalepa, Grzegorz J.

A2 - Fairstein, Roy

A2 - Radulescu, Roxana

Y2 - 30 September 2023 through 4 October 2023

ER -
