Details
Original language | English |
---|---|
Title of host publication | ECAI 2023 |
Subtitle of host publication | 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023 |
Editors | Kobi Gal, Ann Nowe, Grzegorz J. Nalepa, Roy Fairstein, Roxana Radulescu |
Pages | 2002-2009 |
Number of pages | 8 |
ISBN (electronic) | 9781643684369 |
Publication status | Published - 2023 |
Event | 26th European Conference on Artificial Intelligence, ECAI 2023 - Krakow, Poland; Duration: 30 Sept 2023 → 4 Oct 2023 |
Publication series
Name | Frontiers in Artificial Intelligence and Applications |
---|---|
Volume | 372 |
ISSN (Print) | 0922-6389 |
ISSN (electronic) | 1879-8314 |
Abstract
Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
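The abstract describes the masking rule only at a high level. The sketch below illustrates the general idea in Python: pick random mask centers in a k-mer token stream, then mask the fixed-length span around each center whose adjacent-token NPMI is highest. This is a minimal illustration under simplifying assumptions, not the authors' implementation; the function names (`kmer_tokens`, `npmi_table`, `select_mask_spans`), the fixed span length, and the adjacent-pair NPMI estimate are hypothetical choices made here for brevity. The released code at https://github.com/roysoumya/GeneMask is the authoritative reference.

```python
# Illustrative sketch of a GeneMask-style masking rule; NOT the authors' code.
import math
import random
from collections import Counter
from itertools import islice


def kmer_tokens(seq, k=6):
    """Tokenize a DNA string into overlapping k-mers (sliding window, stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def npmi_table(corpus_tokens):
    """Estimate NPMI for adjacent token pairs from a token corpus.

    NPMI(x, y) = PMI(x, y) / (-log p(x, y)), bounded in [-1, 1].
    """
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, islice(corpus_tokens, 1, None)))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    table = {}
    for (x, y), c in bigrams.items():
        p_xy = c / n_bi
        p_x, p_y = unigrams[x] / n_uni, unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        table[(x, y)] = pmi / (-math.log(p_xy))
    return table


def select_mask_spans(tokens, table, n_centers=3, span=5):
    """Pick random mask centers, then keep the span around each center whose
    summed adjacent-token NPMI is highest (a simplified stand-in for the paper's rule)."""
    spans = []
    for center in random.sample(range(len(tokens)), k=min(n_centers, len(tokens))):
        best, best_score = None, -float("inf")
        for start in range(max(0, center - span + 1),
                           min(center + 1, len(tokens) - span + 1)):
            window_toks = tokens[start:start + span]
            score = sum(table.get(pair, 0.0)
                        for pair in zip(window_toks, window_toks[1:]))
            if score > best_score:
                best, best_score = (start, start + span), score
        if best:
            spans.append(best)
    return spans  # token index ranges to replace with [MASK] during MLM


if __name__ == "__main__":
    seq = "ACGTACGTGGCCAATTACGTACGT" * 4
    toks = kmer_tokens(seq, k=6)
    print(select_mask_spans(toks, npmi_table(toks)))
```

In an actual MLM setup, the returned token ranges would be replaced by the tokenizer's [MASK] token before pretraining; the paper's pipeline, span handling, and NPMI estimation may differ from this toy version.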
ASJC Scopus subject areas
- Computer Science (all)
- Artificial Intelligence
Cite this
- Standard
- Harvard
- APA
- Vancouver
- BibTeX
- RIS
ECAI 2023 : 26th European Conference on Artificial Intelligence, including 12th Conference on Prestigious Applications of Intelligent Systems, PAIS 2023. ed. / Kobi Gal; Ann Nowe; Grzegorz J. Nalepa; Roy Fairstein; Roxana Radulescu. 2023. p. 2002-2009 (Frontiers in Artificial Intelligence and Applications; Vol. 372).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer-review
TY - GEN
T1 - GENEMASK
T2 - 26th European Conference on Artificial Intelligence, ECAI 2023
AU - Roy, Soumyadeep
AU - Wallat, Jonas
AU - Sundaram, Sowmya S.
AU - Nejdl, Wolfgang
AU - Ganguly, Niloy
N1 - Funding Information: Soumyadeep Roy is supported by the Institute Ph.D. Fellowship at the Indian Institute of Technology Kharagpur. Soumyadeep Roy and Niloy Ganguly were also affiliated with L3S Research Center, Germany while conducting this work. This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor with grant No. 01DD20003.
PY - 2023
Y1 - 2023
N2 - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
AB - Large-scale language models such as DNABert and LOGO aim to learn optimal gene representations and are trained on the entire Human Reference Genome. However, standard tokenization schemes involve a simple sliding window of tokens like k-mers that do not leverage any gene-based semantics and thus may lead to (trivial) masking of easily predictable sequences, and subsequently inefficient Masked Language Modeling (MLM) training. Therefore, we propose a novel masking algorithm, GENEMASK, for MLM training of gene sequences, where we randomly identify positions in a gene sequence as mask centers and locally select the span around the mask center with the highest Normalized Pointwise Mutual Information (NPMI) to mask. We observe that in the absence of human-understandable semantics in the genomics domain (in contrast, semantic units like words and phrases are inherently available in NLP), GENEMASK-based models substantially outperform the SOTA models (DNABert and LOGO) over four benchmark gene sequence classification datasets in five few-shot settings (10 to 1000-shot). More significantly, the GENEMASK-based DNABert model is trained for less than one-tenth of the number of epochs of the original SOTA model. We also observe a strong correlation between top-ranked PMI tokens and conserved DNA sequence motifs, which may indicate the incorporation of latent genomic information. The codes (including trained models) and datasets are made publicly available at https://github.com/roysoumya/GeneMask.
UR - http://www.scopus.com/inward/record.url?scp=85175804493&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2307.15933
DO - 10.48550/arXiv.2307.15933
M3 - Conference contribution
AN - SCOPUS:85175804493
T3 - Frontiers in Artificial Intelligence and Applications
SP - 2002
EP - 2009
BT - ECAI 2023
A2 - Gal, Kobi
A2 - Nowe, Ann
A2 - Nalepa, Grzegorz J.
A2 - Fairstein, Roy
A2 - Radulescu, Roxana
Y2 - 30 September 2023 through 4 October 2023
ER -