Details
Original language | English |
---|---|
Title of host publication | Neural Information Processing |
Subtitle of host publication | 30th International Conference, ICONIP 2023, Changsha, China, November 20–23, 2023, Proceedings, Part I |
Editors | Biao Luo, Long Cheng, Zheng-Guang Wu, Hongyi Li, Chaojie Li |
Publisher | Springer Science and Business Media Deutschland GmbH |
Pages | 173-187 |
Number of pages | 15 |
ISBN (electronic) | 978-981-99-8079-6 |
ISBN (print) | 978-981-99-8078-9 |
Publication status | Published - 14 Nov 2023 |
Event | 30th International Conference on Neural Information Processing, ICONIP 2023 - Changsha, China. Duration: 20 Nov 2023 → 23 Nov 2023 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 14447 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (electronic) | 1611-3349 |
Abstract
Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT): it enables handling larger image sizes and mitigates the quadratic runtime of self-attention layers. It also captures global dependencies and relationships between patches, supporting effective image understanding and analysis. However, Convolutional Neural Networks (CNNs) continue to excel in scenarios with limited data availability, and their efficiency in memory usage and latency makes them particularly suitable for deployment on edge devices. Building on this, we propose Minape, a novel multimodal isotropic convolutional neural architecture that applies patch embedding to both time series and image data for classification. By employing isotropic models, Minape addresses the challenges posed by varying data sizes and complexities. It groups samples by modality type, creating two-dimensional representations that undergo linear embedding before being processed by a scalable isotropic convolutional network architecture. The outputs of these pathways are merged and fed to a temporal classifier. Experimental results demonstrate that Minape significantly outperforms existing approaches in accuracy while requiring fewer than 1M parameters and occupying less than 12 MB. This performance was observed on multimodal benchmark datasets and on the authors’ newly collected multi-dimensional multimodal dataset, Mude-streda, obtained from real industrial processing devices (link to code and dataset: https://github.com/hubtru/Minape).
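The pipeline outlined above (per-modality two-dimensional representations, linear patch embedding, a scalable isotropic convolutional stack, and a temporal classifier over the merged pathway outputs) can be illustrated with a minimal sketch. The PyTorch code below is a hypothetical reconstruction under assumed dimensions, module names, and concatenation-based fusion; it is not the authors’ implementation, which is available at https://github.com/hubtru/Minape.

```python
# Hypothetical sketch of the pipeline described in the abstract (not the authors' code).
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Linear patch embedding: split a 2-D input into patches and project each patch."""
    def __init__(self, in_channels, patch_size, dim):
        super().__init__()
        # A strided convolution is equivalent to a per-patch linear projection.
        self.proj = nn.Conv2d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, C, H, W)
        return self.proj(x)                     # (B, dim, H/ps, W/ps)

class IsotropicBlock(nn.Module):
    """Isotropic block: channel width and spatial resolution stay constant throughout."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depthwise
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)                          # pointwise
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw(self.act(self.norm(self.dw(x))))  # residual connection

class ModalityPathway(nn.Module):
    """Patch embedding followed by a stack of isotropic blocks for one modality."""
    def __init__(self, in_channels, patch_size, dim, depth):
        super().__init__()
        self.embed = PatchEmbed(in_channels, patch_size, dim)
        self.blocks = nn.Sequential(*[IsotropicBlock(dim) for _ in range(depth)])
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        x = self.blocks(self.embed(x))
        return self.pool(x).flatten(1)           # (B, dim)

class MinapeSketch(nn.Module):
    """Two pathways (image, time series rendered as a 2-D map) fused by concatenation."""
    def __init__(self, dim=64, depth=4, num_classes=10):
        super().__init__()
        self.image_path = ModalityPathway(in_channels=3, patch_size=8, dim=dim, depth=depth)
        self.series_path = ModalityPathway(in_channels=1, patch_size=4, dim=dim, depth=depth)
        # "Temporal classifier" placeholder: a GRU over the fused features of a short window.
        self.rnn = nn.GRU(input_size=2 * dim, hidden_size=dim, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images, series):           # images: (B, T, 3, H, W), series: (B, T, 1, h, w)
        T = images.shape[1]
        feats = []
        for t in range(T):                        # encode each time step independently
            f = torch.cat([self.image_path(images[:, t]), self.series_path(series[:, t])], dim=1)
            feats.append(f)
        seq = torch.stack(feats, dim=1)           # (B, T, 2*dim)
        _, h = self.rnn(seq)
        return self.head(h[-1])                   # (B, num_classes)

model = MinapeSketch()
out = model(torch.randn(2, 5, 3, 64, 64), torch.randn(2, 5, 1, 32, 32))
print(out.shape)  # torch.Size([2, 10])
```

The patch sizes, embedding width, depth, and GRU-based temporal head are illustrative assumptions chosen only to make the sketch runnable end to end.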
Keywords
- Isotropic Architecture
- Multimodal Classification
- Patch Embedding
- Time Series
ASJC Scopus subject areas
- Mathematics (all)
- Theoretical Computer Science
- Computer Science (all)
Cite this
Multimodal Isotropic Neural Architecture with Patch Embedding. / Truchan, Hubert; Naumov, Evgenii; Abedin, Rezaul; Palmer, Gregory; Ahmadi, Zahra. Neural Information Processing: 30th International Conference, ICONIP 2023, Changsha, China, November 20–23, 2023, Proceedings, Part I. ed. / Biao Luo; Long Cheng; Zheng-Guang Wu; Hongyi Li; Chaojie Li. Springer Science and Business Media Deutschland GmbH, 2023. p. 173-187 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14447 LNCS).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Multimodal Isotropic Neural Architecture with Patch Embedding
AU - Truchan, Hubert
AU - Naumov, Evgenii
AU - Abedin, Rezaul
AU - Palmer, Gregory
AU - Ahmadi, Zahra
PY - 2023/11/14
Y1 - 2023/11/14
N2 - Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT): it enables handling larger image sizes and mitigates the quadratic runtime of self-attention layers. It also captures global dependencies and relationships between patches, supporting effective image understanding and analysis. However, Convolutional Neural Networks (CNNs) continue to excel in scenarios with limited data availability, and their efficiency in memory usage and latency makes them particularly suitable for deployment on edge devices. Building on this, we propose Minape, a novel multimodal isotropic convolutional neural architecture that applies patch embedding to both time series and image data for classification. By employing isotropic models, Minape addresses the challenges posed by varying data sizes and complexities. It groups samples by modality type, creating two-dimensional representations that undergo linear embedding before being processed by a scalable isotropic convolutional network architecture. The outputs of these pathways are merged and fed to a temporal classifier. Experimental results demonstrate that Minape significantly outperforms existing approaches in accuracy while requiring fewer than 1M parameters and occupying less than 12 MB. This performance was observed on multimodal benchmark datasets and on the authors’ newly collected multi-dimensional multimodal dataset, Mude-streda, obtained from real industrial processing devices (link to code and dataset: https://github.com/hubtru/Minape).
AB - Patch embedding has been a significant advancement in Transformer-based models, particularly the Vision Transformer (ViT): it enables handling larger image sizes and mitigates the quadratic runtime of self-attention layers. It also captures global dependencies and relationships between patches, supporting effective image understanding and analysis. However, Convolutional Neural Networks (CNNs) continue to excel in scenarios with limited data availability, and their efficiency in memory usage and latency makes them particularly suitable for deployment on edge devices. Building on this, we propose Minape, a novel multimodal isotropic convolutional neural architecture that applies patch embedding to both time series and image data for classification. By employing isotropic models, Minape addresses the challenges posed by varying data sizes and complexities. It groups samples by modality type, creating two-dimensional representations that undergo linear embedding before being processed by a scalable isotropic convolutional network architecture. The outputs of these pathways are merged and fed to a temporal classifier. Experimental results demonstrate that Minape significantly outperforms existing approaches in accuracy while requiring fewer than 1M parameters and occupying less than 12 MB. This performance was observed on multimodal benchmark datasets and on the authors’ newly collected multi-dimensional multimodal dataset, Mude-streda, obtained from real industrial processing devices (link to code and dataset: https://github.com/hubtru/Minape).
KW - Isotropic Architecture
KW - Multimodal Classification
KW - Patch Embedding
KW - Time Series
UR - http://www.scopus.com/inward/record.url?scp=85181982796&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-8079-6_14
DO - 10.1007/978-981-99-8079-6_14
M3 - Conference contribution
AN - SCOPUS:85181982796
SN - 9789819980789
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 173
EP - 187
BT - Neural Information Processing
A2 - Luo, Biao
A2 - Cheng, Long
A2 - Wu, Zheng-Guang
A2 - Li, Hongyi
A2 - Li, Chaojie
PB - Springer Science and Business Media Deutschland GmbH
T2 - 30th International Conference on Neural Information Processing, ICONIP 2023
Y2 - 20 November 2023 through 23 November 2023
ER -