Details
Original language | English
---|---
Title of host publication | IEEE International Conference on Acoustics, Speech and Signal Processing
Subtitle of host publication | ICASSP 2023
Publisher | Institute of Electrical and Electronics Engineers Inc.
Number of pages | 5
ISBN (electronic) | 978-1-7281-6327-7
ISBN (print) | 978-1-7281-6328-4
Publication status | Published - 2023
Event | 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece. Duration: 4 Jun 2023 → 10 Jun 2023
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
---|---
Volume | 2023-June
ISSN (print) | 1520-6149
Abstract
Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is valuable in helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality mitigates the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing the model at different depths achieves adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to shorter training time and state-of-the-art accuracy on data with small amounts of label information.
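The self-distillation setup described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch, assuming a generic transformer encoder with classifier heads attached at several depths, where the deepest head serves as the teacher for the shallower exits. The backbone, exit placement, head design, and loss weights here are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal self-distillation sketch for SER (illustrative assumptions; not the
# paper's exact backbone, exit placement, or loss weighting).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistilSER(nn.Module):
    """Encoder with exit heads at several depths; the deepest head is the teacher."""

    def __init__(self, dim=256, n_layers=8, exits=(2, 4, 6, 8), n_emotions=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        self.exits = exits
        self.heads = nn.ModuleList([nn.Linear(dim, n_emotions) for _ in exits])

    def forward(self, x, max_exit=None):
        # x: (batch, time, dim) acoustic features. Returns one logits tensor per
        # exit, shallowest first; stop early at `max_exit` for cheap inference.
        logits = []
        for depth, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if depth in self.exits:
                head = self.heads[self.exits.index(depth)]
                logits.append(head(x.mean(dim=1)))  # mean-pool over time
                if max_exit is not None and len(logits) == max_exit:
                    break
        return logits

def self_distillation_loss(logits, labels, alpha=0.5, tau=2.0):
    # Cross-entropy on every exit, plus a temperature-scaled KL term pulling
    # each shallow exit towards the (detached) deepest exit.
    teacher = logits[-1].detach()
    loss = sum(F.cross_entropy(out, labels) for out in logits)
    for student in logits[:-1]:
        loss = loss + alpha * tau ** 2 * F.kl_div(
            F.log_softmax(student / tau, dim=-1),
            F.softmax(teacher / tau, dim=-1),
            reduction="batchmean",
        )
    return loss

# Usage: train with all exits; at inference, pick an exit per latency budget,
# which is the adaptive accuracy-efficiency trade-off named in benefit (2).
model = SelfDistilSER()
features = torch.randn(8, 100, 256)          # dummy batch of acoustic frames
labels = torch.randint(0, 4, (8,))
loss = self_distillation_loss(model(features), labels)
loss.backward()
fast_pred = model(features, max_exit=1)[-1]  # shallowest exit only
```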
Keywords
- adaptive inference
- efficient deep learning
- efficient edge analytics
- self-distillation
- speech emotion recognition
ASJC Scopus subject areas
- Computer Science (all)
- Software
- Signal Processing
- Engineering (all)
- Electrical and Electronic Engineering
Cite this
Ren, Z., Chang, Y., Schuller, B. W., & Nguyen, T. T. (2023). Fast Yet Effective Speech Emotion Recognition with Self-Distillation. In IEEE International Conference on Acoustics, Speech and Signal Processing: ICASSP 2023 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.48550/arXiv.2210.14636
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Fast Yet Effective Speech Emotion Recognition with Self-Distillation
AU - Ren, Zhao
AU - Chang, Yi
AU - Schuller, Björn W.
AU - Nguyen, Thanh Tam
N1 - Funding Information: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), and by the research project "IIP-Ecosphere", funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) under funding code 01MK20006A.
PY - 2023
Y1 - 2023
N2 - Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is valuable in helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality mitigates the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing the model at different depths achieves adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to shorter training time and state-of-the-art accuracy on data with small amounts of label information.
AB - Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is valuable in helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality mitigates the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing the model at different depths achieves adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to shorter training time and state-of-the-art accuracy on data with small amounts of label information.
KW - adaptive inference
KW - efficient deep learning
KW - efficient edge analytics
KW - self-distillation
KW - speech emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85176292202&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2210.14636
DO - 10.48550/arXiv.2210.14636
M3 - Conference contribution
AN - SCOPUS:85176292202
SN - 978-1-7281-6328-4
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - IEEE International Conference on Acoustics, Speech and Signal Processing
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2 - 4 June 2023 through 10 June 2023
ER -