Details
Original language | English |
---|---|
Title of host publication | IEEE International Conference on Acoustics, Speech and Signal Processing |
Subtitle | ICASSP 2023 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Number of pages | 5 |
ISBN (electronic) | 9781728163277 |
ISBN (print) | 978-1-7281-6328-4 |
Publication status | Published - 2023 |
Event | 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece. Duration: 4 June 2023 → 10 June 2023 |
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings |
---|---|
Volume | 2023-June |
ISSN (Print) | 1520-6149 |
Abstract
Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is highly relevant for helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting the self-distillation method on the acoustic modality breaks through the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing powerful models at different depths can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process, rather than training from scratch, for self-distillation leads to faster learning and state-of-the-art accuracy on data with small quantities of label information.
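The training objective sketched in the abstract (the deepest exit of a fine-tuned network teaching its own shallower exits, so any exit can later run alone on an edge device) can be illustrated with a minimal NumPy sketch. The function names, the α weighting, and the distillation temperature below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, numerically stabilised."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_distillation_loss(exit_logits, labels, temperature=2.0, alpha=0.5):
    """Joint objective over a multi-exit network (exits ordered shallow -> deep):
    every exit pays cross-entropy against the hard labels, and the deepest
    exit's softened predictions are distilled into each exit via KL divergence."""
    teacher = softmax(exit_logits[-1], temperature)  # deepest exit as teacher
    n = len(labels)
    total = 0.0
    for logits in exit_logits:
        probs = softmax(logits)
        # hard-label cross-entropy at this exit
        ce = -np.log(probs[np.arange(n), labels] + 1e-12).mean()
        # soft-label KL term against the deepest exit
        student = softmax(logits, temperature)
        kl = (teacher * (np.log(teacher + 1e-12)
                         - np.log(student + 1e-12))).sum(axis=-1).mean()
        total += (1.0 - alpha) * ce + alpha * temperature**2 * kl
    return total / len(exit_logits)
```

At inference time only the layers up to a chosen exit are executed, which is how the adaptive accuracy-efficiency trade-off in benefit (2) would be realised.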
ASJC Scopus subject areas
- Computer Science (all)
- Software
- Computer Science (all)
- Signal Processing
- Engineering (all)
- Electrical and Electronic Engineering
Cite
Ren, Z., Chang, Y., Schuller, B. W., & Nguyen, T. T. (2023). Fast Yet Effective Speech Emotion Recognition with Self-Distillation. In IEEE International Conference on Acoustics, Speech and Signal Processing: ICASSP 2023. Institute of Electrical and Electronics Engineers Inc. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Volume 2023-June).
Publication: Chapter in book/report/conference proceedings › Conference contribution › Research › Peer-reviewed
TY - GEN
T1 - Fast Yet Effective Speech Emotion Recognition with Self-Distillation
AU - Ren, Zhao
AU - Chang, Yi
AU - Schuller, Björn W.
AU - Nguyen, Thanh Tam
N1 - Funding Information: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), and by the research project "IIP-Ecosphere", funded by the German Federal Ministry for Economics and Climate Action (BMWK) under funding code 01MK20006A.
PY - 2023
Y1 - 2023
AB - Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is highly relevant for helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting the self-distillation method on the acoustic modality breaks through the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing powerful models at different depths can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process, rather than training from scratch, for self-distillation leads to faster learning and state-of-the-art accuracy on data with small quantities of label information.
KW - adaptive inference
KW - efficient deep learning
KW - efficient edge analytics
KW - self-distillation
KW - speech emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85176292202&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2210.14636
DO - 10.48550/arXiv.2210.14636
M3 - Conference contribution
AN - SCOPUS:85176292202
SN - 978-1-7281-6328-4
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - IEEE International Conference on Acoustics, Speech and Signal Processing
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2 - 4 June 2023 through 10 June 2023
ER -