Details
Original language | English
---|---
Title of host publication | IEEE International Conference on Acoustics, Speech and Signal Processing
Subtitle of host publication | ICASSP 2023
Publisher | Institute of Electrical and Electronics Engineers Inc.
Number of pages | 5
ISBN (electronic) | 978-1-7281-6327-7
ISBN (print) | 978-1-7281-6328-4
Publication status | Published - 2023
Event | 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece. Duration: 4 Jun 2023 → 10 Jun 2023
Publication series
Name | ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
---|---
Volume | 2023-June
ISSN (print) | 1520-6149
Abstract
Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is valuable in helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality mitigates the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing the model at different depths achieves adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to shorter training time and state-of-the-art accuracy on data with small amounts of label information.
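The self-distillation setup described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch, assuming a generic transformer encoder with classifier heads attached at several depths, where the deepest head serves as the teacher for the shallower exits. The backbone, exit placement, head design, and loss weights here are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal self-distillation sketch for SER (illustrative assumptions; not the
# paper's exact backbone, exit placement, or loss weighting).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistilSER(nn.Module):
    """Encoder with exit heads at several depths; the deepest head is the teacher."""

    def __init__(self, dim=256, n_layers=8, exits=(2, 4, 6, 8), n_emotions=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
             for _ in range(n_layers)]
        )
        self.exits = exits
        self.heads = nn.ModuleList([nn.Linear(dim, n_emotions) for _ in exits])

    def forward(self, x, max_exit=None):
        # x: (batch, time, dim) acoustic features. Returns one logits tensor per
        # exit, shallowest first; stop early at `max_exit` for cheap inference.
        logits = []
        for depth, layer in enumerate(self.layers, start=1):
            x = layer(x)
            if depth in self.exits:
                head = self.heads[self.exits.index(depth)]
                logits.append(head(x.mean(dim=1)))  # mean-pool over time
                if max_exit is not None and len(logits) == max_exit:
                    break
        return logits

def self_distillation_loss(logits, labels, alpha=0.5, tau=2.0):
    # Cross-entropy on every exit, plus a temperature-scaled KL term pulling
    # each shallow exit towards the (detached) deepest exit.
    teacher = logits[-1].detach()
    loss = sum(F.cross_entropy(out, labels) for out in logits)
    for student in logits[:-1]:
        loss = loss + alpha * tau ** 2 * F.kl_div(
            F.log_softmax(student / tau, dim=-1),
            F.softmax(teacher / tau, dim=-1),
            reduction="batchmean",
        )
    return loss

# Usage: train with all exits; at inference, pick an exit per latency budget,
# which is the adaptive accuracy-efficiency trade-off named in benefit (2).
model = SelfDistilSER()
features = torch.randn(8, 100, 256)          # dummy batch of acoustic frames
labels = torch.randint(0, 4, (8,))
loss = self_distillation_loss(model(features), labels)
loss.backward()
fast_pred = model(features, max_exit=1)[-1]  # shallowest exit only
```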
Keywords
- adaptive inference
- efficient deep learning
- efficient edge analytics
- self-distillation
- speech emotion recognition
ASJC Scopus subject areas
- Computer Science (all)
- Software
- Signal Processing
- Engineering (all)
- Electrical and Electronic Engineering
Cite this
Ren, Z., Chang, Y., Schuller, B. W., & Nguyen, T. T. (2023). Fast Yet Effective Speech Emotion Recognition with Self-Distillation. In IEEE International Conference on Acoustics, Speech and Signal Processing: ICASSP 2023 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.48550/arXiv.2210.14636
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Fast Yet Effective Speech Emotion Recognition with Self-Distillation
AU - Ren, Zhao
AU - Chang, Yi
AU - Schuller, Björn W.
AU - Nguyen, Thanh Tam
N1 - Funding Information: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), and by the research project "IIP-Ecosphere", funded by the German Federal Ministry for Economic Affairs and Climate Action (BMWK) under funding code 01MK20006A.
PY - 2023
Y1 - 2023
N2 - Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is valuable in helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality mitigates the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing the model at different depths achieves adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to shorter training time and state-of-the-art accuracy on data with small amounts of label information.
AB - Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is valuable in helping dialogue systems truly understand our emotions and become trustworthy human conversational partners. Due to the lengthy nature of speech, SER also suffers from a lack of abundant labelled data for powerful models such as deep neural networks. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method of simultaneously fine-tuning a pre-trained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) adopting self-distillation on the acoustic modality mitigates the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing the model at different depths achieves adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to shorter training time and state-of-the-art accuracy on data with small amounts of label information.
KW - adaptive inference
KW - efficient deep learning
KW - efficient edge analytics
KW - self-distillation
KW - speech emotion recognition
UR - http://www.scopus.com/inward/record.url?scp=85176292202&partnerID=8YFLogxK
U2 - 10.48550/arXiv.2210.14636
DO - 10.48550/arXiv.2210.14636
M3 - Conference contribution
AN - SCOPUS:85176292202
SN - 978-1-7281-6328-4
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
BT - IEEE International Conference on Acoustics, Speech and Signal Processing
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023
Y2 - 4 June 2023 through 10 June 2023
ER -