Fast Yet Effective Speech Emotion Recognition with Self-Distillation

Publication: Contribution to book/report/collected volume/conference proceedings › Conference paper › Research › Peer-reviewed

Authors

  • Zhao Ren
  • Yi Chang
  • Björn W. Schuller
  • Thanh Tam Nguyen

Organisational units

External organisations

  • Imperial College London
  • Universität Augsburg
  • Griffith University

Details

Original language: English
Title of host publication: IEEE International Conference on Acoustics, Speech and Signal Processing
Subtitle: ICASSP 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 5
ISBN (electronic): 9781728163277
ISBN (print): 978-1-7281-6328-4
Publication status: Published - 2023
Event: 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece
Duration: 4 June 2023 - 10 June 2023

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2023-June
ISSN (print): 1520-6149

Abstract

Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is widely applied to help dialogue systems genuinely understand users' emotions and become trustworthy conversational partners. Because speech recordings are lengthy and laborious to annotate, SER also suffers from a shortage of the abundant labelled data that powerful models such as deep neural networks require. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning; however, fine-tuning such models still requires large memory and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method that fine-tunes a pretrained model while simultaneously training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) applying self-distillation to the acoustic modality mitigates the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing the model at different depths enables adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to faster learning and state-of-the-art accuracy on data with small quantities of label information.
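The core idea stated in the abstract, training shallower exits of a pretrained network against both the ground-truth labels and the deepest exit's softened predictions, can be sketched roughly as below. This is an illustrative NumPy sketch, not the authors' implementation; the loss weight `alpha` and temperature `T` are assumed hyperparameters, and each entry of `exit_logits` stands for the output head of one depth of the network.

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled, numerically stable softmax
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, labels):
    # mean negative log-likelihood of the true class
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def kl_div(p_teacher, p_student):
    # mean KL(teacher || student) over the batch
    return (p_teacher * (np.log(p_teacher + 1e-12)
                         - np.log(p_student + 1e-12))).sum(-1).mean()

def self_distillation_loss(exit_logits, labels, alpha=0.5, T=2.0):
    """Combine a hard-label loss at every exit with a distillation term
    that pulls each shallower exit towards the deepest exit's softened
    output; the deepest exit acts as its own teacher."""
    teacher = softmax(exit_logits[-1], T)
    total = 0.0
    for logits in exit_logits:
        ce = cross_entropy(softmax(logits), labels)
        kd = kl_div(teacher, softmax(logits, T))
        total += (1 - alpha) * ce + alpha * (T ** 2) * kd
    return total / len(exit_logits)
```

At inference time, a resource-limited device can then run only the first k blocks of the backbone and read the k-th exit's prediction, which is the accuracy-efficiency trade-off the abstract describes.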

ASJC Scopus subject areas

Cite

Ren, Z., Chang, Y., Schuller, B. W., & Nguyen, T. T. (2023). Fast Yet Effective Speech Emotion Recognition with Self-Distillation. In IEEE International Conference on Acoustics, Speech and Signal Processing: ICASSP 2023 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.48550/arXiv.2210.14636, https://doi.org/10.1109/ICASSP49357.2023.10094895
@inproceedings{6110c9ffde7a4ba2aa1e80daaa504fa5,
title = "Fast Yet Effective Speech Emotion Recognition with Self-Distillation",
abstract = "Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is extremely prevalent in helping dialogue systems to truly understand our emotions and become a trustworthy human conversational partner. Due to the lengthy nature of speech, SER also suffers from the lack of abundant labelled data for powerful models like deep neural networks. Pre-trained complex models on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue achieving a fast yet effective SER is possible with self-distillation, a method of simultaneously fine-tuning a pretrained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) the adoption of self-distillation method upon the acoustic modality breaks through the limited ground-truth of speech data, and outperforms the existing models' performance on an SER dataset; (2) executing powerful models at different depths can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process rather than training from scratch for self-distillation leads to faster learning time and the state-of-the-art accuracy on data with small quantities of label information.",
keywords = "adaptive inference, efficient deep learning, efficient edge analytics, self-distillation, speech emotion recognition",
author = "Zhao Ren and Yi Chang and Schuller, {Bj{\"o}rn W.} and Nguyen, {Thanh Tam}",
note = "Funding Information: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor with grant No. 01DD20003, and the research projects “IIP-Ecosphere”, granted by the German Federal Ministry for Economics and Climate Action (BMWK) via funding code No. 01MK20006A. ; 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 ; Conference date: 04-06-2023 Through 10-06-2023",
year = "2023",
doi = "10.48550/arXiv.2210.14636",
language = "English",
isbn = "978-1-7281-6328-4",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "IEEE International Conference on Acoustics, Speech and Signal Processing",
address = "United States",

}


TY - GEN

T1 - Fast Yet Effective Speech Emotion Recognition with Self-Distillation

AU - Ren, Zhao

AU - Chang, Yi

AU - Schuller, Björn W.

AU - Nguyen, Thanh Tam

N1 - Funding Information: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor with grant No. 01DD20003, and the research projects “IIP-Ecosphere”, granted by the German Federal Ministry for Economics and Climate Action (BMWK) via funding code No. 01MK20006A.

PY - 2023

Y1 - 2023


KW - adaptive inference

KW - efficient deep learning

KW - efficient edge analytics

KW - self-distillation

KW - speech emotion recognition

UR - http://www.scopus.com/inward/record.url?scp=85176292202&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2210.14636

DO - 10.48550/arXiv.2210.14636

M3 - Conference contribution

AN - SCOPUS:85176292202

SN - 978-1-7281-6328-4

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

BT - IEEE International Conference on Acoustics, Speech and Signal Processing

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Y2 - 4 June 2023 through 10 June 2023

ER -