Fast Yet Effective Speech Emotion Recognition with Self-Distillation

Publication: Contribution to book/report/collected volume/conference proceedings › Conference paper › Research › Peer-reviewed

Authors

  • Zhao Ren
  • Yi Chang
  • Björn W. Schuller
  • Thanh Tam Nguyen

Organisational units

External organisations

  • Imperial College London
  • Universität Augsburg
  • Griffith University

Details

Original language: English
Title of host publication: IEEE International Conference on Acoustics, Speech and Signal Processing
Subtitle: ICASSP 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 5
ISBN (electronic): 9781728163277
ISBN (print): 978-1-7281-6328-4
Publication status: Published - 2023
Event: 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 - Rhodes Island, Greece
Duration: 4 June 2023 - 10 June 2023

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume: 2023-June
ISSN (print): 1520-6149

Abstract

Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is widely applied to help dialogue systems genuinely understand users' emotions and become trustworthy conversational partners. Because speech recordings are lengthy and laborious to annotate, SER also suffers from a shortage of the abundant labelled data that powerful models such as deep neural networks require. Complex models pre-trained on large-scale speech datasets have been successfully applied to SER via transfer learning; however, fine-tuning such models still requires large memory and results in low inference efficiency. In this paper, we argue that fast yet effective SER is achievable with self-distillation, a method that fine-tunes a pretrained model while simultaneously training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) applying self-distillation to the acoustic modality mitigates the limited ground truth of speech data and outperforms existing models on an SER dataset; (2) executing the model at different depths enables adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) fine-tuning for self-distillation, rather than training from scratch, leads to faster learning and state-of-the-art accuracy on data with small quantities of label information.
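The core idea stated in the abstract, training shallower exits of a pretrained network against both the ground-truth labels and the deepest exit's softened predictions, can be sketched roughly as below. This is an illustrative NumPy sketch, not the authors' implementation; the loss weight `alpha` and temperature `T` are assumed hyperparameters, and each entry of `exit_logits` stands for the output head of one depth of the network.

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-scaled, numerically stable softmax
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p, labels):
    # mean negative log-likelihood of the true class
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def kl_div(p_teacher, p_student):
    # mean KL(teacher || student) over the batch
    return (p_teacher * (np.log(p_teacher + 1e-12)
                         - np.log(p_student + 1e-12))).sum(-1).mean()

def self_distillation_loss(exit_logits, labels, alpha=0.5, T=2.0):
    """Combine a hard-label loss at every exit with a distillation term
    that pulls each shallower exit towards the deepest exit's softened
    output; the deepest exit acts as its own teacher."""
    teacher = softmax(exit_logits[-1], T)
    total = 0.0
    for logits in exit_logits:
        ce = cross_entropy(softmax(logits), labels)
        kd = kl_div(teacher, softmax(logits, T))
        total += (1 - alpha) * ce + alpha * (T ** 2) * kd
    return total / len(exit_logits)
```

At inference time, a resource-limited device can then run only the first k blocks of the backbone and read the k-th exit's prediction, which is the accuracy-efficiency trade-off the abstract describes.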

ASJC Scopus subject areas

Cite

Ren, Z., Chang, Y., Schuller, B. W., & Nguyen, T. T. (2023). Fast Yet Effective Speech Emotion Recognition with Self-Distillation. In IEEE International Conference on Acoustics, Speech and Signal Processing: ICASSP 2023 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2023-June). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.48550/arXiv.2210.14636, https://doi.org/10.1109/ICASSP49357.2023.10094895
@inproceedings{6110c9ffde7a4ba2aa1e80daaa504fa5,
title = "Fast Yet Effective Speech Emotion Recognition with Self-Distillation",
abstract = "Speech emotion recognition (SER) is the task of recognising humans' emotional states from speech. SER is extremely prevalent in helping dialogue systems to truly understand our emotions and become a trustworthy human conversational partner. Due to the lengthy nature of speech, SER also suffers from the lack of abundant labelled data for powerful models like deep neural networks. Pre-trained complex models on large-scale speech datasets have been successfully applied to SER via transfer learning. However, fine-tuning complex models still requires large memory space and results in low inference efficiency. In this paper, we argue achieving a fast yet effective SER is possible with self-distillation, a method of simultaneously fine-tuning a pretrained model and training shallower versions of itself. The benefits of our self-distillation framework are threefold: (1) the adoption of self-distillation method upon the acoustic modality breaks through the limited ground-truth of speech data, and outperforms the existing models' performance on an SER dataset; (2) executing powerful models at different depths can achieve adaptive accuracy-efficiency trade-offs on resource-limited edge devices; (3) a new fine-tuning process rather than training from scratch for self-distillation leads to faster learning time and the state-of-the-art accuracy on data with small quantities of label information.",
keywords = "adaptive inference, efficient deep learning, efficient edge analytics, self-distillation, speech emotion recognition",
author = "Zhao Ren and Yi Chang and Schuller, {Bj{\"o}rn W.} and Nguyen, {Thanh Tam}",
note = "Funding Information: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor with grant No. 01DD20003, and the research projects “IIP-Ecosphere”, granted by the German Federal Ministry for Economics and Climate Action (BMWK) via funding code No. 01MK20006A. ; 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023 ; Conference date: 04-06-2023 Through 10-06-2023",
year = "2023",
doi = "10.48550/arXiv.2210.14636",
language = "English",
isbn = "978-1-7281-6328-4",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
booktitle = "IEEE International Conference on Acoustics, Speech and Signal Processing",
address = "United States",

}


TY - GEN

T1 - Fast Yet Effective Speech Emotion Recognition with Self-Distillation

AU - Ren, Zhao

AU - Chang, Yi

AU - Schuller, Björn W.

AU - Nguyen, Thanh Tam

N1 - Funding Information: This research was funded by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor with grant No. 01DD20003, and the research projects “IIP-Ecosphere”, granted by the German Federal Ministry for Economics and Climate Action (BMWK) via funding code No. 01MK20006A.

PY - 2023

Y1 - 2023


KW - adaptive inference

KW - efficient deep learning

KW - efficient edge analytics

KW - self-distillation

KW - speech emotion recognition

UR - http://www.scopus.com/inward/record.url?scp=85176292202&partnerID=8YFLogxK

U2 - 10.48550/arXiv.2210.14636

DO - 10.48550/arXiv.2210.14636

M3 - Conference contribution

AN - SCOPUS:85176292202

SN - 978-1-7281-6328-4

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

BT - IEEE International Conference on Acoustics, Speech and Signal Processing

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 48th IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023

Y2 - 4 June 2023 through 10 June 2023

ER -