Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition

Publication: Contribution to journal; Conference article in journal; Research; Peer-reviewed

Authors

  • Lars Rumberg
  • Christopher Gebauer
  • Hanna Ehlert
  • Maren Wallbaum
  • Ulrike Lüdtke
  • Jörn Ostermann

Details

Original language: English
Pages (from - to): 4583-4587
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
Publication status: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023

Abstract

Predictive uncertainty estimation for deep neural networks is important when their outputs are used for high-stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach that accounts for the fact that not all frame-level changes lead to a token-level change after CTC decoding. The approach shows promising performance in predicting recognition errors on TIMIT, Mozilla Common Voice (MCV), and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80% of a wav2vec2.0 model's errors on MCV by selecting 10% of the tokens. By re-annotating 500 utterances of kidsTALC, we further show that the predictive uncertainty estimate relates to the uncertainty of a human annotator.
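
Two ideas from the abstract can be illustrated with a small sketch. This is not the authors' method, which the abstract does not spell out. It only shows the standard CTC collapse rule (merge repeated frame labels, then drop blanks) and one plausible way to compute "share of errors found when selecting a fraction of the tokens". The blank symbol `_`, the function names, and the toy numbers are all assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for readability


def ctc_greedy_collapse(frame_labels, blank=BLANK):
    """Standard CTC collapse: merge repeated frame labels, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def error_recall_at_budget(uncertainties, is_error, budget):
    """Share of all errors found among the `budget` fraction of tokens
    with the highest uncertainty (an assumed reading of the metric)."""
    n_select = int(round(budget * len(uncertainties)))
    ranked = sorted(range(len(uncertainties)),
                    key=lambda i: uncertainties[i], reverse=True)
    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0
    return sum(is_error[i] for i in ranked[:n_select]) / total_errors


# Toy frame-level label sequence (one label per frame).
frames = list("__hh_e_ll_l_oo")
decoded = "".join(ctc_greedy_collapse(frames))

# Changing frame 2 from "h" to blank leaves the token sequence unchanged,
# because frame 3 still carries "h".
harmless = frames.copy()
harmless[2] = BLANK
decoded_harmless = "".join(ctc_greedy_collapse(harmless))

# Changing frame 9 from blank to "l" merges the two "l" runs,
# so this frame-level change does alter the token sequence.
harmful = frames.copy()
harmful[9] = "l"
decoded_harmful = "".join(ctc_greedy_collapse(harmful))

# Toy token-level uncertainties and error flags (made-up values).
toy_uncertainty = [0.9, 0.1, 0.8, 0.2, 0.05, 0.3, 0.1, 0.2, 0.15, 0.05]
toy_is_error = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
recall_10 = error_recall_at_budget(toy_uncertainty, toy_is_error, 0.1)
recall_20 = error_recall_at_budget(toy_uncertainty, toy_is_error, 0.2)
```

In this toy setting the first frame change is invisible after decoding while the second removes a token, which is the kind of distinction a frame-level uncertainty measure alone would miss.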

Cite this

Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. / Rumberg, Lars; Gebauer, Christopher; Ehlert, Hanna et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, pp. 4583-4587.

Rumberg, L, Gebauer, C, Ehlert, H, Wallbaum, M, Lüdtke, U & Ostermann, J 2023, 'Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023-August, pp. 4583-4587. https://doi.org/10.21437/Interspeech.2023-907
Rumberg, L., Gebauer, C., Ehlert, H., Wallbaum, M., Lüdtke, U., & Ostermann, J. (2023). Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023-August, 4583-4587. https://doi.org/10.21437/Interspeech.2023-907
Rumberg L, Gebauer C, Ehlert H, Wallbaum M, Lüdtke U, Ostermann J. Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023;2023-August:4583-4587. doi: 10.21437/Interspeech.2023-907
Rumberg, Lars ; Gebauer, Christopher ; Ehlert, Hanna et al. / Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023 ; Vol. 2023-August. pp. 4583-4587.
@article{3aa977af3b48412a81492a806bdcfcea,
title = "Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition",
abstract = "Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.",
keywords = "Automatic Speech Recognition, Children's speech, Uncertainty",
author = "Lars Rumberg and Christopher Gebauer and Hanna Ehlert and Maren Wallbaum and Ulrike L{\"u}dtke and J{\"o}rn Ostermann",
year = "2023",
doi = "10.21437/Interspeech.2023-907",
language = "English",
volume = "2023-August",
pages = "4583--4587",
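note = "24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}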

TY - JOUR

T1 - Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition

AU - Rumberg, Lars

AU - Gebauer, Christopher

AU - Ehlert, Hanna

AU - Wallbaum, Maren

AU - Lüdtke, Ulrike

AU - Ostermann, Jörn

PY - 2023

Y1 - 2023

AB - Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.

KW - Automatic Speech Recognition

KW - Children's speech

KW - Uncertainty

UR - http://www.scopus.com/inward/record.url?scp=85171598603&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2023-907

DO - 10.21437/Interspeech.2023-907

M3 - Conference article

AN - SCOPUS:85171598603

VL - 2023-August

SP - 4583

EP - 4587

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

T2 - 24th International Speech Communication Association, Interspeech 2023

Y2 - 20 August 2023 through 24 August 2023

ER -
