Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition

Research output: Contribution to journal › Conference article › Research › peer review

Authors

  • Lars Rumberg
  • Christopher Gebauer
  • Hanna Ehlert
  • Maren Wallbaum
  • Ulrike Lüdtke
  • Jörn Ostermann

Details

Original language: English
Pages (from-to): 4583-4587
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
Publication status: Published - 2023
Event: 24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 to 24 Aug 2023

Abstract

Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high-stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach that takes into account that not all changes at frame level lead to a change at token level after CTC decoding. The approach shows promising performance for predicting recognition errors on TIMIT, Mozilla Common Voice (MCV), and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show that the predictive uncertainty estimate relates to the uncertainty of a human annotator by re-annotating 500 utterances of kidsTALC.
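
The observation that frame-level changes do not always change the decoded tokens can be illustrated with a minimal sketch of greedy CTC collapsing (merge repeated labels, then drop blanks). This is not the authors' uncertainty estimator, only an illustration of the decoding property it builds on; the blank symbol `_`, the function name `ctc_collapse`, and the toy phone sequences are assumptions made for this example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(frames, blank=BLANK):
    """Collapse a frame-level label path: merge repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frames)]
    return [label for label in merged if label != blank]


# Toy frame-level paths (one label per frame), invented for illustration.
original = ["_", "k", "k", "_", "ae", "ae", "t", "_"]
shifted = ["_", "k", "_", "_", "ae", "ae", "t", "t"]  # two frames differ
changed = ["_", "k", "k", "_", "ae", "ih", "t", "_"]  # one frame differs

print(ctc_collapse(original))  # ['k', 'ae', 't']
print(ctc_collapse(shifted))   # ['k', 'ae', 't']
print(ctc_collapse(changed))   # ['k', 'ae', 'ih', 't']
```

In this toy case, `shifted` differs from `original` in two frames yet decodes to the same tokens, while `changed` differs in a single frame and yields a different token sequence. A frame-level uncertainty measure alone would therefore not distinguish the two situations.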

Keywords

    Automatic Speech Recognition, Children's speech, Uncertainty

Cite this

Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. / Rumberg, Lars; Gebauer, Christopher; Ehlert, Hanna et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, p. 4583-4587.

Rumberg, L, Gebauer, C, Ehlert, H, Wallbaum, M, Lüdtke, U & Ostermann, J 2023, 'Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023-August, pp. 4583-4587. https://doi.org/10.21437/Interspeech.2023-907
Rumberg, L., Gebauer, C., Ehlert, H., Wallbaum, M., Lüdtke, U., & Ostermann, J. (2023). Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023-August, 4583-4587. https://doi.org/10.21437/Interspeech.2023-907
Rumberg L, Gebauer C, Ehlert H, Wallbaum M, Lüdtke U, Ostermann J. Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023;2023-August:4583-4587. doi: 10.21437/Interspeech.2023-907
Rumberg, Lars ; Gebauer, Christopher ; Ehlert, Hanna et al. / Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023 ; Vol. 2023-August. pp. 4583-4587.
@article{3aa977af3b48412a81492a806bdcfcea,
title = "Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition",
abstract = "Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high-stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach that takes into account that not all changes at frame level lead to a change at token level after CTC decoding. The approach shows promising performance for predicting recognition errors on TIMIT, Mozilla Common Voice (MCV), and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show that the predictive uncertainty estimate relates to the uncertainty of a human annotator by re-annotating 500 utterances of kidsTALC.",
keywords = "Automatic Speech Recognition, Children's speech, Uncertainty",
author = "Lars Rumberg and Christopher Gebauer and Hanna Ehlert and Maren Wallbaum and Ulrike L{\"u}dtke and J{\"o}rn Ostermann",
year = "2023",
doi = "10.21437/Interspeech.2023-907",
language = "English",
volume = "2023-August",
pages = "4583--4587",
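note = "24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}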

TY  - JOUR
T1  - Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition
AU  - Rumberg, Lars
AU  - Gebauer, Christopher
AU  - Ehlert, Hanna
AU  - Wallbaum, Maren
AU  - Lüdtke, Ulrike
AU  - Ostermann, Jörn
PY  - 2023
Y1  - 2023
AB  - Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high-stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach that takes into account that not all changes at frame level lead to a change at token level after CTC decoding. The approach shows promising performance for predicting recognition errors on TIMIT, Mozilla Common Voice (MCV), and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show that the predictive uncertainty estimate relates to the uncertainty of a human annotator by re-annotating 500 utterances of kidsTALC.
KW  - Automatic Speech Recognition
KW  - Children's speech
KW  - Uncertainty
UR  - http://www.scopus.com/inward/record.url?scp=85171598603&partnerID=8YFLogxK
U2  - 10.21437/Interspeech.2023-907
DO  - 10.21437/Interspeech.2023-907
M3  - Conference article
AN  - SCOPUS:85171598603
VL  - 2023-August
SP  - 4583
EP  - 4587
JO  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN  - 2308-457X
T2  - 24th International Speech Communication Association, Interspeech 2023
Y2  - 20 August 2023 through 24 August 2023
ER  -
