Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition

Lars Rumberg; Christopher Gebauer; Hanna Ehlert; Maren Wallbaum; Ulrike Lüdtke; Jörn Ostermann

doi:10.21437/Interspeech.2023-907

Details

Original language	English
Pages (from-to)	4583-4587
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2023-August
Publication status	Published - 2023
Event	24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration	20 Aug 2023 → 24 Aug 2023

Abstract

Predictive uncertainty estimation for deep neural networks is important when their outputs are used for high-stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach that takes into account that not all changes at the frame level lead to a change at the token level after CTC decoding. The approach shows promising performance in predicting recognition errors on TIMIT, Mozilla Common Voice (MCV), and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80% of a wav2vec2.0 model's errors on MCV by selecting 10% of the tokens. By re-annotating 500 utterances of kidsTALC, we further show that the predictive uncertainty estimate relates to the uncertainty of a human annotator.
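
The two ideas in the abstract that lend themselves to a small illustration are (a) that CTC decoding can absorb a frame-level change without altering the token sequence, and (b) how a figure such as "share of errors found when selecting a fixed share of tokens" can be computed. The Python sketch below is not the authors' method, which this record does not describe in detail. It uses standard greedy CTC collapsing (merge repeats, drop blanks), and the blank symbol `_`, the function names, and all toy values are assumptions made for illustration only.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(frames):
    """Greedy CTC collapse: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frames)]
    return [label for label in merged if label != BLANK]


def error_recall_at_budget(uncertainty, is_error, budget):
    """Share of all errors found among the `budget` fraction of tokens
    with the highest uncertainty scores."""
    n_select = int(round(budget * len(uncertainty)))
    ranked = sorted(range(len(uncertainty)),
                    key=lambda i: uncertainty[i], reverse=True)
    total_errors = sum(is_error)
    if total_errors == 0:
        return 0.0
    found = sum(is_error[i] for i in ranked[:n_select])
    return found / total_errors


# (a) Frame-level changes that do or do not change the decoded tokens.
reference = list("hh_e_ll_lo")
harmless = list("h__e_ll_lo")  # frame 1: 'h' -> blank, tokens unchanged
harmful = list("hh_e_llllo")   # frame 7: blank -> 'l', the two l's merge

print("".join(ctc_collapse(reference)))  # hello
print("".join(ctc_collapse(harmless)))   # hello
print("".join(ctc_collapse(harmful)))    # helo

# (b) Toy uncertainty scores and error flags for ten tokens (made-up values).
toy_uncertainty = [0.9, 0.1, 0.2, 0.8, 0.05, 0.3, 0.1, 0.2, 0.15, 0.1]
toy_is_error = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]

print(error_recall_at_budget(toy_uncertainty, toy_is_error, 0.2))  # 0.5
print(error_recall_at_budget(toy_uncertainty, toy_is_error, 0.3))  # 1.0
```

In the first pair of examples only the second frame change alters the output, which is the kind of distinction a purely frame-level uncertainty measure would miss. The toy scores in part (b) are not results from the paper.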

Keywords

Automatic Speech Recognition, Children's speech, Uncertainty

ASJC Scopus subject areas

Arts and Humanities(all): Language and Linguistics
Computer Science(all): Human-Computer Interaction
Computer Science(all): Signal Processing
Computer Science(all): Software
Mathematics(all): Modelling and Simulation

Cite this

Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. / Rumberg, Lars; Gebauer, Christopher; Ehlert, Hanna et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, p. 4583-4587.

Research output: Contribution to journal › Conference article › Research › peer review

Rumberg, L, Gebauer, C, Ehlert, H, Wallbaum, M, Lüdtke, U & Ostermann, J 2023, 'Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023-August, pp. 4583-4587. https://doi.org/10.21437/Interspeech.2023-907

Rumberg, L., Gebauer, C., Ehlert, H., Wallbaum, M., Lüdtke, U., & Ostermann, J. (2023). Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023-August, 4583-4587. https://doi.org/10.21437/Interspeech.2023-907

Rumberg L, Gebauer C, Ehlert H, Wallbaum M, Lüdtke U, Ostermann J. Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023;2023-August:4583-4587. doi: 10.21437/Interspeech.2023-907

Rumberg, Lars ; Gebauer, Christopher ; Ehlert, Hanna et al. / Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023 ; Vol. 2023-August. pp. 4583-4587.

Download

@article{3aa977af3b48412a81492a806bdcfcea,
title = "Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition",
abstract = "Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.",
keywords = "Automatic Speech Recognition, Children's speech, Uncertainty",
author = "Lars Rumberg and Christopher Gebauer and Hanna Ehlert and Maren Wallbaum and Ulrike L{\"u}dtke and J{\"o}rn Ostermann",
year = "2023",
doi = "10.21437/Interspeech.2023-907",
language = "English",
volume = "2023-August",
pages = "4583--4587",
note = "24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",
}

Download

TY  - JOUR
T1  - Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition
AU  - Rumberg, Lars
AU  - Gebauer, Christopher
AU  - Ehlert, Hanna
AU  - Wallbaum, Maren
AU  - Lüdtke, Ulrike
AU  - Ostermann, Jörn
PY  - 2023
Y1  - 2023
N2  - Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.
AB  - Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.
KW  - Automatic Speech Recognition
KW  - Children's speech
KW  - Uncertainty
UR  - http://www.scopus.com/inward/record.url?scp=85171598603&partnerID=8YFLogxK
U2  - 10.21437/Interspeech.2023-907
DO  - 10.21437/Interspeech.2023-907
M3  - Conference article
AN  - SCOPUS:85171598603
VL  - 2023-August
SP  - 4583
EP  - 4587
JO  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF  - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN  - 2308-457X
T2  - 24th International Speech Communication Association, Interspeech 2023
Y2  - 20 August 2023 through 24 August 2023
ER  -

By the same author(s)

MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression

Acoustic Emission Detection in Noisy Environments using Linear Prediction

Genie: the first open-source ISO/IEC encoder for genomic data

On the Rate-Distortion-Complexity Trade-Offs of Neural Video Coding

Self-supervised domain adaptation for machinery remaining useful life prediction