Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition

Lars Rumberg; Christopher Gebauer; Hanna Ehlert; Maren Wallbaum; Ulrike Lüdtke; Jörn Ostermann

doi:10.21437/Interspeech.2023-907

Details

Original language	English
Pages (from - to)	4583-4587
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2023-August
Publication status	Published - 2023
Event	24th International Speech Communication Association, Interspeech 2023 - Dublin, Ireland Duration: 20 Aug 2023 → 24 Aug 2023

Abstract

Predictive uncertainty estimation for deep neural networks is important when their outputs are used for high-stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach that takes into account that not all changes at the frame level lead to a change at the token level after CTC decoding. The approach shows promising performance for predicting recognition errors on TIMIT, Mozilla Common Voice (MCV), and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. By re-annotating 500 utterances of kidsTALC, we further show that the predictive uncertainty estimate relates to the uncertainty of a human annotator.
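
The observation that not every frame-level change survives CTC decoding can be illustrated with a minimal sketch. It uses standard greedy CTC decoding (merge repeated frame labels, then drop blanks) and is not the authors' uncertainty method; the blank symbol `_`, the function name `ctc_collapse`, and the toy frame sequences are assumptions made only for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this illustration


def ctc_collapse(frames):
    """Greedy CTC collapse: merge repeated frame labels, then drop blanks."""
    merged = [label for label, _ in groupby(frames)]
    return [label for label in merged if label != BLANK]


# Toy frame-level label sequence (one label per frame).
original = ["_", "k", "k", "_", "a", "a", "t", "_"]

# Frame 2 flipped from "k" to "_": the decoded tokens stay the same.
harmless = ["_", "k", "_", "_", "a", "a", "t", "_"]

# Frame 4 flipped from "a" to "o": the decoded tokens change.
harmful = ["_", "k", "k", "_", "o", "a", "t", "_"]

print(ctc_collapse(original))
print(ctc_collapse(harmless))
print(ctc_collapse(harmful))
```

In this toy case, the first flip leaves the token sequence untouched while the second inserts an extra token, which is the kind of distinction a token-level uncertainty estimate would have to reflect.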

ASJC Scopus subject areas

Arts and Humanities (all)
Language and Linguistics
Computer Science (all)
Human-Computer Interaction
Computer Science (all)
Signal Processing
Computer Science (all)
Software
Mathematics (all)
Modelling and Simulation

Cite this

Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. / Rumberg, Lars; Gebauer, Christopher; Ehlert, Hanna et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, pp. 4583-4587.

Publication: Contribution to journal › Conference article in journal › Research › Peer-reviewed

Rumberg, L, Gebauer, C, Ehlert, H, Wallbaum, M, Lüdtke, U & Ostermann, J 2023, 'Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023-August, pp. 4583-4587. https://doi.org/10.21437/Interspeech.2023-907

Rumberg, L., Gebauer, C., Ehlert, H., Wallbaum, M., Lüdtke, U., & Ostermann, J. (2023). Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023-August, 4583-4587. https://doi.org/10.21437/Interspeech.2023-907

Rumberg L, Gebauer C, Ehlert H, Wallbaum M, Lüdtke U, Ostermann J. Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023;2023-August:4583-4587. doi: 10.21437/Interspeech.2023-907

Rumberg, Lars ; Gebauer, Christopher ; Ehlert, Hanna et al. / Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023 ; Vol. 2023-August. pp. 4583-4587.

Download

@article{3aa977af3b48412a81492a806bdcfcea,

title = "Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition",

abstract = "Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.",

keywords = "Automatic Speech Recognition, Children's speech, Uncertainty",

author = "Lars Rumberg and Christopher Gebauer and Hanna Ehlert and Maren Wallbaum and Ulrike L{\"u}dtke and J{\"o}rn Ostermann",

year = "2023",

doi = "10.21437/Interspeech.2023-907",

language = "English",

volume = "2023-August",

pages = "4583--4587",

note = "24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",

}

Download

TY - JOUR

T1 - Uncertainty Estimation for Connectionist Temporal Classification Based Automatic Speech Recognition

AU - Rumberg, Lars

AU - Gebauer, Christopher

AU - Ehlert, Hanna

AU - Wallbaum, Maren

AU - Lüdtke, Ulrike

AU - Ostermann, Jörn

PY - 2023

Y1 - 2023

N2 - Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.

AB - Predictive uncertainty estimation of deep neural networks is important when their outputs are used for high stakes decision making. We investigate token-level uncertainty of connectionist temporal classification (CTC) based automatic speech recognition models. We propose an approach, which considers that not all changes at frame-level lead to a change at token-level after CTC decoding. The approach shows promising performance for prediction of recognition errors on TIMIT, Mozilla Common Voice (MCV) and kidsTALC, a corpus of children's speech, using two different model architectures, while introducing only negligible computational overhead. Our approach identifies over 80 % of a wav2vec2.0 model's errors on MCV by selecting 10 % of the tokens. We further show, that the predictive uncertainty estimate relates to the uncertainty of a human annotator, by re-annotating 500 utterances of kidsTALC.

KW - Automatic Speech Recognition

KW - Children's speech

KW - Uncertainty

UR - http://www.scopus.com/inward/record.url?scp=85171598603&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2023-907

DO - 10.21437/Interspeech.2023-907

M3 - Conference article

AN - SCOPUS:85171598603

VL - 2023-August

SP - 4583

EP - 4587

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

T2 - 24th International Speech Communication Association, Interspeech 2023

Y2 - 20 August 2023 through 24 August 2023

ER -
