Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech

Publication: Contribution to journal › Conference article › Research › Peer-reviewed

Authors

Christopher Gebauer, Lars Rumberg, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann

Details

Original language: English
Pages (from - to): 4578-4582
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
Publication status: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 - 24 Aug 2023

Abstract

Recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and the frequent violation of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods alters the nature of the resulting transcript rather than improving the overall recognition performance. In this work, we analyze the diversity of the transcripts produced by distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification (CTC) models, attention-based encoder-decoders, and Wav2Vec 2.0, a powerful context representation. By exploiting their diversity, we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.
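The abstract mentions "a common combination scheme" for merging the systems' transcripts without spelling it out on this page; ROVER-style alignment and voting is the usual approach for combining recognizer outputs. The Python sketch below is a simplified, hypothetical illustration of that idea, not the authors' implementation: the names GAP, align, project_onto_backbone, and combine are invented here, insertions relative to the backbone hypothesis are discarded, and no confidence scores are used.

# Hypothetical sketch (not the authors' code): ROVER-style combination of
# phone transcripts from several ASR systems. Each hypothesis is aligned
# to a backbone hypothesis via Levenshtein alignment, and a per-position
# majority vote selects the output phone.
from collections import Counter

GAP = "<eps>"


def align(ref, hyp):
    """Levenshtein-align two phone sequences; return equal-length, gap-padded copies."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    ref_a, hyp_a, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            ref_a.append(ref[i - 1]); hyp_a.append(hyp[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ref_a.append(ref[i - 1]); hyp_a.append(GAP); i -= 1    # deletion in hyp
        else:
            ref_a.append(GAP); hyp_a.append(hyp[j - 1]); j -= 1    # insertion in hyp
    return ref_a[::-1], hyp_a[::-1]


def project_onto_backbone(backbone, hyp):
    """Return one vote per backbone position; insertions relative to the backbone
    are dropped (a simplification of a full ROVER word transition network)."""
    ref_a, hyp_a = align(backbone, hyp)
    return [h for r, h in zip(ref_a, hyp_a) if r != GAP]


def combine(hypotheses):
    """Majority vote over all systems, using the first hypothesis as the backbone."""
    backbone = list(hypotheses[0])
    columns = [[phone] for phone in backbone]          # the backbone votes for itself
    for hyp in hypotheses[1:]:
        for column, vote in zip(columns, project_onto_backbone(backbone, hyp)):
            column.append(vote)
    result = []
    for column in columns:
        winner, _ = Counter(column).most_common(1)[0]
        if winner != GAP:                              # a gap majority deletes the phone
            result.append(winner)
    return result


if __name__ == "__main__":
    # Toy example: three systems' phone hypotheses for the same utterance.
    hypotheses = [
        ["k", "a", "t"],
        ["k", "a", "p", "t"],
        ["g", "a", "t"],
    ]
    print(combine(hypotheses))   # -> ['k', 'a', 't']

Running the toy example shows how majority agreement across diverse hypotheses can cancel individual systems' errors, which is the effect the paper exploits for phone recognition on children's speech.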


Cite

Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech. / Gebauer, Christopher; Rumberg, Lars; Ehlert, Hanna et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, pp. 4578-4582.

Publication: Contribution to journal › Conference article › Research › Peer-reviewed

Gebauer, C, Rumberg, L, Ehlert, H, Lüdtke, U & Ostermann, J 2023, 'Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023-August, pp. 4578-4582. https://doi.org/10.21437/Interspeech.2023-926
Gebauer, C., Rumberg, L., Ehlert, H., Lüdtke, U., & Ostermann, J. (2023). Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023-August, 4578-4582. https://doi.org/10.21437/Interspeech.2023-926
Gebauer C, Rumberg L, Ehlert H, Lüdtke U, Ostermann J. Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023;2023-August:4578-4582. doi: 10.21437/Interspeech.2023-926
Gebauer, Christopher ; Rumberg, Lars ; Ehlert, Hanna et al. / Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023 ; Vol. 2023-August. pp. 4578-4582.
BibTeX
@article{35fbe82d3b1a4b288161bbe60b88b8f1,
title = "Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech",
abstract = "The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and frequent violations of grammatical or lexical rules, which impedes the successful usage of language models or powerful context-representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR-systems for children's speech and exploit it by applying a common combination scheme. We consider systems with various degree of context: Greedily decoded and lexicon-constrained connectionist temporal classification-models, attention-based encoder decoders, and Wav2Vec 2.0, a powerful context-representation. By exploiting their diversity we achieve a relative improvement of 17.8 % on phone recognition compared to the best single system.",
keywords = "children's speech, model combination, speech recognition",
author = "Christopher Gebauer and Lars Rumberg and Hanna Ehlert and Ulrike L{\"u}dtke and J{\"o}rn Ostermann",
year = "2023",
doi = "10.21437/Interspeech.2023-926",
language = "English",
volume = "2023-August",
pages = "4578--4582",
note = "24th International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",

}

RIS

TY - JOUR
T1 - Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech
AU - Gebauer, Christopher
AU - Rumberg, Lars
AU - Ehlert, Hanna
AU - Lüdtke, Ulrike
AU - Ostermann, Jörn
PY - 2023
Y1 - 2023
N2 - The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and frequent violations of grammatical or lexical rules, which impedes the successful usage of language models or powerful context-representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR-systems for children's speech and exploit it by applying a common combination scheme. We consider systems with various degree of context: Greedily decoded and lexicon-constrained connectionist temporal classification-models, attention-based encoder decoders, and Wav2Vec 2.0, a powerful context-representation. By exploiting their diversity we achieve a relative improvement of 17.8 % on phone recognition compared to the best single system.
AB - The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and frequent violations of grammatical or lexical rules, which impedes the successful usage of language models or powerful context-representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR-systems for children's speech and exploit it by applying a common combination scheme. We consider systems with various degree of context: Greedily decoded and lexicon-constrained connectionist temporal classification-models, attention-based encoder decoders, and Wav2Vec 2.0, a powerful context-representation. By exploiting their diversity we achieve a relative improvement of 17.8 % on phone recognition compared to the best single system.
KW - children's speech
KW - model combination
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85171588606&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-926
DO - 10.21437/Interspeech.2023-926
M3 - Conference article
AN - SCOPUS:85171588606
VL - 2023-August
SP - 4578
EP - 4582
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -
