Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech

Research output: Contribution to journal › Conference article › Research › peer review

Authors

Christopher Gebauer, Lars Rumberg, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann

Details

Original language: English
Pages (from-to): 4578-4582
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2023-August
Publication status: Published - 2023
Event: 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland
Duration: 20 Aug 2023 – 24 Aug 2023

Abstract

The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work, we analyze the diversity of the transcripts produced by distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification (CTC) models, attention-based encoder-decoders, and Wav2Vec 2.0, which provides powerful context representations. By exploiting their diversity, we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.
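The abstract refers to "a common combination scheme" without detailing it here. The Python sketch below illustrates one such scheme, ROVER-style majority voting over aligned phone hypotheses, assuming the first hypothesis serves as the alignment anchor. The names (combine_by_voting, align_to_anchor) and the toy phone sequences are hypothetical, and the difflib-based alignment is a simplification of the dynamic-programming alignment and confidence weighting used in a full ROVER combination; it is not the authors' exact method.

from collections import Counter
from difflib import SequenceMatcher

# Placeholder token for "no phone at this slot" in the voting grid.
NULL = "<eps>"

def align_to_anchor(anchor, hyp):
    """Map the tokens of hyp onto the positions of anchor.

    difflib's alignment stands in for the dynamic-programming alignment of a
    full ROVER-style combination. Anchor positions without a counterpart in
    hyp are filled with NULL; extra insertions in hyp are dropped
    (a simplification).
    """
    aligned = [NULL] * len(anchor)
    matcher = SequenceMatcher(None, anchor, hyp, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):
            for i, j in zip(range(a1, a2), range(b1, b2)):
                aligned[i] = hyp[j]
    return aligned

def combine_by_voting(hypotheses):
    """Combine phone sequences from several ASR systems by majority vote.

    hypotheses: list of phone lists; the first entry acts as the anchor
    (e.g., the best single system). Ties are resolved in favour of the anchor.
    """
    anchor = hypotheses[0]
    rows = [anchor] + [align_to_anchor(anchor, h) for h in hypotheses[1:]]
    combined = []
    for slot in zip(*rows):
        counts = Counter(slot)
        best, n = counts.most_common(1)[0]
        if counts[slot[0]] == n:  # anchor wins ties
            best = slot[0]
        if best != NULL:
            combined.append(best)
    return combined

# Toy example with three systems that disagree on two phones.
hyps = [
    "k ae t ah".split(),  # e.g., greedily decoded CTC
    "k ae d ah".split(),  # e.g., lexicon-constrained CTC
    "k ae t aa".split(),  # e.g., attention-based encoder-decoder
]
print(combine_by_voting(hyps))  # ['k', 'ae', 't', 'ah']

Because each system tends to err on different phones, the per-slot vote can recover the correct phone whenever a majority of systems agree, which is the intuition behind exploiting transcript diversity through combination.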

Keywords

    children's speech, model combination, speech recognition

Cite this

Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech. / Gebauer, Christopher; Rumberg, Lars; Ehlert, Hanna et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, p. 4578-4582.

Research output: Contribution to journal › Conference article › Research › peer review

Harvard
Gebauer, C, Rumberg, L, Ehlert, H, Lüdtke, U & Ostermann, J 2023, 'Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech', Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2023-August, pp. 4578-4582. https://doi.org/10.21437/Interspeech.2023-926

APA
Gebauer, C., Rumberg, L., Ehlert, H., Lüdtke, U., & Ostermann, J. (2023). Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2023-August, 4578-4582. https://doi.org/10.21437/Interspeech.2023-926

Vancouver
Gebauer C, Rumberg L, Ehlert H, Lüdtke U, Ostermann J. Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023;2023-August:4578-4582. doi: 10.21437/Interspeech.2023-926

Author
Gebauer, Christopher ; Rumberg, Lars ; Ehlert, Hanna et al. / Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH. 2023 ; Vol. 2023-August. pp. 4578-4582.
BIBTEX
@article{35fbe82d3b1a4b288161bbe60b88b8f1,
title = "Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech",
abstract = "The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work, we analyze the diversity of the transcripts produced by distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification (CTC) models, attention-based encoder-decoders, and Wav2Vec 2.0, which provides powerful context representations. By exploiting their diversity, we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.",
keywords = "children's speech, model combination, speech recognition",
author = "Christopher Gebauer and Lars Rumberg and Hanna Ehlert and Ulrike L{\"u}dtke and J{\"o}rn Ostermann",
year = "2023",
doi = "10.21437/Interspeech.2023-926",
language = "English",
volume = "2023-August",
pages = "4578--4582",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
note = "24th Annual Conference of the International Speech Communication Association, Interspeech 2023 ; Conference date: 20-08-2023 Through 24-08-2023",
}

RIS

TY - JOUR

T1 - Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech

AU - Gebauer, Christopher

AU - Rumberg, Lars

AU - Ehlert, Hanna

AU - Lüdtke, Ulrike

AU - Ostermann, Jörn

PY - 2023

Y1 - 2023

N2 - The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work, we analyze the diversity of the transcripts produced by distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification (CTC) models, attention-based encoder-decoders, and Wav2Vec 2.0, which provides powerful context representations. By exploiting their diversity, we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.

AB - The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work, we analyze the diversity of the transcripts produced by distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification (CTC) models, attention-based encoder-decoders, and Wav2Vec 2.0, which provides powerful context representations. By exploiting their diversity, we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.

KW - children's speech

KW - model combination

KW - speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85171588606&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2023-926

DO - 10.21437/Interspeech.2023-926

M3 - Conference article

AN - SCOPUS:85171588606

VL - 2023-August

SP - 4578

EP - 4582

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023

Y2 - 20 August 2023 through 24 August 2023

ER -
