Details
Original language | English |
---|---|
Pages (from - to) | 4578-4582 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2023-August |
Publication status | Published - 2023 |
Event | 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland. Duration: 20 Aug 2023 → 24 Aug 2023 |
Abstract
The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and the frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification models, attention-based encoder-decoders, and Wav2Vec 2.0, a powerful context representation. By exploiting their diversity we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.
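The abstract refers to "a common combination scheme" without naming it here; as a hedged illustration only (not the paper's implementation), a ROVER-style plurality vote over phone hypotheses that have already been aligned across systems might look like the following Python sketch. The function name, the `<eps>` padding token, and the pre-aligned input are assumptions for illustration.

```python
from collections import Counter

def combine_transcripts(aligned_hypotheses, null_token="<eps>"):
    """Plurality vote over phone hypotheses from several ASR systems.

    `aligned_hypotheses` is a list of equal-length phone sequences,
    one per system, already aligned so that insertions/deletions are
    padded with `null_token`.
    """
    combined = []
    for slot in zip(*aligned_hypotheses):           # one voting slot per aligned position
        phone, _ = Counter(slot).most_common(1)[0]  # plurality vote among the systems
        if phone != null_token:                     # skip slots where "no phone" wins
            combined.append(phone)
    return combined

# Toy example: three systems disagree on the vowel; the majority wins.
hypotheses = [
    ["k", "ae", "t"],
    ["k", "ah", "t"],
    ["k", "ae", "t"],
]
print(combine_transcripts(hypotheses))  # ['k', 'ae', 't']
```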
ASJC Scopus subject areas
- Arts and Humanities (all)
  - Language and Linguistics
- Computer Science (all)
  - Human-Computer Interaction
- Computer Science (all)
  - Signal Processing
- Computer Science (all)
  - Software
- Mathematics (all)
  - Modelling and Simulation
Cite
- Standard
- Harvard
- APA
- Vancouver
- BibTeX
- RIS
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, pp. 4578-4582.
Publication: Contribution to journal › Conference article in journal › Research › Peer review
TY - JOUR
T1 - Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech
AU - Gebauer, Christopher
AU - Rumberg, Lars
AU - Ehlert, Hanna
AU - Lüdtke, Ulrike
AU - Ostermann, Jörn
PY - 2023
Y1 - 2023
N2 - The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and the frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification models, attention-based encoder-decoders, and Wav2Vec 2.0, a powerful context representation. By exploiting their diversity we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.
AB - The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and the frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification models, attention-based encoder-decoders, and Wav2Vec 2.0, a powerful context representation. By exploiting their diversity we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.
KW - children's speech
KW - model combination
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85171588606&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-926
DO - 10.21437/Interspeech.2023-926
M3 - Conference article
AN - SCOPUS:85171588606
VL - 2023-August
SP - 4578
EP - 4582
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -