Details
Original language | English |
---|---|
Pages (from - to) | 4578-4582 |
Number of pages | 5 |
Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
Volume | 2023-August |
Publication status | Published - 2023 |
Event | 24th Annual Conference of the International Speech Communication Association, Interspeech 2023 - Dublin, Ireland. Duration: 20 Aug 2023 → 24 Aug 2023 |
Abstract
The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and the frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification models, attention-based encoder-decoders, and Wav2Vec 2.0, a powerful context representation. By exploiting their diversity we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.
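The abstract refers to "a common combination scheme" without naming it here; as a hedged illustration only (not the paper's implementation), a ROVER-style plurality vote over phone hypotheses that have already been aligned across systems might look like the following Python sketch. The function name, the `<eps>` padding token, and the pre-aligned input are assumptions for illustration.

```python
from collections import Counter

def combine_transcripts(aligned_hypotheses, null_token="<eps>"):
    """Plurality vote over phone hypotheses from several ASR systems.

    `aligned_hypotheses` is a list of equal-length phone sequences,
    one per system, already aligned so that insertions/deletions are
    padded with `null_token`.
    """
    combined = []
    for slot in zip(*aligned_hypotheses):           # one voting slot per aligned position
        phone, _ = Counter(slot).most_common(1)[0]  # plurality vote among the systems
        if phone != null_token:                     # skip slots where "no phone" wins
            combined.append(phone)
    return combined

# Toy example: three systems disagree on the vowel; the majority wins.
hypotheses = [
    ["k", "ae", "t"],
    ["k", "ah", "t"],
    ["k", "ae", "t"],
]
print(combine_transcripts(hypotheses))  # ['k', 'ae', 't']
```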
ASJC Scopus subject areas
- Arts and Humanities (all)
  - Language and Linguistics
- Computer Science (all)
  - Human-Computer Interaction
- Computer Science (all)
  - Signal Processing
- Computer Science (all)
  - Software
- Mathematics (all)
  - Modelling and Simulation
Cite
- Standard
- Harvard
- APA
- Vancouver
- BibTeX
- RIS
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2023-August, 2023, pp. 4578-4582.
Publication: Contribution to journal › Conference article in journal › Research › Peer review
TY - JOUR
T1 - Exploiting Diversity of Automatic Transcripts from Distinct Speech Recognition Techniques for Children's Speech
AU - Gebauer, Christopher
AU - Rumberg, Lars
AU - Ehlert, Hanna
AU - Lüdtke, Ulrike
AU - Ostermann, Jörn
PY - 2023
Y1 - 2023
N2 - The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and the frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification models, attention-based encoder-decoders, and Wav2Vec 2.0, a powerful context representation. By exploiting their diversity we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.
AB - The recent advances in automatic speech recognition (ASR) technologies using end-to-end machine learning do not transfer well to children's speech. One cause is the high pronunciation variability and the frequent violations of grammatical or lexical rules, which impede the successful use of language models or powerful context representations. Applying these methods affects the nature of the resulting transcript rather than improving the overall recognition performance. In this work we analyze the diversity of the transcripts from distinct ASR systems for children's speech and exploit it by applying a common combination scheme. We consider systems with varying degrees of context: greedily decoded and lexicon-constrained connectionist temporal classification models, attention-based encoder-decoders, and Wav2Vec 2.0, a powerful context representation. By exploiting their diversity we achieve a relative improvement of 17.8% in phone recognition compared to the best single system.
KW - children's speech
KW - model combination
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85171588606&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2023-926
DO - 10.21437/Interspeech.2023-926
M3 - Conference article
AN - SCOPUS:85171588606
VL - 2023-August
SP - 4578
EP - 4582
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SN - 2308-457X
T2 - 24th Annual Conference of the International Speech Communication Association, Interspeech 2023
Y2 - 20 August 2023 through 24 August 2023
ER -