Details
Original language | English |
---|---|
Pages (from - to) | 44-56 |
Number of pages | 13 |
Journal | Speech communication |
Volume | 106 |
Early online date | 26 Nov 2018 |
Publication status | Published - Jan 2019 |
Abstract
In several applications of machine listening, predicting how well an automatic speech recognition system will perform before the actual decoding enables the system to adapt to unseen acoustic characteristics dynamically. Feedback about speech quality, for instance, could allow modern hearing aids to select a speech source in complex acoustic scenes with the aim of enhancing the speech intelligibility of a target speaker. In this study, we look at different performance measures to estimate the word error rates of simulated behind-the-ear hearing aid signals and detect the azimuth angle of the target source in 180-degree spatial scenes. These measures derive from phoneme posterior probabilities produced by a deep neural network acoustic model. However, the more complex the model is, the more computationally expensive it becomes to obtain these measures; therefore, we assess how the model size affects prediction performance. Our findings suggest measures derived from smaller nets are suitable to predict error rates of more complex models reliably enough to be implemented in hearing aid hardware.
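To make the idea concrete: performance measures of this kind score how confident the acoustic model is over time, which only requires the DNN forward pass and not a full decoding run. The Python/NumPy sketch below shows one such measure, the mean per-frame entropy of the phoneme posteriorgram; the function name, the (frames × phonemes) shape, and the toy data are illustrative assumptions, not the specific measures defined in the paper.

```python
import numpy as np

def mean_posterior_entropy(posteriors, eps=1e-12):
    """Average per-frame entropy of a phoneme posteriorgram.

    posteriors: array of shape (num_frames, num_phonemes), each row a
    probability distribution produced by a DNN acoustic model.
    Peaked (low-entropy) posteriors typically go along with lower word
    error rates; high entropy signals uncertain frames.
    """
    p = np.clip(posteriors, eps, 1.0)                # avoid log(0)
    frame_entropy = -np.sum(p * np.log(p), axis=1)   # entropy per frame
    return float(np.mean(frame_entropy))             # utterance-level score

# Toy usage with hypothetical sizes: 100 frames over a 40-phoneme inventory.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 40))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(mean_posterior_entropy(posteriors))
```

Because such a statistic is computed before decoding, it stays cheap even when the acoustic model whose error rate it predicts is large, which is what makes the hearing-aid hardware scenario described in the abstract plausible.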
ASJC Scopus subject areas
- Computer Science (all)
  - Software
- Mathematics (all)
  - Modelling and Simulation
- Social Sciences (all)
  - Communication
- Arts and Humanities (all)
  - Language and Linguistics
- Social Sciences (all)
  - Linguistics and Language
- Computer Science (all)
  - Computer Vision and Pattern Recognition
- Computer Science (all)
  - Computer Science Applications
Cite
DNN-based performance measures for predicting error rates in automatic speech recognition and optimizing hearing aid parameters. / Castro Martinez, A.M.; Gerlach, Lukas; Payá-Vayá, Guillermo; Hermansky, Hynek; Ooster, Jasper; Meyer, Bernd T.
In: Speech communication, Vol. 106, 01.2019, p. 44-56.
Publication: Contribution to journal › Article › Research › Peer-reviewed
TY - JOUR
T1 - DNN-based performance measures for predicting error rates in automatic speech recognition and optimizing hearing aid parameters
AU - Castro Martinez, A.M.
AU - Gerlach, Lukas
AU - Payá-Vayá, Guillermo
AU - Hermansky, Hynek
AU - Ooster, Jasper
AU - Meyer, Bernd T.
N1 - Funding Information: This work was funded by the DFG (Cluster of Excellence 1077/1 Hearing4All (http://hearing4all.eu)).
PY - 2019/1
Y1 - 2019/1
N2 - In several applications of machine listening, predicting how well an automatic speech recognition system will perform before the actual decoding enables the system to adapt to unseen acoustic characteristics dynamically. Feedback about speech quality, for instance, could allow modern hearing aids to select a speech source in complex acoustic scenes with the aim of enhancing the speech intelligibility of a target speaker. In this study, we look at different performance measures to estimate the word error rates of simulated behind-the-ear hearing aid signals and detect the azimuth angle of the target source in 180-degree spatial scenes. These measures derive from phoneme posterior probabilities produced by a deep neural network acoustic model. However, the more complex the model is, the more computationally expensive it becomes to obtain these measures; therefore, we assess how the model size affects prediction performance. Our findings suggest measures derived from smaller nets are suitable to predict error rates of more complex models reliably enough to be implemented in hearing aid hardware.
AB - In several applications of machine listening, predicting how well an automatic speech recognition system will perform before the actual decoding enables the system to adapt to unseen acoustic characteristics dynamically. Feedback about speech quality, for instance, could allow modern hearing aids to select a speech source in complex acoustic scenes with the aim of enhancing the speech intelligibility of a target speaker. In this study, we look at different performance measures to estimate the word error rates of simulated behind-the-ear hearing aid signals and detect the azimuth angle of the target source in 180-degree spatial scenes. These measures derive from phoneme posterior probabilities produced by a deep neural network acoustic model. However, the more complex the model is, the more computationally expensive it becomes to obtain these measures; therefore, we assess how the model size affects prediction performance. Our findings suggest measures derived from smaller nets are suitable to predict error rates of more complex models reliably enough to be implemented in hearing aid hardware.
KW - Automatic speech recognition
KW - Hearing aids
KW - Performance monitoring
KW - Spatial filtering
UR - http://www.scopus.com/inward/record.url?scp=85057441627&partnerID=8YFLogxK
U2 - 10.1016/j.specom.2018.11.006
DO - 10.1016/j.specom.2018.11.006
M3 - Article
VL - 106
SP - 44
EP - 56
JO - Speech communication
JF - Speech communication
SN - 0167-6393
ER -