Details
| Original language | English |
| --- | --- |
| Title of host publication | 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021 |
| Publisher | Institute of Electrical and Electronics Engineers Inc. |
| Pages | 6-11 |
| Number of pages | 6 |
| ISBN (electronic) | 978-1-6654-1345-9 |
| ISBN (print) | 978-1-6654-1346-6 |
| Publication status | Published - 2021 |
| Event | 6th International Conference on Frontiers of Signal Processing, ICFSP 2021, Paris, France, 9-11 Sept 2021 |
Abstract
Automatic speech command recognition systems have become a common technology in the day-to-day lives of many people. Smart devices usually offer some ability to understand spoken commands of varying complexity. Many such speech recognition systems apply a signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is to transform the audio waveform into a spectrogram and subsequently compute the mel-spectrogram or mel-frequency cepstral coefficients. However, time-frequency distributions (TFDs) superior to the spectrogram have been proposed in the past. This work investigates the usefulness of various TFDs for automatic speech command recognition using convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024, achieving a mean accuracy of 93.1%. However, a mean accuracy of 95.56% was achieved through TFD mixing. Mixing the TFDs thus increased the mean accuracy by up to 2.46% with respect to the individual TFDs.
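The abstract describes the standard front-end pipeline: waveform, then spectrogram, then mel-spectrogram or MFCCs. The sketch below illustrates that pipeline using librosa; it is not the authors' implementation, and apart from the 1024-sample window mentioned in the abstract, all parameter values (sampling rate, hop length, number of mel bands and coefficients) are assumptions chosen for illustration.

```python
# Illustrative sketch only (assumed parameters, not the paper's setup):
# waveform -> power spectrogram (STFT, 1024-sample window) -> log mel-spectrogram -> MFCCs.
import numpy as np
import librosa


def time_frequency_features(path, sr=16000, n_fft=1024, n_mels=64, n_mfcc=13):
    """Compute three common time-frequency representations of one audio file."""
    # Load and resample the waveform to a fixed sampling rate.
    y, sr = librosa.load(path, sr=sr)

    # Power spectrogram from the short-time Fourier transform,
    # using the 1024-sample window reported as best in the abstract.
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 2)) ** 2

    # Mel-spectrogram: project the power spectrogram onto a mel filter bank,
    # then convert to decibels.
    log_mel = librosa.power_to_db(
        librosa.feature.melspectrogram(S=spec, sr=sr, n_mels=n_mels)
    )

    # MFCCs: discrete cosine transform of the log mel-spectrogram.
    mfcc = librosa.feature.mfcc(S=log_mel, sr=sr, n_mfcc=n_mfcc)

    return spec, log_mel, mfcc
```

Any of these two-dimensional representations, or other TFDs such as the S-transform or Wigner-Ville distribution, can then be fed to a CNN as image-like input; how the paper combines ("mixes") several TFDs is not specified in the abstract.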
Keywords
- automatic speech recognition, convolutional neural networks, S-transform, time-frequency distribution, Wigner-Ville distribution
ASJC Scopus subject areas
- Computer Science (all)
- Computer Networks and Communications
- Signal Processing
Cite this
Hinrichs, R, Dunkel, J & Ostermann, J 2021, Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. in 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., pp. 6-11. https://doi.org/10.1109/ICFSP53514.2021.9646416
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks
AU - Hinrichs, Reemt
AU - Dunkel, Jonas
AU - Ostermann, Jorn
PY - 2021
Y1 - 2021
AB - Automatic speech command recognition systems have become a common technology in the day-to-day lives of many people. Smart devices usually offer some ability to understand spoken commands of varying complexity. Many such speech recognition systems apply a signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is to transform the audio waveform into a spectrogram and subsequently compute the mel-spectrogram or mel-frequency cepstral coefficients. However, time-frequency distributions (TFDs) superior to the spectrogram have been proposed in the past. This work investigates the usefulness of various TFDs for automatic speech command recognition using convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024, achieving a mean accuracy of 93.1%. However, a mean accuracy of 95.56% was achieved through TFD mixing. Mixing the TFDs thus increased the mean accuracy by up to 2.46% with respect to the individual TFDs.
KW - automatic speech recognition
KW - convolutional neural networks
KW - s-transform
KW - time-frequency distribution
KW - wigner-ville distribution
UR - http://www.scopus.com/inward/record.url?scp=85124146781&partnerID=8YFLogxK
U2 - 10.1109/ICFSP53514.2021.9646416
DO - 10.1109/ICFSP53514.2021.9646416
M3 - Conference contribution
AN - SCOPUS:85124146781
SN - 978-1-6654-1346-6
SP - 6
EP - 11
BT - 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th International Conference on Frontiers of Signal Processing, ICFSP 2021
Y2 - 9 September 2021 through 11 September 2021
ER -