Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors
Reemt Hinrichs, Jonas Dunkel, Jorn Ostermann


Details

Original language: English
Title of host publication: 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6-11
Number of pages: 6
ISBN (electronic): 9781665413459
ISBN (print): 978-1-6654-1346-6
Publication status: Published - 2021
Event: 6th International Conference on Frontiers of Signal Processing, ICFSP 2021, Paris, France
Duration: 9 Sept 2021 - 11 Sept 2021

Abstract

Automatic speech command recognition systems have become a common technology in the day-to-day lives of many people. Smart devices typically offer some ability to understand spoken commands of varying complexity. Many such speech recognition systems apply a signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is to transform the audio waveform into a spectrogram and subsequently compute mel-spectrograms or mel-frequency cepstral coefficients. However, time-frequency distributions (TFDs) that improve on the spectrogram have been proposed in the past. This work investigates the usefulness of various TFDs in automatic speech recognition algorithms based on convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024, achieving a mean accuracy of 93.1%. However, a mean accuracy of 95.56% was achieved through TFD mixing, which increased the mean accuracy by up to 2.46% with respect to the individual TFDs.
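For illustration only (this sketch is not taken from the paper): one common way to realise TFD mixing is to compute several time-frequency representations of the same clip and stack them as input channels of a single CNN input tensor. The snippet below assumes Python with the librosa and numpy packages and, as a simplification, mixes two mel-spectrograms computed with different window sizes; the paper itself also considers other TFDs such as the S-transform and the Wigner-Ville distribution.

    # Illustrative sketch, not the authors' implementation: stack two
    # time-frequency representations of a 1 s, 16 kHz speech command
    # as channels of a single CNN input. Assumes librosa and numpy.
    import numpy as np
    import librosa

    def mixed_tfd_features(y, sr=16000, hop=160, n_mels=64):
        # Mel-spectrograms with different window sizes but identical
        # hop length and mel-band count have identical shapes, so they
        # can be stacked along a new channel axis.
        tfds = []
        for n_fft in (512, 1024):
            S = librosa.feature.melspectrogram(
                y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
            tfds.append(librosa.power_to_db(S, ref=np.max))
        return np.stack(tfds, axis=0)  # shape: (2, n_mels, n_frames)

    if __name__ == "__main__":
        y = np.random.randn(16000).astype(np.float32)  # placeholder waveform
        x = mixed_tfd_features(y)
        print(x.shape)  # e.g. (2, 64, 101)

The resulting multi-channel array can be fed to a standard image-style CNN; in the paper, such mixing of TFDs is what raises the mean accuracy from 93.1% to 95.56%.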

Keywords

    automatic speech recognition, convolutional neural networks, s-transform, time-frequency distribution, wigner-ville distribution

Cite this

Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. / Hinrichs, Reemt; Dunkel, Jonas; Ostermann, Jorn.
2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., 2021. p. 6-11.


Hinrichs, R, Dunkel, J & Ostermann, J 2021, Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. in 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., pp. 6-11, 6th International Conference on Frontiers of Signal Processing, ICFSP 2021, Paris, France, 9 Sept 2021. https://doi.org/10.1109/ICFSP53514.2021.9646416
Hinrichs, R., Dunkel, J., & Ostermann, J. (2021). Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. In 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021 (pp. 6-11). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICFSP53514.2021.9646416
Hinrichs R, Dunkel J, Ostermann J. Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. In: 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc.; 2021. p. 6-11. doi: 10.1109/ICFSP53514.2021.9646416
Hinrichs, Reemt ; Dunkel, Jonas ; Ostermann, Jorn. / Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., 2021. pp. 6-11
