Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks

Publication: Contribution to book/report/anthology/conference proceedings, conference paper, research, peer-reviewed

Authors

Reemt Hinrichs, Jonas Dunkel, Jorn Ostermann

Details

Original language: English
Title of host publication: 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6-11
Number of pages: 6
ISBN (electronic): 9781665413459
ISBN (print): 978-1-6654-1346-6
Publication status: Published - 2021
Event: 6th International Conference on Frontiers of Signal Processing, ICFSP 2021 - Paris, France
Duration: 9 Sept. 2021 to 11 Sept. 2021

Abstract

Automatic speech command recognition systems have become a common part of day-to-day life for many people. Smart devices usually offer some ability to understand more or less complex spoken commands. Many such speech recognition systems apply a signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is to transform the audio waveforms into spectrograms and then compute mel-spectrograms or mel-frequency cepstral coefficients. However, time-frequency distributions (TFDs) that aim to improve on the spectrogram have been proposed in the past. This work investigates the usefulness of various TFDs for automatic speech recognition algorithms using convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024, achieving a mean accuracy of 93.1 %. However, a mean accuracy of 95.56 % was achieved through TFD mixing. Mixing of the TFDs thereby increased the mean accuracy by up to 2.46 percentage points with respect to the individual TFDs.
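
As a rough illustration of the pipeline the abstract describes, the following Python sketch computes two log-power spectrograms of one clip and stacks them as input channels for a CNN. It is a minimal sketch under stated assumptions, not the authors' method: the abstract does not say how the TFDs were mixed, so channel stacking is assumed here; a second spectrogram with a shorter window stands in for the other TFDs; and the hop length, FFT size, 16 kHz sampling rate, test tone, and the names `spectrogram` and `mix_tfds` are illustrative choices.

```python
import numpy as np


def spectrogram(x, win_len, hop=128, n_fft=1024):
    """Log-power spectrogram from a Hann-windowed STFT.

    Returns an array of shape (n_fft // 2 + 1, n_frames).
    hop and n_fft are illustrative defaults, not values from the paper.
    """
    window = np.hanning(win_len)
    # Reflect-pad by half a window so frames are centred and the frame
    # count (1 + len(x) // hop for even win_len) is independent of win_len.
    padded = np.pad(x, win_len // 2, mode="reflect")
    n_frames = 1 + (len(padded) - win_len) // hop
    frames = np.stack(
        [padded[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    # Zero-pad every frame to n_fft so all window sizes share one frequency grid.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power.T + 1e-10)


def mix_tfds(x, win_lens=(1024, 256)):
    """Stack several time-frequency representations as channels.

    Assumption: "mixing" is modelled as channel stacking; the source
    abstract does not specify the actual mixing scheme.
    """
    return np.stack([spectrogram(x, w) for w in win_lens], axis=0)


# Synthetic one-second clip: a 1 kHz tone sampled at 16 kHz (assumed rate).
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)

mixed = mix_tfds(x)
print(mixed.shape)  # (channels, frequency bins, time frames)
```

For this one-second clip the result has shape (2, 513, 126), that is, two channels that a 2-D convolutional network could consume like a two-channel image.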

Cite

Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. / Hinrichs, Reemt; Dunkel, Jonas; Ostermann, Jorn.
2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., 2021. pp. 6-11.

Hinrichs, R, Dunkel, J & Ostermann, J 2021, Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. in 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., pp. 6-11, 6th International Conference on Frontiers of Signal Processing, ICFSP 2021, Paris, France, 9 Sept. 2021. https://doi.org/10.1109/ICFSP53514.2021.9646416
Hinrichs, R., Dunkel, J., & Ostermann, J. (2021). Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. In 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021 (pp. 6-11). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICFSP53514.2021.9646416
Hinrichs R, Dunkel J, Ostermann J. Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. in 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc. 2021. pp. 6-11. doi: 10.1109/ICFSP53514.2021.9646416
Hinrichs, Reemt ; Dunkel, Jonas ; Ostermann, Jorn. / Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., 2021. pp. 6-11
BibTeX
@inproceedings{424af9b4d88341db881d4cd84108557e,
title = "Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks",
abstract = "Automatic speech command recognition systems have become a common technology of the day to day life for many people. Smart devices usually offer some ability to understand more or less complex spoken commands. Many such speech recognition systems use some form of signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is the transformation of the audio waveforms into spectrograms with subsequent computation of the mel-spectrograms or mel-frequency cepstral coefficients. However, superior time-frequency distributions (TFDs) have been proposed in the past to improve on the spectrogram. This work investigates the usefulness of various TFDs for use in automatic speech recognition algorithms using convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024 achieving a mean accuracy of 93.1\%. However, a mean accuracy of 95.56 \% was achieved through TFD mixing. Mixing of the TFDs thereby increased the mean accuracy by up to 2.46 \% with respect to the individual TFDs.",
keywords = "automatic speech recognition, convolutional neural networks, s-transform, time-frequency distribution, wigner-ville distribution",
author = "Reemt Hinrichs and Jonas Dunkel and Jorn Ostermann",
year = "2021",
doi = "10.1109/ICFSP53514.2021.9646416",
language = "English",
isbn = "978-1-6654-1346-6",
pages = "6--11",
booktitle = "2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
address = "United States",
note = "6th International Conference on Frontiers of Signal Processing, ICFSP 2021 ; Conference date: 9 to 11 September 2021",
}

RIS

TY - GEN
T1 - Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks
AU - Hinrichs, Reemt
AU - Dunkel, Jonas
AU - Ostermann, Jorn
PY - 2021
Y1 - 2021
AB - Automatic speech command recognition systems have become a common technology of the day to day life for many people. Smart devices usually offer some ability to understand more or less complex spoken commands. Many such speech recognition systems use some form of signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is the transformation of the audio waveforms into spectrograms with subsequent computation of the mel-spectrograms or mel-frequency cepstral coefficients. However, superior time-frequency distributions (TFDs) have been proposed in the past to improve on the spectrogram. This work investigates the usefulness of various TFDs for use in automatic speech recognition algorithms using convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024 achieving a mean accuracy of 93.1%. However, a mean accuracy of 95.56 % was achieved through TFD mixing. Mixing of the TFDs thereby increased the mean accuracy by up to 2.46 % with respect to the individual TFDs.

KW - automatic speech recognition
KW - convolutional neural networks
KW - s-transform
KW - time-frequency distribution
KW - wigner-ville distribution
UR - http://www.scopus.com/inward/record.url?scp=85124146781&partnerID=8YFLogxK
U2 - 10.1109/ICFSP53514.2021.9646416
DO - 10.1109/ICFSP53514.2021.9646416
M3 - Conference contribution
AN - SCOPUS:85124146781
SN - 978-1-6654-1346-6
SP - 6
EP - 11
BT - 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th International Conference on Frontiers of Signal Processing, ICFSP 2021
Y2 - 9 September 2021 through 11 September 2021
ER -
