Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks

Reemt Hinrichs; Jonas Dunkel; Jorn Ostermann

doi:10.1109/ICFSP53514.2021.9646416

Details

Original language	English
Title of host publication	2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	6-11
Number of pages	6
ISBN (electronic)	9781665413459
ISBN (print)	978-1-6654-1346-6
Publication status	Published - 2021
Event	6th International Conference on Frontiers of Signal Processing, ICFSP 2021 - Paris, France. Duration: 9 Sept. 2021 → 11 Sept. 2021

Abstract

Automatic speech command recognition systems have become a common technology of the day to day life for many people. Smart devices usually offer some ability to understand more or less complex spoken commands. Many such speech recognition systems use some form of signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is the transformation of the audio waveforms into spectrograms with subsequent computation of the mel-spectrograms or mel-frequency cepstral coefficients. However, superior time-frequency distributions (TFDs) have been proposed in the past to improve on the spectrogram. This work investigates the usefulness of various TFDs for use in automatic speech recognition algorithms using convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024 achieving a mean accuracy of 93.1%. However, a mean accuracy of 95.56 % was achieved through TFD mixing. Mixing of the TFDs thereby increased the mean accuracy by up to 2.46 % with respect to the individual TFDs.
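As a rough illustration of the spectrogram front end mentioned in the abstract, the sketch below computes a power spectrogram with a window size of 1024 samples using NumPy only. The Hann window, the hop size of 256, the 16 kHz sample rate, and the synthetic test tone are all assumptions made for the example; the abstract does not state them, and this is not the authors' implementation or their mixing scheme.

```python
import numpy as np


def spectrogram(x, win_size=1024, hop=256):
    """Power spectrogram from a Hann-windowed short-time Fourier transform.

    win_size=1024 matches the window size named in the abstract;
    hop=256 and the Hann window are assumptions for this sketch.
    """
    window = np.hanning(win_size)
    n_frames = 1 + (len(x) - win_size) // hop
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [x[i * hop : i * hop + win_size] * window for i in range(n_frames)]
    )
    # One-sided FFT per frame; squared magnitude gives the power spectrum.
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


fs = 16000                          # assumed sample rate in Hz
t = np.arange(fs) / fs              # one second of samples
x = np.sin(2 * np.pi * 440.0 * t)   # synthetic 440 Hz tone, stand-in for speech

spec = spectrogram(x)               # shape: (n_frames, win_size // 2 + 1)
peak_bin = int(np.argmax(spec.mean(axis=0)))
peak_hz = peak_bin * fs / 1024      # centre frequency of the strongest bin
print(spec.shape, peak_hz)
```

Each row of `spec` is one time frame and each column one frequency bin, which is the two-dimensional layout a convolutional network would typically consume as an image-like input.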

ASJC Scopus subject areas

General Computer Science
Computer Networks and Communications
Signal Processing

Cite this

Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. / Hinrichs, Reemt; Dunkel, Jonas; Ostermann, Jorn.
2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., 2021. pp. 6-11.

Publication: Chapter in book/report/conference proceedings › Conference paper › Research › Peer-reviewed

Hinrichs, R, Dunkel, J & Ostermann, J 2021, Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. in 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., pp. 6-11, 6th International Conference on Frontiers of Signal Processing, ICFSP 2021, Paris, France, 9 Sept. 2021. https://doi.org/10.1109/ICFSP53514.2021.9646416

Hinrichs, R., Dunkel, J., & Ostermann, J. (2021). Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. In 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021 (pp. 6-11). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICFSP53514.2021.9646416

Hinrichs R, Dunkel J, Ostermann J. Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. In 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc. 2021. p. 6-11. doi: 10.1109/ICFSP53514.2021.9646416

Hinrichs, Reemt ; Dunkel, Jonas ; Ostermann, Jorn. / Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., 2021. pp. 6-11

@inproceedings{424af9b4d88341db881d4cd84108557e,

title = "Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks",

abstract = "Automatic speech command recognition systems have become a common technology of the day to day life for many people. Smart devices usually offer some ability to understand more or less complex spoken commands. Many such speech recognition systems use some form of signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is the transformation of the audio waveforms into spectrograms with subsequent computation of the mel-spectrograms or mel-frequency cepstral coefficients. However, superior time-frequency distributions (TFDs) have been proposed in the past to improve on the spectrogram. This work investigates the usefulness of various TFDs for use in automatic speech recognition algorithms using convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024 achieving a mean accuracy of 93.1%. However, a mean accuracy of 95.56 % was achieved through TFD mixing. Mixing of the TFDs thereby increased the mean accuracy by up to 2.46 % with respect to the individual TFDs.",

keywords = "automatic speech recognition, convolutional neural networks, s-transform, time-frequency distribution, wigner-ville distribution",

author = "Reemt Hinrichs and Jonas Dunkel and Jorn Ostermann",

year = "2021",

doi = "10.1109/ICFSP53514.2021.9646416",

language = "English",

isbn = "978-1-6654-1346-6",

pages = "6--11",

booktitle = "2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

address = "United States",

note = "6th International Conference on Frontiers of Signal Processing, ICFSP 2021 ; Conference date: 09-09-2021 Through 11-09-2021",

}

TY - GEN

T1 - Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks

AU - Hinrichs, Reemt

AU - Dunkel, Jonas

AU - Ostermann, Jorn

PY - 2021

Y1 - 2021

N2 - Automatic speech command recognition systems have become a common technology of the day to day life for many people. Smart devices usually offer some ability to understand more or less complex spoken commands. Many such speech recognition systems use some form of signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is the transformation of the audio waveforms into spectrograms with subsequent computation of the mel-spectrograms or mel-frequency cepstral coefficients. However, superior time-frequency distributions (TFDs) have been proposed in the past to improve on the spectrogram. This work investigates the usefulness of various TFDs for use in automatic speech recognition algorithms using convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024 achieving a mean accuracy of 93.1%. However, a mean accuracy of 95.56 % was achieved through TFD mixing. Mixing of the TFDs thereby increased the mean accuracy by up to 2.46 % with respect to the individual TFDs.

KW - automatic speech recognition

KW - convolutional neural networks

KW - s-transform

KW - time-frequency distribution

KW - wigner-ville distribution

UR - http://www.scopus.com/inward/record.url?scp=85124146781&partnerID=8YFLogxK

U2 - 10.1109/ICFSP53514.2021.9646416

DO - 10.1109/ICFSP53514.2021.9646416

M3 - Conference contribution

AN - SCOPUS:85124146781

SN - 978-1-6654-1346-6

SP - 6

EP - 11

BT - 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 6th International Conference on Frontiers of Signal Processing, ICFSP 2021

Y2 - 9 September 2021 through 11 September 2021

ER -
