Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors
Reemt Hinrichs, Jonas Dunkel, Jorn Ostermann


Details

Original language: English
Title of host publication: 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6-11
Number of pages: 6
ISBN (electronic): 9781665413459
ISBN (print): 978-1-6654-1346-6
Publication status: Published - 2021
Event: 6th International Conference on Frontiers of Signal Processing, ICFSP 2021, Paris, France
Duration: 9 Sept 2021 - 11 Sept 2021

Abstract

Automatic speech command recognition systems have become a common technology in the day-to-day lives of many people. Smart devices typically offer some ability to understand spoken commands of varying complexity. Many such speech recognition systems apply a signal transformation as one of the first steps of the processing chain to obtain a time-frequency representation. A common approach is to transform the audio waveform into a spectrogram and subsequently compute mel-spectrograms or mel-frequency cepstral coefficients. However, time-frequency distributions (TFDs) that improve on the spectrogram have been proposed in the past. This work investigates the usefulness of various TFDs in automatic speech recognition algorithms based on convolutional neural networks. On the Google Speech Command Dataset V1, the best single TFD was found to be the spectrogram with a window size of 1024, achieving a mean accuracy of 93.1%. However, a mean accuracy of 95.56% was achieved through TFD mixing, which increased the mean accuracy by up to 2.46% with respect to the individual TFDs.
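For illustration only (this sketch is not taken from the paper): one common way to realise TFD mixing is to compute several time-frequency representations of the same clip and stack them as input channels of a single CNN input tensor. The snippet below assumes Python with the librosa and numpy packages and, as a simplification, mixes two mel-spectrograms computed with different window sizes; the paper itself also considers other TFDs such as the S-transform and the Wigner-Ville distribution.

    # Illustrative sketch, not the authors' implementation: stack two
    # time-frequency representations of a 1 s, 16 kHz speech command
    # as channels of a single CNN input. Assumes librosa and numpy.
    import numpy as np
    import librosa

    def mixed_tfd_features(y, sr=16000, hop=160, n_mels=64):
        # Mel-spectrograms with different window sizes but identical
        # hop length and mel-band count have identical shapes, so they
        # can be stacked along a new channel axis.
        tfds = []
        for n_fft in (512, 1024):
            S = librosa.feature.melspectrogram(
                y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
            tfds.append(librosa.power_to_db(S, ref=np.max))
        return np.stack(tfds, axis=0)  # shape: (2, n_mels, n_frames)

    if __name__ == "__main__":
        y = np.random.randn(16000).astype(np.float32)  # placeholder waveform
        x = mixed_tfd_features(y)
        print(x.shape)  # e.g. (2, 64, 101)

The resulting multi-channel array can be fed to a standard image-style CNN; in the paper, such mixing of TFDs is what raises the mean accuracy from 93.1% to 95.56%.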

Keywords

    automatic speech recognition, convolutional neural networks, s-transform, time-frequency distribution, wigner-ville distribution

Cite this

Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. / Hinrichs, Reemt; Dunkel, Jonas; Ostermann, Jorn.
2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., 2021. p. 6-11.


Hinrichs, R, Dunkel, J & Ostermann, J 2021, Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. in 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., pp. 6-11, 6th International Conference on Frontiers of Signal Processing, ICFSP 2021, Paris, France, 9 Sept 2021. https://doi.org/10.1109/ICFSP53514.2021.9646416
Hinrichs, R., Dunkel, J., & Ostermann, J. (2021). Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. In 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021 (pp. 6-11). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICFSP53514.2021.9646416
Hinrichs R, Dunkel J, Ostermann J. Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. In: 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc.; 2021. p. 6-11. doi: 10.1109/ICFSP53514.2021.9646416
Hinrichs, Reemt ; Dunkel, Jonas ; Ostermann, Jorn. / Mixing Time-Frequency Distributions for Speech Command Recognition Using Convolutional Neural Networks. 2021 6th International Conference on Frontiers of Signal Processing, ICFSP 2021. Institute of Electrical and Electronics Engineers Inc., 2021. pp. 6-11
