Multimodal video concept detection via bag of auditory words and multiple kernel learning

Publication: Contribution to book/report/anthology/conference proceedings, conference paper, research, peer-reviewed

Authors

  • Markus Mühling
  • Ralph Ewerth
  • Jun Zhou
  • Bernd Freisleben

External organisations

  • Philipps-Universität Marburg

Details

Original language: English
Title of host publication: Advances in Multimedia Modeling
Subtitle: 18th International Conference, MMM 2012, Proceedings
Pages: 40-50
Number of pages: 11
Publication status: Published - 2012
Published externally: Yes
Event: 18th International Conference on Multimedia Modeling, MMM 2012 - Klagenfurt, Austria
Duration: 4 Jan. 2012 to 6 Jan. 2012

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7131 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Abstract

State-of-the-art systems for video concept detection mainly rely on visual features. Some previous approaches have also included audio features, either using low-level features such as mel-frequency cepstral coefficients (MFCC) or exploiting the detection of specific audio concepts. In this paper, we investigate a bag of auditory words (BoAW) approach that models MFCC features in an auditory vocabulary. The resulting BoAW features are combined with state-of-the-art visual features via multiple kernel learning (MKL). Experiments on a large set of 101 video concepts from the MediaMill Challenge show the effectiveness of using BoAW features: The system using BoAW features and a support vector machine with a χ²-kernel is superior to a state-of-the-art audio approach relying on probabilistic latent semantic indexing. Furthermore, it is shown that an early fusion approach degrades detection performance, whereas the combination of auditory and visual bag of words features via MKL yields a relative performance improvement of 9%.
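The pipeline named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the vocabulary size, the clustering method, the exact χ²-kernel form, or the MKL solver, so everything below is an assumption. MFCC frames are replaced by synthetic random vectors, the vocabulary is built with a plain k-means, the kernel uses one common exponential χ² form, and the hypothetical combine_kernels helper uses fixed weights where real MKL would learn them jointly with the classifier.

```python
import numpy as np


def kmeans(frames, k, iters=20, seed=0):
    """Plain Lloyd's k-means; the k centroids act as the 'auditory words'."""
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance of every frame to every centroid: (n, k).
        d = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = frames[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def boaw_histogram(frames, centroids):
    """Assign each frame to its nearest word and return a normalized histogram."""
    d = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d.argmin(axis=1), minlength=len(centroids))
    return counts / counts.sum()


def chi2_kernel(A, B, gamma=1.0):
    """One common chi-square kernel: exp(-gamma * sum_i (a_i-b_i)^2 / (a_i+b_i))."""
    diff = A[:, None, :] - B[None, :, :]
    total = A[:, None, :] + B[None, :, :]
    # Bins that are empty in both histograms contribute zero.
    ratio = np.divide(diff ** 2, total, out=np.zeros_like(diff), where=total > 0)
    return np.exp(-gamma * ratio.sum(axis=2))


def combine_kernels(kernels, weights):
    """Convex combination of kernel matrices (fixed weights, not learned MKL)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * K for w, K in zip(weights, kernels))


# Synthetic stand-ins: 6 video shots, each with 200 frames of 13-dim "MFCCs".
rng = np.random.default_rng(0)
shots = [rng.normal(size=(200, 13)) for _ in range(6)]
vocab = kmeans(np.vstack(shots), k=8)
audio = np.array([boaw_histogram(s, vocab) for s in shots])
# Hypothetical visual bag-of-words histograms, also synthetic.
visual = rng.dirichlet(np.ones(16), size=6)
K = combine_kernels(
    [chi2_kernel(audio, audio), chi2_kernel(visual, visual)], [0.3, 0.7]
)
```

The combined matrix K could be handed to any SVM that accepts a precomputed kernel. Early fusion, by contrast, would concatenate the audio and visual histograms into one vector before computing a single kernel.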

Cite this

Multimodal video concept detection via bag of auditory words and multiple kernel learning. / Mühling, Markus; Ewerth, Ralph; Zhou, Jun et al.
Advances in Multimedia Modeling : 18th International Conference, MMM 2012, Proceedings. 2012. pp. 40-50 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7131 LNCS).

Mühling, M, Ewerth, R, Zhou, J & Freisleben, B 2012, Multimodal video concept detection via bag of auditory words and multiple kernel learning. in Advances in Multimedia Modeling : 18th International Conference, MMM 2012, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7131 LNCS, pp. 40-50, 18th International Conference on Multimedia Modeling, MMM 2012, Klagenfurt, Austria, 4 Jan. 2012. https://doi.org/10.1007/978-3-642-27355-1_7
Mühling, M., Ewerth, R., Zhou, J., & Freisleben, B. (2012). Multimodal video concept detection via bag of auditory words and multiple kernel learning. In Advances in Multimedia Modeling : 18th International Conference, MMM 2012, Proceedings (pp. 40-50). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7131 LNCS). https://doi.org/10.1007/978-3-642-27355-1_7
Mühling M, Ewerth R, Zhou J, Freisleben B. Multimodal video concept detection via bag of auditory words and multiple kernel learning. in Advances in Multimedia Modeling : 18th International Conference, MMM 2012, Proceedings. 2012. pp. 40-50. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-642-27355-1_7
Mühling, Markus ; Ewerth, Ralph ; Zhou, Jun et al. / Multimodal video concept detection via bag of auditory words and multiple kernel learning. Advances in Multimedia Modeling : 18th International Conference, MMM 2012, Proceedings. 2012. pp. 40-50 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{28426ce07e4c4b248684164fa8a6c04a,
title = "Multimodal video concept detection via bag of auditory words and multiple kernel learning",
abstract = "State-of-the-art systems for video concept detection mainly rely on visual features. Some previous approaches have also included audio features, either using low-level features such as mel-frequency cepstral coefficients (MFCC) or exploiting the detection of specific audio concepts. In this paper, we investigate a bag of auditory words (BoAW) approach that models MFCC features in an auditory vocabulary. The resulting BoAW features are combined with state-of-the-art visual features via multiple kernel learning (MKL). Experiments on a large set of 101 video concepts from the MediaMill Challenge show the effectiveness of using BoAW features: The system using BoAW features and a support vector machine with a $\chi^2$-kernel is superior to a state-of-the-art audio approach relying on probabilistic latent semantic indexing. Furthermore, it is shown that an early fusion approach degrades detection performance, whereas the combination of auditory and visual bag of words features via MKL yields a relative performance improvement of 9%.",
keywords = "audio codebook, bag of auditory words, bag of words, multiple kernel learning, video retrieval, Visual concept detection",
author = "Markus M{\"u}hling and Ralph Ewerth and Jun Zhou and Bernd Freisleben",
year = "2012",
doi = "10.1007/978-3-642-27355-1_7",
language = "English",
isbn = "9783642273544",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "40--50",
booktitle = "Advances in Multimedia Modeling",
note = "18th International Conference on Multimedia Modeling, MMM 2012 ; Conference date: 04-01-2012 Through 06-01-2012",

}

TY - GEN
T1 - Multimodal video concept detection via bag of auditory words and multiple kernel learning
AU - Mühling, Markus
AU - Ewerth, Ralph
AU - Zhou, Jun
AU - Freisleben, Bernd
PY - 2012
Y1 - 2012
AB - State-of-the-art systems for video concept detection mainly rely on visual features. Some previous approaches have also included audio features, either using low-level features such as mel-frequency cepstral coefficients (MFCC) or exploiting the detection of specific audio concepts. In this paper, we investigate a bag of auditory words (BoAW) approach that models MFCC features in an auditory vocabulary. The resulting BoAW features are combined with state-of-the-art visual features via multiple kernel learning (MKL). Experiments on a large set of 101 video concepts from the MediaMill Challenge show the effectiveness of using BoAW features: The system using BoAW features and a support vector machine with a χ²-kernel is superior to a state-of-the-art audio approach relying on probabilistic latent semantic indexing. Furthermore, it is shown that an early fusion approach degrades detection performance, whereas the combination of auditory and visual bag of words features via MKL yields a relative performance improvement of 9%.
KW - audio codebook
KW - bag of auditory words
KW - bag of words
KW - multiple kernel learning
KW - video retrieval
KW - Visual concept detection
UR - http://www.scopus.com/inward/record.url?scp=84862949691&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-27355-1_7
DO - 10.1007/978-3-642-27355-1_7
M3 - Conference contribution
AN - SCOPUS:84862949691
SN - 9783642273544
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 40
EP - 50
BT - Advances in Multimedia Modeling
T2 - 18th International Conference on Multimedia Modeling, MMM 2012
Y2 - 4 January 2012 through 6 January 2012
ER -