Estimating the information gap between textual and visual representations

Research output: Contribution to journal › Article › Research › peer review

Authors

  • Christian Henning
  • Ralph Ewerth

External Research Organisations

  • ETH Zurich
  • German National Library of Science and Technology (TIB)

Details

Original language: English
Pages (from-to): 43-56
Number of pages: 14
Journal: International Journal of Multimedia Information Retrieval
Volume: 7
Issue number: 1
Publication status: Published - 1 Mar 2018

Abstract

To convey a complex matter, it is often beneficial to leverage two or more modalities. For example, slides are used to supplement an oral presentation, and photographs, drawings, figures, etc. are used in online news or scientific publications to complement textual information. However, the ways in which different modalities are used, and their interrelations, can be quite diverse. Sometimes, the transfer of information or knowledge may even not be eased, for instance, in the case of contradictory information. The variety of possible interrelations between textual and graphical information, and the question of how they can be described and automatically estimated, have not been addressed by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe two different dimensions of cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, two novel deep learning systems are suggested to estimate CMI and SC of textual and visual information. The first deep neural network consists of an autoencoder that maps images and texts onto a multimodal embedding space. This representation is then exploited to train classifiers for SC and CMI. An advantage of this representation is that only a small set of labeled training examples is required for the supervised learning process. Third, three different and large datasets are combined for autoencoder training to increase the diversity of (unlabeled) image–text pairs such that they properly capture the broad range of possible interrelations. Fourth, experimental results are reported for a challenging dataset. Finally, we discuss several applications for the proposed system and outline areas for future work.
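
The abstract outlines a two-stage pipeline: a multimodal autoencoder is first trained on unlabeled image–text pairs to learn a joint embedding, and small classifiers for CMI and SC are then trained on top of that embedding with only a few labeled examples. Below is a minimal PyTorch sketch of this idea; it is not the authors' implementation, and the input feature dimensions, layer sizes, and number of CMI/SC classes are illustrative assumptions.

import torch
import torch.nn as nn

class MultimodalAutoencoder(nn.Module):
    """Maps image and text features into a shared embedding and
    reconstructs both modalities from it (unsupervised stage)."""
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=256):
        super().__init__()
        # Modality-specific encoders into a shared embedding space.
        self.img_enc = nn.Sequential(nn.Linear(img_dim, 512), nn.ReLU(),
                                     nn.Linear(512, emb_dim))
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, 512), nn.ReLU(),
                                     nn.Linear(512, emb_dim))
        # Decoders reconstruct each modality from the joint embedding.
        self.img_dec = nn.Linear(emb_dim, img_dim)
        self.txt_dec = nn.Linear(emb_dim, txt_dim)

    def forward(self, img, txt):
        z = self.img_enc(img) + self.txt_enc(txt)  # joint embedding
        return z, self.img_dec(z), self.txt_dec(z)

def reconstruction_loss(model, img, txt):
    # Unsupervised objective on unlabeled image-text pairs.
    z, img_rec, txt_rec = model(img, txt)
    mse = nn.functional.mse_loss
    return mse(img_rec, img) + mse(txt_rec, txt)

class RelationHead(nn.Module):
    """Small supervised classifier on the (frozen) embedding;
    one head for CMI levels, another for SC levels."""
    def __init__(self, emb_dim=256, n_classes=3):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(emb_dim, 64), nn.ReLU(),
                                nn.Linear(64, n_classes))

    def forward(self, z):
        return self.fc(z)

if __name__ == "__main__":
    ae = MultimodalAutoencoder()
    img = torch.randn(8, 2048)   # dummy pre-extracted image features
    txt = torch.randn(8, 300)    # dummy averaged word embeddings
    loss = reconstruction_loss(ae, img, txt)
    z, _, _ = ae(img, txt)
    cmi_logits = RelationHead(n_classes=5)(z.detach())  # hypothetical CMI levels
    sc_logits = RelationHead(n_classes=3)(z.detach())   # hypothetical SC levels
    print(loss.item(), cmi_logits.shape, sc_logits.shape)

Only the two-stage structure (unsupervised embedding learning followed by lightweight supervised heads) reflects the abstract; the actual encoders, training datasets, and label definitions are described in the paper itself.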

Keywords

    Deep learning, Multimodal embeddings, Text–image relations, Visual/verbal divide

Cite this

Estimating the information gap between textual and visual representations. / Henning, Christian; Ewerth, Ralph.
In: International Journal of Multimedia Information Retrieval, Vol. 7, No. 1, 01.03.2018, p. 43-56.

Research output: Contribution to journal › Article › Research › peer review

Henning, C & Ewerth, R 2018, 'Estimating the information gap between textual and visual representations', International Journal of Multimedia Information Retrieval, vol. 7, no. 1, pp. 43-56. https://doi.org/10.1007/s13735-017-0142-y
Henning, C., & Ewerth, R. (2018). Estimating the information gap between textual and visual representations. International Journal of Multimedia Information Retrieval, 7(1), 43-56. https://doi.org/10.1007/s13735-017-0142-y
Henning C, Ewerth R. Estimating the information gap between textual and visual representations. International Journal of Multimedia Information Retrieval. 2018 Mar 1;7(1):43-56. doi: 10.1007/s13735-017-0142-y
Henning, Christian ; Ewerth, Ralph. / Estimating the information gap between textual and visual representations. In: International Journal of Multimedia Information Retrieval. 2018 ; Vol. 7, No. 1. pp. 43-56.
BibTeX
@article{db12f9e94d21435dbffcb352d45c2a40,
title = "Estimating the information gap between textual and visual representations",
abstract = "To convey a complex matter, it is often beneficial to leverage two or more modalities. For example, slides are utilized to supplement an oral presentation, or photographs, drawings, figures, etc. are exploited in online news or scientific publications to complement textual information. However, the utilization of different modalities and their interrelations can be quite diverse. Sometimes, the transfer of information or knowledge may even be not eased, for instance, in case of contradictory information. The variety of possible interrelations of textual and graphical information and the question, how they can be described and automatically estimated have not been addressed yet by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe two different dimensions of cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, two novel deep learning systems are suggested to estimate CMI and SC of textual and visual information. The first deep neural network consists of an autoencoder that maps images and texts onto a multimodal embedding space. This representation is then exploited in order to train classifiers for SC and CMI. An advantage of this representation is that only a small set of labeled training examples is required for the supervised learning process. Third, three different and large datasets are combined for autoencoder training to increase the diversity of (unlabeled) image–text pairs such that they properly capture the broad range of possible interrelations. Fourth, experimental results are reported for a challenging dataset. Finally, we discuss several applications for the proposed system and outline areas for future work.",
keywords = "Deep learning, Multimodal embeddings, Text–image relations, Visual/verbal divide",
author = "Christian Henning and Ralph Ewerth",
note = "Publisher Copyright: {\textcopyright} 2017, Springer-Verlag London Ltd., part of Springer Nature. Copyright: Copyright 2018 Elsevier B.V., All rights reserved.",
year = "2018",
month = mar,
day = "1",
doi = "10.1007/s13735-017-0142-y",
language = "English",
volume = "7",
pages = "43--56",
number = "1",
journal = "International Journal of Multimedia Information Retrieval",
issn = "2192-6611",
}

RIS

TY - JOUR

T1 - Estimating the information gap between textual and visual representations

AU - Henning, Christian

AU - Ewerth, Ralph

N1 - Publisher Copyright: © 2017, Springer-Verlag London Ltd., part of Springer Nature. Copyright: Copyright 2018 Elsevier B.V., All rights reserved.

PY - 2018/3/1

Y1 - 2018/3/1

N2 - To convey a complex matter, it is often beneficial to leverage two or more modalities. For example, slides are utilized to supplement an oral presentation, or photographs, drawings, figures, etc. are exploited in online news or scientific publications to complement textual information. However, the utilization of different modalities and their interrelations can be quite diverse. Sometimes, the transfer of information or knowledge may even be not eased, for instance, in case of contradictory information. The variety of possible interrelations of textual and graphical information and the question, how they can be described and automatically estimated have not been addressed yet by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe two different dimensions of cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, two novel deep learning systems are suggested to estimate CMI and SC of textual and visual information. The first deep neural network consists of an autoencoder that maps images and texts onto a multimodal embedding space. This representation is then exploited in order to train classifiers for SC and CMI. An advantage of this representation is that only a small set of labeled training examples is required for the supervised learning process. Third, three different and large datasets are combined for autoencoder training to increase the diversity of (unlabeled) image–text pairs such that they properly capture the broad range of possible interrelations. Fourth, experimental results are reported for a challenging dataset. Finally, we discuss several applications for the proposed system and outline areas for future work.

AB - To convey a complex matter, it is often beneficial to leverage two or more modalities. For example, slides are utilized to supplement an oral presentation, or photographs, drawings, figures, etc. are exploited in online news or scientific publications to complement textual information. However, the utilization of different modalities and their interrelations can be quite diverse. Sometimes, the transfer of information or knowledge may even be not eased, for instance, in case of contradictory information. The variety of possible interrelations of textual and graphical information and the question, how they can be described and automatically estimated have not been addressed yet by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe two different dimensions of cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, two novel deep learning systems are suggested to estimate CMI and SC of textual and visual information. The first deep neural network consists of an autoencoder that maps images and texts onto a multimodal embedding space. This representation is then exploited in order to train classifiers for SC and CMI. An advantage of this representation is that only a small set of labeled training examples is required for the supervised learning process. Third, three different and large datasets are combined for autoencoder training to increase the diversity of (unlabeled) image–text pairs such that they properly capture the broad range of possible interrelations. Fourth, experimental results are reported for a challenging dataset. Finally, we discuss several applications for the proposed system and outline areas for future work.

KW - Deep learning

KW - Multimodal embeddings

KW - Text–image relations

KW - Visual/verbal divide

UR - http://www.scopus.com/inward/record.url?scp=85035768978&partnerID=8YFLogxK

U2 - 10.1007/s13735-017-0142-y

DO - 10.1007/s13735-017-0142-y

M3 - Article

AN - SCOPUS:85035768978

VL - 7

SP - 43

EP - 56

JO - International Journal of Multimedia Information Retrieval

JF - International Journal of Multimedia Information Retrieval

SN - 2192-6611

IS - 1

ER -