Estimating the Information Gap between Textual and Visual Representations

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Christian Henning
  • Ralph Ewerth

External Research Organisations

  • German National Library of Science and Technology (TIB)

Details

Original language: English
Title of host publication: ICMR 2017
Subtitle of host publication: Proceedings of the 2017 ACM International Conference on Multimedia Retrieval
Pages: 14-22
Number of pages: 9
ISBN (electronic): 9781450347013
Publication status: Published - 6 Jun 2017
Event: 17th ACM International Conference on Multimedia Retrieval, ICMR 2017 - Bucharest, Romania
Duration: 6 Jun 2017 – 9 Jun 2017

Publication series

Name: ICMR 2017 - Proceedings of the 2017 ACM International Conference on Multimedia Retrieval

Abstract

Photos, drawings, figures, etc. supplement textual information in various kinds of media, for example, in web news or scientific publications. The intended effect of an image can differ considerably, e.g., providing additional information, focusing on certain details of the surrounding text, or simply serving as a general illustration of a topic. As a consequence, the semantic correlation between the information of different modalities can vary noticeably, too. Moreover, cross-modal interrelations are often hard to describe in a precise way. The variety of possible interrelations between textual and graphical information, and the question of how they can be described and automatically estimated, have not been addressed by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, a novel approach relying on deep learning is suggested to estimate the CMI and SC of textual and visual information. Third, three diverse datasets are leveraged to learn an appropriate deep neural network model for this demanding task. The system has been evaluated on a challenging test set, and the experimental results demonstrate the feasibility of the approach.
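As background to the first contribution: the name of the CMI measure alludes to the classical (Shannon) mutual information between two random variables, shown below. The paper's concrete, per-pair formulation of CMI and SC for image-text pairs is not reproduced in this record, so this is only the standard information-theoretic definition:

I(X;Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \, \log \frac{p(x,y)}{p(x)\,p(y)}

Here X and Y stand for the textual and visual modality, p(x,y) for their joint distribution, and p(x), p(y) for the marginals; I(X;Y) is zero when the modalities are independent and grows as they share more information, which matches the intuition of an "information gap" between text and image.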

Cite this

Estimating the Information Gap between Textual and Visual Representations. / Henning, Christian; Ewerth, Ralph.
ICMR 2017: Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. 2017. p. 14-22 (ICMR 2017 - Proceedings of the 2017 ACM International Conference on Multimedia Retrieval).


Henning, C & Ewerth, R 2017, Estimating the Information Gap between Textual and Visual Representations. in ICMR 2017: Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. ICMR 2017 - Proceedings of the 2017 ACM International Conference on Multimedia Retrieval, pp. 14-22, 17th ACM International Conference on Multimedia Retrieval, ICMR 2017, Bucharest, Romania, 6 Jun 2017. https://doi.org/10.1145/3078971.3078991
Henning, C., & Ewerth, R. (2017). Estimating the Information Gap between Textual and Visual Representations. In ICMR 2017: Proceedings of the 2017 ACM International Conference on Multimedia Retrieval (pp. 14-22). (ICMR 2017 - Proceedings of the 2017 ACM International Conference on Multimedia Retrieval). https://doi.org/10.1145/3078971.3078991
Henning C, Ewerth R. Estimating the Information Gap between Textual and Visual Representations. In ICMR 2017: Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. 2017. p. 14-22. (ICMR 2017 - Proceedings of the 2017 ACM International Conference on Multimedia Retrieval). doi: 10.1145/3078971.3078991
Henning, Christian ; Ewerth, Ralph. / Estimating the Information Gap between Textual and Visual Representations. ICMR 2017: Proceedings of the 2017 ACM International Conference on Multimedia Retrieval. 2017. pp. 14-22 (ICMR 2017 - Proceedings of the 2017 ACM International Conference on Multimedia Retrieval).
@inproceedings{a0291c13e6d74baa96412917ec5c8ba6,
title = "Estimating the Information Gap between Textual and Visual Representations",
abstract = "Photos, drawings, figures, etc. supplement textual information in various kinds of media, for example, in web news or scientific publications. The intended effect of an image can differ considerably, e.g., providing additional information, focusing on certain details of the surrounding text, or simply serving as a general illustration of a topic. As a consequence, the semantic correlation between the information of different modalities can vary noticeably, too. Moreover, cross-modal interrelations are often hard to describe in a precise way. The variety of possible interrelations between textual and graphical information, and the question of how they can be described and automatically estimated, have not been addressed by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, a novel approach relying on deep learning is suggested to estimate the CMI and SC of textual and visual information. Third, three diverse datasets are leveraged to learn an appropriate deep neural network model for this demanding task. The system has been evaluated on a challenging test set, and the experimental results demonstrate the feasibility of the approach.",
author = "Christian Henning and Ralph Ewerth",
year = "2017",
month = jun,
day = "6",
doi = "10.1145/3078971.3078991",
language = "English",
series = "ICMR 2017 - Proceedings of the 2017 ACM International Conference on Multimedia Retrieval",
pages = "14--22",
booktitle = "ICMR 2017",
note = "17th ACM International Conference on Multimedia Retrieval, ICMR 2017 ; Conference date: 06-06-2017 Through 09-06-2017",

}


TY - GEN

T1 - Estimating the Information Gap between Textual and Visual Representations

AU - Henning, Christian

AU - Ewerth, Ralph

PY - 2017/6/6

Y1 - 2017/6/6

N2 - Photos, drawings, figures, etc. supplement textual information in various kinds of media, for example, in web news or scientific publications. The intended effect of an image can differ considerably, e.g., providing additional information, focusing on certain details of the surrounding text, or simply serving as a general illustration of a topic. As a consequence, the semantic correlation between the information of different modalities can vary noticeably, too. Moreover, cross-modal interrelations are often hard to describe in a precise way. The variety of possible interrelations between textual and graphical information, and the question of how they can be described and automatically estimated, have not been addressed by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, a novel approach relying on deep learning is suggested to estimate the CMI and SC of textual and visual information. Third, three diverse datasets are leveraged to learn an appropriate deep neural network model for this demanding task. The system has been evaluated on a challenging test set, and the experimental results demonstrate the feasibility of the approach.

AB - Photos, drawings, figures, etc. supplement textual information in various kinds of media, for example, in web news or scientific publications. The intended effect of an image can differ considerably, e.g., providing additional information, focusing on certain details of the surrounding text, or simply serving as a general illustration of a topic. As a consequence, the semantic correlation between the information of different modalities can vary noticeably, too. Moreover, cross-modal interrelations are often hard to describe in a precise way. The variety of possible interrelations between textual and graphical information, and the question of how they can be described and automatically estimated, have not been addressed by previous work. In this paper, we present several contributions to close this gap. First, we introduce two measures to describe cross-modal interrelations: cross-modal mutual information (CMI) and semantic correlation (SC). Second, a novel approach relying on deep learning is suggested to estimate the CMI and SC of textual and visual information. Third, three diverse datasets are leveraged to learn an appropriate deep neural network model for this demanding task. The system has been evaluated on a challenging test set, and the experimental results demonstrate the feasibility of the approach.

UR - http://www.scopus.com/inward/record.url?scp=85021839630&partnerID=8YFLogxK

U2 - 10.1145/3078971.3078991

DO - 10.1145/3078971.3078991

M3 - Conference contribution

AN - SCOPUS:85021839630

T3 - ICMR 2017 - Proceedings of the 2017 ACM International Conference on Multimedia Retrieval

SP - 14

EP - 22

BT - ICMR 2017

T2 - 17th ACM International Conference on Multimedia Retrieval, ICMR 2017

Y2 - 6 June 2017 through 9 June 2017

ER -