Computational Approaches for the Interpretation of Image-Text Relations

Research output: Chapter in book/report/conference proceeding › Contribution to book/anthology › Research › peer review

Authors

  • Ralph Ewerth
  • Christian Otto
  • Eric Müller-Budack


External Research Organisations

  • Friedrich Schiller University Jena
  • Jena University of Applied Sciences (EAH)

Details

Original language: English
Title of host publication: Empirical Multimodality Research
Subtitle of host publication: Methods, Evaluations, Implications
Editors: Jana Pflaeging, Janina Wildfeuer, John A. Bateman
Publisher: de Gruyter
Pages: 109-138
Number of pages: 30
ISBN (electronic): 9783110725001
ISBN (print): 9783110724912
Publication status: Published - 1 Jan 2021

Abstract

In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements of empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that even the automatic understanding and interpretation of a single source of information (e.g., text, image, or audio) is difficult in itself; it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed how they relate to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.
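
The three dimensions named above (cross-modal mutual information, semantic correlation, status relation) lend themselves to a compact data model. The following Python sketch is illustrative only: the value ranges, the thresholds, and the toy rule mapping dimension values to class labels are assumptions made here for exposition and only loosely echo the taxonomy from the authors' earlier work; the approaches described in the chapter estimate such relations with learned models, not hand-set rules.

# Illustrative sketch (not the authors' code): the three dimensions
# from the abstract encoded as a minimal data model. Value ranges and
# the decision rule below are assumptions for exposition.
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    """Status relation between the two modalities."""
    EQUAL = "equal"
    TEXT_SUBORDINATE = "text subordinate to image"
    IMAGE_SUBORDINATE = "image subordinate to text"

@dataclass(frozen=True)
class ImageTextRelation:
    cmi: float      # cross-modal mutual information, assumed in [0, 1]
    sc: float       # semantic correlation, assumed in [-1, 1]
    status: Status  # status relation between image and text

def classify(rel: ImageTextRelation) -> str:
    """Toy rule mapping dimension values to a class label; the
    thresholds and label set are assumptions, not the paper's model."""
    if rel.sc < 0:
        return "contrasting"        # modalities contradict each other
    if rel.cmi < 0.1 and rel.sc < 0.1:
        return "uncorrelated"       # no shared content, no correlation
    if rel.status is Status.IMAGE_SUBORDINATE:
        return "illustration"       # image merely accompanies the text
    if rel.status is Status.TEXT_SUBORDINATE:
        return "anchorage"          # text pins down the image's meaning
    return "complementary"          # both contribute on equal footing

# Example: a news photo whose caption names the depicted person and place.
print(classify(ImageTextRelation(cmi=0.8, sc=1.0, status=Status.EQUAL)))

An entity-level refinement of cross-modal mutual information, as mentioned in the abstract for news (persons, locations, scene context), would replace the single cmi score with one score per entity type.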

Keywords

    Computer vision, Deep learning, Multimodal information retrieval, Multimodal news analytics, Multimodal semiotic analysis, Semantic image-text classes

Cite this

Computational Approaches for the Interpretation of Image-Text Relations. / Ewerth, Ralph; Otto, Christian; Müller-Budack, Eric.
Empirical Multimodality Research: Methods, Evaluations, Implications. ed. / Jana Pflaeging; Janina Wildfeuer; John A. Bateman. de Gruyter, 2021. p. 109-138.

Research output: Chapter in book/report/conference proceeding › Contribution to book/anthology › Research › peer review

Ewerth, R, Otto, C & Müller-Budack, E 2021, Computational Approaches for the Interpretation of Image-Text Relations. in J Pflaeging, J Wildfeuer & JA Bateman (eds), Empirical Multimodality Research: Methods, Evaluations, Implications. de Gruyter, pp. 109-138. https://doi.org/10.1515/9783110725001-005
Ewerth, R., Otto, C., & Müller-Budack, E. (2021). Computational Approaches for the Interpretation of Image-Text Relations. In J. Pflaeging, J. Wildfeuer, & J. A. Bateman (Eds.), Empirical Multimodality Research: Methods, Evaluations, Implications (pp. 109-138). de Gruyter. https://doi.org/10.1515/9783110725001-005
Ewerth R, Otto C, Müller-Budack E. Computational Approaches for the Interpretation of Image-Text Relations. In Pflaeging J, Wildfeuer J, Bateman JA, editors, Empirical Multimodality Research: Methods, Evaluations, Implications. de Gruyter. 2021. p. 109-138. doi: 10.1515/9783110725001-005
Ewerth, Ralph ; Otto, Christian ; Müller-Budack, Eric. / Computational Approaches for the Interpretation of Image-Text Relations. Empirical Multimodality Research: Methods, Evaluations, Implications. editor / Jana Pflaeging ; Janina Wildfeuer ; John A. Bateman. de Gruyter, 2021. pp. 109-138
BibTeX
@inbook{0b5b0f7e477c4ae3b13d3e6395b33980,
title = "Computational Approaches for the Interpretation of Image-Text Relations",
abstract = "In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information.We consider the interpretation of these relations as one of the key elements for empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that already the automatic understanding and interpretation of a single source of information (e.g., text, image, or audio) is difficult — and it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, they have been rarely considered from a computer science perspective. To this end, we review the few currently existing approaches to automatically recognize semantic cross-modal relations. In previous work, we have suggested to model image-text relations along three main dimensions: Cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed their relations to existing taxonomies. Moreover, we have shown how the cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.",
keywords = "Computer vision, Deep learning, Multimodal information retrieval, Multimodal news analytics, Multimodal semiotic analysis, Semantic image-text classes",
author = "Ralph Ewerth and Christian Otto and Eric M{\"u}ller-Budack",
year = "2021",
month = jan,
day = "1",
doi = "10.1515/9783110725001-005",
language = "English",
isbn = "9783110724912",
pages = "109--138",
editor = "Pflaeging, { Jana} and {Wildfeuer }, { Janina} and Bateman, { John A. }",
booktitle = "Empirical Multimodality Research",
publisher = "de Gruyter",
address = "Germany",

}

RIS

TY - CHAP

T1 - Computational Approaches for the Interpretation of Image-Text Relations

AU - Ewerth, Ralph

AU - Otto, Christian

AU - Müller-Budack, Eric

PY - 2021/1/1

Y1 - 2021/1/1

N2 - In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements of empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that even the automatic understanding and interpretation of a single source of information (e.g., text, image, or audio) is difficult in itself; it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed how they relate to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.

AB - In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements of empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that even the automatic understanding and interpretation of a single source of information (e.g., text, image, or audio) is difficult in itself; it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed how they relate to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.

KW - Computer vision

KW - Deep learning

KW - Multimodal information retrieval

KW - Multimodal news analytics

KW - Multimodal semiotic analysis

KW - Semantic image-text classes

UR - http://www.scopus.com/inward/record.url?scp=85135273643&partnerID=8YFLogxK

U2 - 10.1515/9783110725001-005

DO - 10.1515/9783110725001-005

M3 - Contribution to book/anthology

AN - SCOPUS:85135273643

SN - 9783110724912

SP - 109

EP - 138

BT - Empirical Multimodality Research

A2 - Pflaeging, Jana

A2 - Wildfeuer, Janina

A2 - Bateman, John A.

PB - de Gruyter

ER -