Computational Approaches for the Interpretation of Image-Text Relations

Research output: Chapter in book/report/conference proceeding › Contribution to book/anthology › Research › peer review

Authors

  • Ralph Ewerth
  • Christian Otto
  • Eric Müller-Budack


External Research Organisations

  • Friedrich Schiller University Jena
  • Jena University of Applied Sciences (EAH)

Details

Original language: English
Title of host publication: Empirical Multimodality Research
Subtitle of host publication: Methods, Evaluations, Implications
Editors: Jana Pflaeging, Janina Wildfeuer, John A. Bateman
Publisher: de Gruyter
Pages: 109-138
Number of pages: 30
ISBN (electronic): 9783110725001
ISBN (print): 9783110724912
Publication status: Published - 1 Jan 2021

Abstract

In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements of empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that even the automatic understanding and interpretation of a single source of information (e.g., text, image, or audio) is difficult in itself; it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed how they relate to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.
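
The three dimensions named above (cross-modal mutual information, semantic correlation, status relation) lend themselves to a compact data model. The following Python sketch is illustrative only: the value ranges, the thresholds, and the toy rule mapping dimension values to class labels are assumptions made here for exposition and only loosely echo the taxonomy from the authors' earlier work; the approaches described in the chapter estimate such relations with learned models, not hand-set rules.

# Illustrative sketch (not the authors' code): the three dimensions
# from the abstract encoded as a minimal data model. Value ranges and
# the decision rule below are assumptions for exposition.
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    """Status relation between the two modalities."""
    EQUAL = "equal"
    TEXT_SUBORDINATE = "text subordinate to image"
    IMAGE_SUBORDINATE = "image subordinate to text"

@dataclass(frozen=True)
class ImageTextRelation:
    cmi: float      # cross-modal mutual information, assumed in [0, 1]
    sc: float       # semantic correlation, assumed in [-1, 1]
    status: Status  # status relation between image and text

def classify(rel: ImageTextRelation) -> str:
    """Toy rule mapping dimension values to a class label; the
    thresholds and label set are assumptions, not the paper's model."""
    if rel.sc < 0:
        return "contrasting"        # modalities contradict each other
    if rel.cmi < 0.1 and rel.sc < 0.1:
        return "uncorrelated"       # no shared content, no correlation
    if rel.status is Status.IMAGE_SUBORDINATE:
        return "illustration"       # image merely accompanies the text
    if rel.status is Status.TEXT_SUBORDINATE:
        return "anchorage"          # text pins down the image's meaning
    return "complementary"          # both contribute on equal footing

# Example: a news photo whose caption names the depicted person and place.
print(classify(ImageTextRelation(cmi=0.8, sc=1.0, status=Status.EQUAL)))

An entity-level refinement of cross-modal mutual information, as mentioned in the abstract for news (persons, locations, scene context), would replace the single cmi score with one score per entity type.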

Keywords

    Computer vision, Deep learning, Multimodal information retrieval, Multimodal news analytics, Multimodal semiotic analysis, Semantic image-text classes

Cite this

Computational Approaches for the Interpretation of Image-Text Relations. / Ewerth, Ralph; Otto, Christian; Müller-Budack, Eric.
Empirical Multimodality Research: Methods, Evaluations, Implications. ed. / Jana Pflaeging; Janina Wildfeuer; John A. Bateman. de Gruyter, 2021. p. 109-138.

Research output: Chapter in book/report/conference proceeding › Contribution to book/anthology › Research › peer review

Ewerth, R, Otto, C & Müller-Budack, E 2021, Computational Approaches for the Interpretation of Image-Text Relations. in J Pflaeging, J Wildfeuer & JA Bateman (eds), Empirical Multimodality Research: Methods, Evaluations, Implications. de Gruyter, pp. 109-138. https://doi.org/10.1515/9783110725001-005
Ewerth, R., Otto, C., & Müller-Budack, E. (2021). Computational Approaches for the Interpretation of Image-Text Relations. In J. Pflaeging, J. Wildfeuer, & J. A. Bateman (Eds.), Empirical Multimodality Research: Methods, Evaluations, Implications (pp. 109-138). de Gruyter. https://doi.org/10.1515/9783110725001-005
Ewerth R, Otto C, Müller-Budack E. Computational Approaches for the Interpretation of Image-Text Relations. In Pflaeging J, Wildfeuer J, Bateman JA, editors, Empirical Multimodality Research: Methods, Evaluations, Implications. de Gruyter. 2021. p. 109-138. doi: 10.1515/9783110725001-005
Ewerth, Ralph ; Otto, Christian ; Müller-Budack, Eric. / Computational Approaches for the Interpretation of Image-Text Relations. Empirical Multimodality Research: Methods, Evaluations, Implications. editor / Jana Pflaeging ; Janina Wildfeuer ; John A. Bateman. de Gruyter, 2021. pp. 109-138
BibTeX
@inbook{0b5b0f7e477c4ae3b13d3e6395b33980,
title = "Computational Approaches for the Interpretation of Image-Text Relations",
abstract = "In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information.We consider the interpretation of these relations as one of the key elements for empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that already the automatic understanding and interpretation of a single source of information (e.g., text, image, or audio) is difficult — and it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, they have been rarely considered from a computer science perspective. To this end, we review the few currently existing approaches to automatically recognize semantic cross-modal relations. In previous work, we have suggested to model image-text relations along three main dimensions: Cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed their relations to existing taxonomies. Moreover, we have shown how the cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.",
keywords = "Computer vision, Deep learning, Multimodal information retrieval, Multimodal news analytics, Multimodal semiotic analysis, Semantic image-text classes",
author = "Ralph Ewerth and Christian Otto and Eric M{\"u}ller-Budack",
year = "2021",
month = jan,
day = "1",
doi = "10.1515/9783110725001-005",
language = "English",
isbn = "9783110724912",
pages = "109--138",
editor = "Pflaeging, { Jana} and {Wildfeuer }, { Janina} and Bateman, { John A. }",
booktitle = "Empirical Multimodality Research",
publisher = "de Gruyter",
address = "Germany",

}

RIS

TY - CHAP

T1 - Computational Approaches for the Interpretation of Image-Text Relations

AU - Ewerth, Ralph

AU - Otto, Christian

AU - Müller-Budack, Eric

PY - 2021/1/1

Y1 - 2021/1/1

N2 - In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements of empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that even the automatic understanding and interpretation of a single source of information (e.g., text, image, or audio) is difficult in itself; it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed how they relate to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.

AB - In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements of empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that even the automatic understanding and interpretation of a single source of information (e.g., text, image, or audio) is difficult in itself; it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed how they relate to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.

KW - Computer vision

KW - Deep learning

KW - Multimodal information retrieval

KW - Multimodal news analytics

KW - Multimodal semiotic analysis

KW - Semantic image-text classes

UR - http://www.scopus.com/inward/record.url?scp=85135273643&partnerID=8YFLogxK

U2 - 10.1515/9783110725001-005

DO - 10.1515/9783110725001-005

M3 - Contribution to book/anthology

AN - SCOPUS:85135273643

SN - 9783110724912

SP - 109

EP - 138

BT - Empirical Multimodality Research

A2 - Pflaeging, Jana

A2 - Wildfeuer, Janina

A2 - Bateman, John A.

PB - de Gruyter

ER -