Computational Approaches for the Interpretation of Image-Text Relations

Publication: Contribution to book/report/anthology/conference proceedings › Contribution to book/anthology › Research › Peer reviewed

Authors

  • Ralph Ewerth
  • Christian Otto
  • Eric Müller-Budack

Organizational units

External organizations

  • Friedrich-Schiller-Universität Jena
  • Ernst-Abbe-Hochschule Jena (EAH)

Details

Original language: English
Title of host publication: Empirical Multimodality Research
Subtitle: Methods, Evaluations, Implications
Editors: Jana Pflaeging, Janina Wildfeuer, John A. Bateman
Publisher: de Gruyter
Pages: 109-138
Number of pages: 30
ISBN (electronic): 9783110725001
ISBN (print): 9783110724912
Publication status: Published - 1 Jan 2021

Abstract

In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements for empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that the automatic understanding and interpretation of even a single source of information (e.g., text, image, or audio) is already difficult, and it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few currently existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed their relations to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.
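
The abstract outlines how image-text pairs are modelled along three dimensions (cross-modal mutual information, semantic correlation, status relation) and grouped into eight classes. As a purely illustrative aid, the following Python sketch shows how scores along such dimensions might be combined into a coarse label; the value ranges, the status labels, and the decision rules are assumptions made here for illustration and do not reproduce the chapter's actual definitions, class inventory, or classification method.

# Illustrative sketch only; see the caveats above.
from dataclasses import dataclass

@dataclass
class ImageTextRelation:
    cross_modal_mutual_info: float  # assumed: overlap of depicted/mentioned concepts, 0.0-1.0
    semantic_correlation: float     # assumed: coherence of meaning, -1.0 (contradictory) to 1.0 (coherent)
    status: str                     # assumed: "equal", "image_subordinate", or "text_subordinate"

def coarse_image_text_class(rel: ImageTextRelation) -> str:
    """Hypothetical rule-of-thumb mapping from dimension scores to a coarse label."""
    if rel.semantic_correlation < 0.0:
        return "contradictory"
    if rel.cross_modal_mutual_info < 0.2:
        return "uncorrelated" if rel.semantic_correlation == 0.0 else "complementary"
    return "interdependent" if rel.status == "equal" else "illustration"

# Example: a pair with high concept overlap, coherent meaning, and equal status.
example = ImageTextRelation(cross_modal_mutual_info=0.7, semantic_correlation=0.9, status="equal")
print(coarse_image_text_class(example))  # -> interdependent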

ASJC Scopus subject areas

Cite this

Computational Approaches for the Interpretation of Image-Text Relations. / Ewerth, Ralph; Otto, Christian; Müller-Budack, Eric.
Empirical Multimodality Research: Methods, Evaluations, Implications. Ed. / Jana Pflaeging; Janina Wildfeuer; John A. Bateman. de Gruyter, 2021. pp. 109-138.


Ewerth, R, Otto, C & Müller-Budack, E 2021, Computational Approaches for the Interpretation of Image-Text Relations. in J Pflaeging, J Wildfeuer & JA Bateman (eds), Empirical Multimodality Research: Methods, Evaluations, Implications. de Gruyter, pp. 109-138. https://doi.org/10.1515/9783110725001-005
Ewerth, R., Otto, C., & Müller-Budack, E. (2021). Computational Approaches for the Interpretation of Image-Text Relations. In J. Pflaeging, J. Wildfeuer, & J. A. Bateman (Eds.), Empirical Multimodality Research: Methods, Evaluations, Implications (pp. 109-138). de Gruyter. https://doi.org/10.1515/9783110725001-005
Ewerth R, Otto C, Müller-Budack E. Computational Approaches for the Interpretation of Image-Text Relations. In: Pflaeging J, Wildfeuer J, Bateman JA, editors, Empirical Multimodality Research: Methods, Evaluations, Implications. de Gruyter. 2021. p. 109-138. doi: 10.1515/9783110725001-005
Ewerth, Ralph; Otto, Christian; Müller-Budack, Eric. / Computational Approaches for the Interpretation of Image-Text Relations. Empirical Multimodality Research: Methods, Evaluations, Implications. Ed. / Jana Pflaeging; Janina Wildfeuer; John A. Bateman. de Gruyter, 2021. pp. 109-138.
BibTeX
@inbook{0b5b0f7e477c4ae3b13d3e6395b33980,
title = "Computational Approaches for the Interpretation of Image-Text Relations",
abstract = "In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements for empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that the automatic understanding and interpretation of even a single source of information (e.g., text, image, or audio) is already difficult, and it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few currently existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed their relations to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.",
keywords = "Computer vision, Deep learning, Multimodal information retrieval, Multimodal news analytics, Multimodal semiotic analysis, Semantic image-text classes",
author = "Ralph Ewerth and Christian Otto and Eric M{\"u}ller-Budack",
year = "2021",
month = jan,
day = "1",
doi = "10.1515/9783110725001-005",
language = "English",
isbn = "9783110724912",
pages = "109--138",
editor = "Pflaeging, Jana and Wildfeuer, Janina and Bateman, John A.",
booktitle = "Empirical Multimodality Research",
publisher = "de Gruyter",
address = "Germany",

}

RIS

TY - CHAP

T1 - Computational Approaches for the Interpretation of Image-Text Relations

AU - Ewerth, Ralph

AU - Otto, Christian

AU - Müller-Budack, Eric

PY - 2021/1/1

Y1 - 2021/1/1

N2 - In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements for empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that the automatic understanding and interpretation of even a single source of information (e.g., text, image, or audio) is already difficult, and it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few currently existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed their relations to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.

AB - In this paper, we present approaches that automatically estimate semantic relations between textual and (pictorial) visual information. We consider the interpretation of these relations as one of the key elements for empirical research on multimodal information. From a computational perspective, it is difficult to automatically “comprehend” the meaning of multimodal information and to interpret cross-modal semantic relations. One reason is that the automatic understanding and interpretation of even a single source of information (e.g., text, image, or audio) is already difficult, and it is even more difficult to model and understand the interplay of two different modalities. While the complex interplay of visual and textual information has been investigated in communication sciences and linguistics for years, it has rarely been considered from a computer science perspective. To this end, we review the few currently existing approaches that automatically recognize semantic cross-modal relations. In previous work, we have suggested modeling image-text relations along three main dimensions: cross-modal mutual information, semantic correlation, and the status relation. Using these dimensions, we characterized a set of eight image-text classes and showed their relations to existing taxonomies. Moreover, we have shown how cross-modal mutual information can be further differentiated in order to measure image-text consistency in news at the entity level of persons, locations, and scene context. Experimental results demonstrate the feasibility of the approaches.

KW - Computer vision

KW - Deep learning

KW - Multimodal information retrieval

KW - Multimodal news analytics

KW - Multimodal semiotic analysis

KW - Semantic image-text classes

UR - http://www.scopus.com/inward/record.url?scp=85135273643&partnerID=8YFLogxK

U2 - 10.1515/9783110725001-005

DO - 10.1515/9783110725001-005

M3 - Contribution to book/anthology

AN - SCOPUS:85135273643

SN - 9783110724912

SP - 109

EP - 138

BT - Empirical Multimodality Research

A2 - Pflaeging, Jana

A2 - Wildfeuer, Janina

A2 - Bateman, John A.

PB - de Gruyter

ER -