Characterization and classification of semantic image-text relations

Research output: Contribution to journal › Article › Research › peer review

Authors

  • Christian Otto
  • Matthias Springstein
  • Avishek Anand
  • Ralph Ewerth

External Research Organisations

  • German National Library of Science and Technology (TIB)

Details

Original language: English
Pages (from-to): 31-45
Number of pages: 15
Journal: International Journal of Multimedia Information Retrieval
Volume: 9
Issue number: 1
Early online date: 22 Jan 2020
Publication status: Published - Mar 2020

Abstract

The beneficial, complementary nature of visual and textual information for conveying meaning is widely known, for example, in entertainment, news, advertisements, science, and education. While the complex interplay of image and text in forming semantic meaning has been studied thoroughly in linguistics and communication sciences for several decades, research in computer vision and multimedia has, by and large, only scratched the surface of the problem. An exception is previous work that introduced the two metrics Cross-Modal Mutual Information and Semantic Correlation to model complex image-text relations. In this paper, we motivate the necessity of an additional metric, called Status, to cover complex image-text relations more completely. This set of metrics enables us to derive a novel categorization of eight semantic image-text classes based on three dimensions. In addition, we demonstrate how to automatically gather and augment a dataset for these classes from the Web. Furthermore, we present a deep learning system to automatically predict each of the three metrics, as well as a system to directly predict the eight image-text classes. Experimental results show the feasibility of the approach, with the system that predicts all eight classes directly outperforming the cascaded approach built from the individual metric classifiers.
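As a rough illustration of the categorization idea described above (not the authors' implementation), the sketch below assumes each of the three dimensions, Cross-Modal Mutual Information (CMI), Semantic Correlation (SC), and Status, is binarized, so that their combinations span 2^3 = 8 image-text classes. All names in the snippet are hypothetical.

```python
from itertools import product

# Hypothetical sketch: binarize the three dimensions named in the abstract
# (Cross-Modal Mutual Information, Semantic Correlation, Status) and enumerate
# their combinations. With two levels per dimension, 2**3 = 8 classes result.
DIMENSIONS = ["cross_modal_mutual_information", "semantic_correlation", "status"]

def image_text_class(cmi: bool, sc: bool, status: bool) -> int:
    """Map three binary metric decisions to one of eight class indices (0-7)."""
    return (cmi << 2) | (sc << 1) | int(status)

# Enumerate every combination: exactly eight distinct classes.
all_classes = {image_text_class(*combo) for combo in product([False, True], repeat=3)}
print(len(all_classes))  # 8
```

Note that the actual class definitions in the paper are derived from an analysis of image-text relations, so not every combination need correspond to a meaningful class; the sketch only shows why three dimensions yield at most eight categories.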

Keywords

    Data augmentation, Image-text class, Multimodality, Semantic gap

Cite this

Characterization and classification of semantic image-text relations. / Otto, Christian; Springstein, Matthias; Anand, Avishek et al.
In: International Journal of Multimedia Information Retrieval, Vol. 9, No. 1, 03.2020, p. 31-45.


Otto, C, Springstein, M, Anand, A & Ewerth, R 2020, 'Characterization and classification of semantic image-text relations', International Journal of Multimedia Information Retrieval, vol. 9, no. 1, pp. 31-45. https://doi.org/10.1007/s13735-019-00187-6
Otto, C., Springstein, M., Anand, A., & Ewerth, R. (2020). Characterization and classification of semantic image-text relations. International Journal of Multimedia Information Retrieval, 9(1), 31-45. https://doi.org/10.1007/s13735-019-00187-6
Otto C, Springstein M, Anand A, Ewerth R. Characterization and classification of semantic image-text relations. International Journal of Multimedia Information Retrieval. 2020 Mar;9(1):31-45. Epub 2020 Jan 22. doi: 10.1007/s13735-019-00187-6
Otto, Christian ; Springstein, Matthias ; Anand, Avishek et al. / Characterization and classification of semantic image-text relations. In: International Journal of Multimedia Information Retrieval. 2020 ; Vol. 9, No. 1. pp. 31-45.
BibTeX
@article{cc603d4ec452451db3f63607fad06eea,
title = "Characterization and classification of semantic image-text relations",
abstract = "The beneficial, complementary nature of visual and textual information to convey information is widely known, for example, in entertainment, news, advertisements, science, or education. While the complex interplay of image and text to form semantic meaning has been thoroughly studied in linguistics and communication sciences for several decades, computer vision and multimedia research remained on the surface of the problem more or less. An exception is previous work that introduced the two metrics Cross-Modal Mutual Information and Semantic Correlation in order to model complex image-text relations. In this paper, we motivate the necessity of an additional metric called Status in order to cover complex image-text relations more completely. This set of metrics enables us to derive a novel categorization of eight semantic image-text classes based on three dimensions. In addition, we demonstrate how to automatically gather and augment a dataset for these classes from the Web. Further, we present a deep learning system to automatically predict either of the three metrics, as well as a system to directly predict the eight image-text classes. Experimental results show the feasibility of the approach, whereby the predict-all approach outperforms the cascaded approach of the metric classifiers.",
keywords = "Data augmentation, Image-text class, Multimodality, Semantic gap",
author = "Christian Otto and Matthias Springstein and Avishek Anand and Ralph Ewerth",
note = "Funding Information: Open Access funding provided by Projekt DEAL. Part of this work is financially supported by the Leibniz Association, Germany (Leibniz Competition 2018, funding line “Collaborative Excellence”, Project SALIENT [K68/2017]). ",
year = "2020",
month = mar,
doi = "10.1007/s13735-019-00187-6",
language = "English",
volume = "9",
pages = "31--45",
number = "1",
journal = "International Journal of Multimedia Information Retrieval",
issn = "2192-6611",

}

RIS

TY  - JOUR
T1  - Characterization and classification of semantic image-text relations
AU  - Otto, Christian
AU  - Springstein, Matthias
AU  - Anand, Avishek
AU  - Ewerth, Ralph
N1  - Funding Information: Open Access funding provided by Projekt DEAL. Part of this work is financially supported by the Leibniz Association, Germany (Leibniz Competition 2018, funding line "Collaborative Excellence", Project SALIENT [K68/2017]).
PY  - 2020/3
Y1  - 2020/3
N2  - The beneficial, complementary nature of visual and textual information to convey information is widely known, for example, in entertainment, news, advertisements, science, or education. While the complex interplay of image and text to form semantic meaning has been thoroughly studied in linguistics and communication sciences for several decades, computer vision and multimedia research remained on the surface of the problem more or less. An exception is previous work that introduced the two metrics Cross-Modal Mutual Information and Semantic Correlation in order to model complex image-text relations. In this paper, we motivate the necessity of an additional metric called Status in order to cover complex image-text relations more completely. This set of metrics enables us to derive a novel categorization of eight semantic image-text classes based on three dimensions. In addition, we demonstrate how to automatically gather and augment a dataset for these classes from the Web. Further, we present a deep learning system to automatically predict either of the three metrics, as well as a system to directly predict the eight image-text classes. Experimental results show the feasibility of the approach, whereby the predict-all approach outperforms the cascaded approach of the metric classifiers.
AB  - The beneficial, complementary nature of visual and textual information to convey information is widely known, for example, in entertainment, news, advertisements, science, or education. While the complex interplay of image and text to form semantic meaning has been thoroughly studied in linguistics and communication sciences for several decades, computer vision and multimedia research remained on the surface of the problem more or less. An exception is previous work that introduced the two metrics Cross-Modal Mutual Information and Semantic Correlation in order to model complex image-text relations. In this paper, we motivate the necessity of an additional metric called Status in order to cover complex image-text relations more completely. This set of metrics enables us to derive a novel categorization of eight semantic image-text classes based on three dimensions. In addition, we demonstrate how to automatically gather and augment a dataset for these classes from the Web. Further, we present a deep learning system to automatically predict either of the three metrics, as well as a system to directly predict the eight image-text classes. Experimental results show the feasibility of the approach, whereby the predict-all approach outperforms the cascaded approach of the metric classifiers.
KW  - Data augmentation
KW  - Image-text class
KW  - Multimodality
KW  - Semantic gap
UR  - http://www.scopus.com/inward/record.url?scp=85078351928&partnerID=8YFLogxK
U2  - 10.1007/s13735-019-00187-6
DO  - 10.1007/s13735-019-00187-6
M3  - Article
AN  - SCOPUS:85078351928
VL  - 9
SP  - 31
EP  - 45
JO  - International Journal of Multimedia Information Retrieval
JF  - International Journal of Multimedia Information Retrieval
SN  - 2192-6611
IS  - 1
ER  - 