Understanding, Categorizing and Predicting Semantic Image-Text Relations

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review

Authors

  • Christian Otto
  • Matthias Springstein
  • Avishek Anand
  • Ralph Ewerth

External Research Organisations

  • German National Library of Science and Technology (TIB)

Details

Original language: English
Title of host publication: ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval
Publisher: Association for Computing Machinery (ACM)
Pages: 168-176
Number of pages: 9
ISBN (electronic): 9781450367653
Publication status: Published - 5 Jun 2019
Event: 2019 ACM International Conference on Multimedia Retrieval, ICMR 2019 - Ottawa, Canada
Duration: 10 Jun 2019 - 13 Jun 2019

Abstract

Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
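
To make the abstract's central idea concrete, the sketch below shows how a simple lookup over the three metrics could yield a class label. It is a hypothetical Python illustration: the thresholds, the sign convention for the status relation, and any class names beyond the two quoted above ("illustration", "anchorage") are assumptions made for exposition, not the decision rules defined in the paper.

# Hypothetical sketch: mapping the three metrics named in the abstract
# (cross-modal mutual information, semantic correlation, status relation)
# to a semantic image-text class. Thresholds, sign conventions, and the
# rule order are illustrative assumptions, not the paper's definitions.

def classify_image_text_pair(cmi: float, sc: float, status: int) -> str:
    """Return a class label for an image-text pair.

    cmi    -- cross-modal mutual information, assumed scaled to [0, 1]
    sc     -- semantic correlation, assumed scaled to [-1, 1]
    status -- status relation: -1 if the image is subordinate to the text,
              0 if both have equal status, +1 if the text is subordinate
              to the image (sign convention assumed for this sketch)
    """
    if cmi < 0.2 and abs(sc) < 0.2:
        return "uncorrelated"   # the modalities share almost no content
    if sc < 0:
        return "contrasting"    # the modalities contradict each other
    if status == -1:
        return "illustration"   # the image merely illustrates the text
    if status == 1:
        return "anchorage"      # the text anchors the image's meaning
    return "complementary"      # equal status, mutually reinforcing

# Example: a caption that disambiguates an otherwise ambiguous photo.
print(classify_image_text_pair(cmi=0.8, sc=0.9, status=1))  # -> anchorage

As the abstract notes, the actual system learns this mapping from multimodal embeddings with a deep neural network rather than from hand-set thresholds.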

Keywords

Data augmentation, Image-text class, Multimodality, Semantic gap

Cite this

Understanding, Categorizing and Predicting Semantic Image-Text Relations. / Otto, Christian; Springstein, Matthias; Anand, Avishek et al.
ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval. Association for Computing Machinery (ACM), 2019. p. 168-176.

Otto, C, Springstein, M, Anand, A & Ewerth, R 2019, Understanding, Categorizing and Predicting Semantic Image-Text Relations. in ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval. Association for Computing Machinery (ACM), pp. 168-176, 2019 ACM International Conference on Multimedia Retrieval, ICMR 2019, Ottawa, Canada, 10 Jun 2019. https://doi.org/10.48550/arXiv.1906.08595, https://doi.org/10.1145/3323873.3325049
Otto, C., Springstein, M., Anand, A., & Ewerth, R. (2019). Understanding, Categorizing and Predicting Semantic Image-Text Relations. In ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval (pp. 168-176). Association for Computing Machinery (ACM). https://doi.org/10.48550/arXiv.1906.08595, https://doi.org/10.1145/3323873.3325049
Otto C, Springstein M, Anand A, Ewerth R. Understanding, Categorizing and Predicting Semantic Image-Text Relations. In ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval. Association for Computing Machinery (ACM). 2019. p. 168-176 doi: 10.48550/arXiv.1906.08595, 10.1145/3323873.3325049
Otto, Christian ; Springstein, Matthias ; Anand, Avishek et al. / Understanding, Categorizing and Predicting Semantic Image-Text Relations. ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval. Association for Computing Machinery (ACM), 2019. pp. 168-176
BibTeX
@inproceedings{fc3182aee0a54b6cb7e108363d91717e,
title = "Understanding, Categorizing and Predicting Semantic Image-Text Relations",
abstract = "Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., {"}illustration{"} or {"}anchorage{"}) and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.",
keywords = "Data augmentation, Image-text class, Multimodality, Semantic gap",
author = "Christian Otto and Matthias Springstein and Avishek Anand and Ralph Ewerth",
year = "2019",
month = jun,
day = "5",
doi = "10.48550/arXiv.1906.08595",
language = "English",
pages = "168--176",
booktitle = "ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval",
publisher = "Association for Computing Machinery (ACM)",
address = "United States",
note = "2019 ACM International Conference on Multimedia Retrieval, ICMR 2019 ; Conference date: 10-06-2019 Through 13-06-2019",

}

RIS

TY  - GEN
T1  - Understanding, Categorizing and Predicting Semantic Image-Text Relations
AU  - Otto, Christian
AU  - Springstein, Matthias
AU  - Anand, Avishek
AU  - Ewerth, Ralph
PY  - 2019/6/5
Y1  - 2019/6/5
N2  - Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
AB  - Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
KW  - Data augmentation
KW  - Image-text class
KW  - Multimodality
KW  - Semantic gap
UR  - http://www.scopus.com/inward/record.url?scp=85068029321&partnerID=8YFLogxK
U2  - 10.48550/arXiv.1906.08595
DO  - 10.48550/arXiv.1906.08595
M3  - Conference contribution
AN  - SCOPUS:85068029321
SP  - 168
EP  - 176
BT  - ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval
PB  - Association for Computing Machinery (ACM)
T2  - 2019 ACM International Conference on Multimedia Retrieval, ICMR 2019
Y2  - 10 June 2019 through 13 June 2019
ER  -