Details
| Original language | English |
| --- | --- |
| Title of host publication | ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval |
| Publisher | Association for Computing Machinery (ACM) |
| Pages | 168-176 |
| Number of pages | 9 |
| ISBN (electronic) | 9781450367653 |
| Publication status | Published - 5 Jun 2019 |
| Event | 2019 ACM International Conference on Multimedia Retrieval, ICMR 2019 - Ottawa, Canada. Duration: 10 Jun 2019 → 13 Jun 2019 |
Abstract
Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images, as well as their interplay, has great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can be systematically characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
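The characterization described in the abstract, where each image-text class corresponds to a combination of the three metrics, can be pictured as a lookup over discretized metric values. The sketch below is a hypothetical illustration, not the paper's implementation: only "illustration" and "anchorage" are named in the abstract, and the value encodings and metric signatures assigned to them here are assumptions.

```python
from dataclasses import dataclass

# Hypothetical sketch of the three-metric characterization. The discrete
# value sets below (e.g., "low"/"high" for mutual information) are assumptions
# made for illustration; the paper defines the actual levels.

@dataclass(frozen=True)
class ITMetrics:
    cmi: str   # cross-modal mutual information, e.g., "low" | "high"
    sc: str    # semantic correlation, e.g., "negative" | "none" | "positive"
    stat: str  # status relation, e.g., "equal" | "image_sub" | "text_sub"

# Partial table: two of the eight classes, with assumed metric signatures.
CLASS_TABLE = {
    ITMetrics("high", "positive", "image_sub"): "illustration",
    ITMetrics("high", "positive", "text_sub"): "anchorage",
}

def classify(m: ITMetrics) -> str:
    """Map a metric combination to its image-text class, if tabulated."""
    return CLASS_TABLE.get(m, "unknown")
```

In the paper itself, a deep learning system predicts the class directly from multimodal embeddings rather than from hand-assigned metric values; the table above only conveys how the eight classes are systematically distinguished by the three metrics.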
Keywords
- Data augmentation, Image-text class, Multimodality, Semantic gap
ASJC Scopus subject areas
- Computer Science(all)
- Software
- Computer Graphics and Computer-Aided Design
- Computer Vision and Pattern Recognition
Cite this
- Standard
- Harvard
- Apa
- Vancouver
- BibTeX
- RIS
Otto, C., Springstein, M., Anand, A., & Ewerth, R. (2019). Understanding, Categorizing and Predicting Semantic Image-Text Relations. In ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval (pp. 168-176). Association for Computing Machinery (ACM).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Understanding, Categorizing and Predicting Semantic Image-Text Relations
AU - Otto, Christian
AU - Springstein, Matthias
AU - Anand, Avishek
AU - Ewerth, Ralph
PY - 2019/6/5
Y1 - 2019/6/5
N2 - Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
AB - Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, or scientific publications. The automatic understanding of semantic correlations between text and associated images as well as their interplay has a great potential for enhanced multimodal web search and recommender systems. However, automatic understanding of multimodal information is still an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and investigate, inspired by research in visual communication, useful semantic image-text relations for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can systematically be characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system to predict these classes by utilizing multimodal embeddings. To obtain a sufficiently large amount of training data, we have automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
KW - Data augmentation
KW - Image-text class
KW - Multimodality
KW - Semantic gap
UR - http://www.scopus.com/inward/record.url?scp=85068029321&partnerID=8YFLogxK
U2 - 10.48550/arXiv.1906.08595
DO - 10.48550/arXiv.1906.08595
M3 - Conference contribution
AN - SCOPUS:85068029321
SP - 168
EP - 176
BT - ICMR 2019 - Proceedings of the 2019 ACM International Conference on Multimedia Retrieval
PB - Association for Computing Machinery (ACM)
T2 - 2019 ACM International Conference on Multimedia Retrieval, ICMR 2019
Y2 - 10 June 2019 through 13 June 2019
ER -