Details
Original language | English |
---|---|
Title of host publication | Computer Vision – ACCV 2020 |
Subtitle | 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part IV |
Editors | Hiroshi Ishikawa, Cheng-Lin Liu, Tomas Pajdla, Jianbo Shi |
Pages | 153-169 |
Number of pages | 17 |
ISBN (electronic) | 978-3-030-69538-5 |
Publication status | Published - 2021 |
Publication series
Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
---|---|
Volume | 12625 LNCS |
ISSN (Print) | 0302-9743 |
ISSN (electronic) | 1611-3349 |
ASJC Scopus subject areas
- Mathematics (all)
- Theoretical Computer Science
- Computer Science (all)
- General Computer Science
Cite this
Computer Vision – ACCV 2020: 15th Asian Conference on Computer Vision, Kyoto, Japan, November 30 – December 4, 2020, Revised Selected Papers, Part IV. ed. / Hiroshi Ishikawa; Cheng-Lin Liu; Tomas Pajdla; Jianbo Shi. 2021. p. 153-169 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 12625 LNCS).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research
TY - GEN
T1 - Image Captioning through Image Transformer
AU - He, Sen
AU - Liao, Wentong
AU - Tavakoli, Hamed R.
AU - Yang, Michael
AU - Rosenhahn, Bodo
AU - Pugeault, Nicolas
PY - 2021
Y1 - 2021
N2 - Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the detected regions from an object detection model) and in sentences (each single word) is different. Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationship between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both MSCOCO offline and online testing benchmarks.
KW - cs.CV
UR - http://www.scopus.com/inward/record.url?scp=85103275378&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-69538-5_10
DO - 10.1007/978-3-030-69538-5_10
M3 - Conference contribution
SN - 978-3-030-69537-8
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 153
EP - 169
BT - Computer Vision – ACCV 2020
A2 - Ishikawa, Hiroshi
A2 - Liu, Cheng-Lin
A2 - Pajdla, Tomas
A2 - Shi, Jianbo
ER -