Details
Original language | English |
---|---|
Title of host publication | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 18166-18175 |
Number of pages | 10 |
ISBN (electronic) | 978-1-6654-6946-3 |
ISBN (print) | 978-1-6654-6947-0 |
Publication status | Published - 2022 |
Publication series
Name | Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition |
---|---|
Volume | 2022-June |
ISSN (Print) | 1063-6919 |
Abstract
A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.
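The abstract sketches the core mechanism: fuse text and image features through a text-conditioned affine transformation, and learn a spatial mask that decides where that transformation is applied. Below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation (the official code is at https://github.com/wtliao/text2image); all module, parameter, and dimension names are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of text-conditioned, mask-guided feature
# modulation as described in the abstract. All names and sizes are assumptions.
import torch
import torch.nn as nn


class SemanticSpatialModulation(nn.Module):
    """Modulate image features with text, only where a learned mask is active."""

    def __init__(self, text_dim: int, feat_channels: int):
        super().__init__()
        # Semantic-adaptive affine parameters predicted from the sentence embedding.
        self.to_gamma = nn.Linear(text_dim, feat_channels)
        self.to_beta = nn.Linear(text_dim, feat_channels)
        # Spatial mask predicted from the current image features
        # (learned weakly supervised in the paper, i.e. without mask labels).
        self.to_mask = nn.Sequential(
            nn.Conv2d(feat_channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; text_emb: (B, text_dim) sentence embedding.
        gamma = self.to_gamma(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(text_emb).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        mask = self.to_mask(feat)                                    # (B, 1, H, W)
        modulated = gamma * feat + beta
        # Apply the text-conditioned transformation only where the mask is active.
        return mask * modulated + (1.0 - mask) * feat


if __name__ == "__main__":
    block = SemanticSpatialModulation(text_dim=256, feat_channels=64)
    out = block(torch.randn(2, 64, 32, 32), torch.randn(2, 256))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In the full model, blocks of this kind would be applied inside the generator so that text conditioning shapes the features during synthesis and the text encoder receives gradients end-to-end, as the abstract states; the residual-style blend above is just one simple way to realize a mask-guided transformation.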
Keywords
- cs.CV
- cs.LG
- Image and video synthesis and generation
- Vision + language
ASJC Scopus subject areas
- Computer Science(all)
- Software
- Computer Vision and Pattern Recognition
Cite this
Liao, W, Hu, K, Yang, MY & Rosenhahn, B. Text to Image Generation with Semantic-Spatial Aware GAN. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Institute of Electrical and Electronics Engineers Inc., 2022. p. 18166-18175 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition; Vol. 2022-June).
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Text to Image Generation with Semantic-Spatial Aware GAN
AU - Liao, Wentong
AU - Hu, Kai
AU - Yang, Michael Ying
AU - Rosenhahn, Bodo
N1 - Funding Information: This work has been supported by the Federal Ministry of Education and Research (BMBF), Germany, under the project LeibnizKILabor (grant no. 01DD20003), the Center for Digital Innovations (ZDIN) and the Deutsche Forschungsgemeinschaft (DFG) under Germany’s Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122).
PY - 2022
Y1 - 2022
N2 - A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.
AB - A text to image generation (T2I) model aims to generate photo-realistic images which are semantically consistent with the text descriptions. Built upon the recent advances in generative adversarial networks (GANs), existing T2I models have made great progress. However, a close inspection of their generated images reveals two major limitations: (1) The condition batch normalization methods are applied on the whole image feature maps equally, ignoring the local semantics; (2) The text encoder is fixed during training, which should be trained with the image generator jointly to learn better text representations for image generation. To address these limitations, we propose a novel framework Semantic-Spatial Aware GAN, which is trained in an end-to-end fashion so that the text encoder can exploit better text information. Concretely, we introduce a novel Semantic-Spatial Aware Convolution Network, which (1) learns semantic-adaptive transformation conditioned on text to effectively fuse text features and image features, and (2) learns a mask map in a weakly-supervised way that depends on the current text-image fusion process in order to guide the transformation spatially. Experiments on the challenging COCO and CUB bird datasets demonstrate the advantage of our method over the recent state-of-the-art approaches, regarding both visual fidelity and alignment with input text description. Code is available at https://github.com/wtliao/text2image.
KW - cs.CV
KW - cs.LG
KW - Image and video synthesis and generation
KW - Vision + language
UR - http://www.scopus.com/inward/record.url?scp=85139192930&partnerID=8YFLogxK
U2 - 10.1109/CVPR52688.2022.01765
DO - 10.1109/CVPR52688.2022.01765
M3 - Conference contribution
SN - 978-1-6654-6947-0
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 18166
EP - 18175
BT - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
PB - Institute of Electrical and Electronics Engineers Inc.
ER -