MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression

Publication: Contribution to journal › Article › Research › Peer-reviewed

Authors

  • Yi Hsin Chen
  • Hong Sheng Xie
  • Cheng Wei Chen
  • Zong Lin Gao
  • Martin Benjak
  • Wen Hsiao Peng
  • Jörn Ostermann

External organisations

  • National Yang Ming Chiao Tung University (NSTC)

Details

Original language: English
Pages (from-to): 1
Number of pages: 1
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Volume: 34
Issue number: 11
Publication status: Published - 12 Jul 2024

Abstract

Conditional coding has lately emerged as the mainstream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when an information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than the intra frame. Recognizing that this assumption does not always hold, due to disocclusion or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a pixel-adaptive hybrid of conditional coding and conditional residual coding. We introduce a Transformer-based conditional autoencoder and investigate several strategies for conditioning it for inter-frame coding, a topic that remains largely under-explored. Additionally, we propose a channel transform module (CTM) that decorrelates the image latents along the channel dimension, allowing a simple hyperprior to approach the compression performance of a channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual Transformer (termed MaskCRT) over both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT achieves BD-rate results comparable to VTM-17.0 under the low-delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.
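The pixel-adaptive hybrid described in the abstract admits a compact arithmetic sketch. The following minimal NumPy illustration assumes the coded signal takes the form x_t − m ⊙ x_c, where x_c is the temporal prediction and m is a per-pixel soft mask in [0, 1] (m = 0 everywhere recovers plain conditional coding, m = 1 everywhere recovers conditional residual coding); the function and variable names are illustrative, not the paper's API.

```python
import numpy as np

def masked_conditional_residual(x_t, x_c, mask):
    """Signal handed to the encoder: x_t - mask * x_c.

    mask == 1 everywhere -> conditional residual coding
    mask == 0 everywhere -> plain conditional coding
    A learned soft mask blends the two modes per pixel.
    """
    return x_t - mask * x_c

def reconstruct(coded, x_c, mask):
    """Invert the masking at the decoder side."""
    return coded + mask * x_c

# Toy 2x2 frames (illustrative values only)
x_t = np.array([[0.8, 0.5], [0.2, 0.9]])   # current frame
x_c = np.array([[0.7, 0.1], [0.2, 0.4]])   # temporal prediction / condition
mask = np.array([[1.0, 0.0], [0.5, 1.0]])  # soft mask in [0, 1]

coded = masked_conditional_residual(x_t, x_c, mask)
# Masking is exactly invertible given x_c and the mask
assert np.allclose(reconstruct(coded, x_c, mask), x_t)
```

The point of the soft mask is that disoccluded or poorly predicted pixels (where subtracting x_c would raise, not lower, the entropy) can fall back towards conditional coding, while well-predicted pixels keep the residual form.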


Cite this

MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression. / Chen, Yi Hsin; Xie, Hong Sheng; Chen, Cheng Wei et al.
In: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, No. 11, 12.07.2024, p. 1.


Chen YH, Xie HS, Chen CW, Gao ZL, Benjak M, Peng WH et al. MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression. IEEE Transactions on Circuits and Systems for Video Technology. 2024 Jul 12;34(11):1. doi: 10.1109/TCSVT.2024.3427426, 10.48550/arXiv.2312.15829
Chen, Yi Hsin ; Xie, Hong Sheng ; Chen, Cheng Wei et al. / MaskCRT : Masked Conditional Residual Transformer for Learned Video Compression. In: IEEE Transactions on Circuits and Systems for Video Technology. 2024 ; Vol. 34, No. 11. p. 1.
Download (BibTeX)
@article{bc6f5c43c3d147e795f5cdff9088acec,
title = "MaskCRT: Masked Conditional Residual Transformer for Learned Video Compression",
abstract = "Conditional coding has lately emerged as the main-stream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when the information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than that of the intra frame. Recognizing that this assumption is not always true due to dis-occlusion phenomena or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel adaptive manner. We introduce a Transformer-based conditional autoencoder. Several strategies are investigated with regard to how to condition a Transformer-based autoencoder for inter-frame coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, with the aim of using the simple hyperprior to approach similar compression performance to the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.",
keywords = "Encoding, Entropy, Feature extraction, Image coding, Learned video compression, masked conditional residual coding, Transformer-based video compression, Transformers, Video codecs, Video compression",
author = "Chen, {Yi Hsin} and Xie, {Hong Sheng} and Chen, {Cheng Wei} and Gao, {Zong Lin} and Martin Benjak and Peng, {Wen Hsiao} and Jorn Ostermann",
note = "Publisher Copyright: IEEE",
year = "2024",
month = jul,
day = "12",
doi = "10.1109/TCSVT.2024.3427426",
language = "English",
volume = "34",
pages = "1",
journal = "IEEE Transactions on Circuits and Systems for Video Technology",
issn = "1051-8215",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "11",

}

Download (RIS)

TY - JOUR

T1 - MaskCRT

T2 - Masked Conditional Residual Transformer for Learned Video Compression

AU - Chen, Yi Hsin

AU - Xie, Hong Sheng

AU - Chen, Cheng Wei

AU - Gao, Zong Lin

AU - Benjak, Martin

AU - Peng, Wen Hsiao

AU - Ostermann, Jorn

N1 - Publisher Copyright: IEEE

PY - 2024/7/12

Y1 - 2024/7/12

N2 - Conditional coding has lately emerged as the main-stream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when the information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than that of the intra frame. Recognizing that this assumption is not always true due to dis-occlusion phenomena or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel adaptive manner. We introduce a Transformer-based conditional autoencoder. Several strategies are investigated with regard to how to condition a Transformer-based autoencoder for inter-frame coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, with the aim of using the simple hyperprior to approach similar compression performance to the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.

AB - Conditional coding has lately emerged as the main-stream approach to learned video compression. However, a recent study shows that it may perform worse than residual coding when the information bottleneck arises. Conditional residual coding was thus proposed, creating a new school of thought to improve on conditional coding. Notably, conditional residual coding relies heavily on the assumption that the residual frame has a lower entropy rate than that of the intra frame. Recognizing that this assumption is not always true due to dis-occlusion phenomena or unreliable motion estimates, we propose a masked conditional residual coding scheme. It learns a soft mask to form a hybrid of conditional coding and conditional residual coding in a pixel adaptive manner. We introduce a Transformer-based conditional autoencoder. Several strategies are investigated with regard to how to condition a Transformer-based autoencoder for inter-frame coding, a topic that is largely under-explored. Additionally, we propose a channel transform module (CTM) to decorrelate the image latents along the channel dimension, with the aim of using the simple hyperprior to approach similar compression performance to the channel-wise autoregressive model. Experimental results confirm the superiority of our masked conditional residual transformer (termed MaskCRT) to both conditional coding and conditional residual coding. On commonly used datasets, MaskCRT shows comparable BD-rate results to VTM-17.0 under the low delay P configuration in terms of PSNR-RGB and outperforms VTM-17.0 in terms of MS-SSIM-RGB. It also opens up a new research direction for advancing learned video compression.

KW - Encoding

KW - Entropy

KW - Feature extraction

KW - Image coding

KW - Learned video compression

KW - masked conditional residual coding

KW - Transformer-based video compression

KW - Transformers

KW - Video codecs

KW - Video compression

UR - http://www.scopus.com/inward/record.url?scp=85198379094&partnerID=8YFLogxK

U2 - 10.1109/TCSVT.2024.3427426

DO - 10.1109/TCSVT.2024.3427426

M3 - Article

AN - SCOPUS:85198379094

VL - 34

SP - 1

JO - IEEE Transactions on Circuits and Systems for Video Technology

JF - IEEE Transactions on Circuits and Systems for Video Technology

SN - 1051-8215

IS - 11

ER -
