Details
Original language | English |
---|---|
Pages (from-to) | 981-990 |
Number of pages | 10 |
Journal | ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences |
Volume | 10 |
Issue number | 1 |
Publication status | Published - 5 Dec 2023 |
Event | ISPRS Geospatial Week 2023 - Cairo, Egypt. Duration: 2 Sept 2023 → 7 Sept 2023 |
Abstract
The pixel-wise classification of land cover, i.e. the task of identifying the physical material of the Earth's surface in an image, is one of the basic applications of satellite image time series (SITS) processing. With the availability of large amounts of SITS it is possible to use supervised deep learning techniques such as Transformer models to analyse the Earth's surface at global scale and with high spatial and temporal resolution. While most approaches for land cover classification focus on the generation of a mono-temporal output map, we extend established deep learning models to multi-temporal input and output: using images acquired at different epochs, we generate one output map for each input timestep. This has the advantage that the temporal change of land cover can be monitored. In addition, features conflicting over time are not averaged. We extend the Swin Transformer for SITS and introduce a new spatio-temporal transformer block (ST-TB) that extracts spatial and temporal features. We combine the ST-TB with the Swin Transformer block (STB), which is used in parallel for the individual input timesteps to extract spatial features. Furthermore, we investigate the usage of a temporal position encoding and different patch sizes. The patch size is used to merge neighbouring pixels in the input embedding. Using SITS from Sentinel-2, the classification of land cover is improved by +1.8% in the mean F1-Score when using the ST-TB in the first stage of the Swin Transformer compared to a Swin Transformer without the ST-TB layer and by +1.6% compared to fully convolutional approaches. This demonstrates the advantage of the introduced ST-TB layer for the classification of SITS.
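The improvements above are reported in the mean F1-Score, with one output map per input timestep. As a minimal sketch, assuming that "mean F1-Score" denotes the unweighted average of the per-class F1 scores, the metric could be evaluated separately for each timestep as follows. The function name `mean_f1`, the class names and the toy label maps are hypothetical and serve only as illustration; they are not taken from the paper.

```python
from collections import Counter


def mean_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores over flattened pixel labels.

    Assumption: a class that occurs neither in the reference nor in the
    prediction contributes an F1 of 0.0.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the reference says otherwise
            fn[t] += 1  # reference class t was missed
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: two timesteps, each a flattened 2x2 label map (hypothetical).
classes = ["forest", "water", "crop"]
reference = [
    ["forest", "water", "water", "crop"],
    ["forest", "water", "crop", "crop"],
]
predicted = [
    ["forest", "water", "crop", "crop"],
    ["forest", "water", "crop", "crop"],
]

# One score per timestep, mirroring the one-map-per-timestep output.
per_timestep = [mean_f1(r, p, classes) for r, p in zip(reference, predicted)]
print(per_timestep)
```

Computing the score per timestep keeps a misclassification at one epoch from being hidden by correct predictions at another, which matches the multi-temporal output described in the abstract.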
Keywords
- FCN, land cover classification, multi-temporal images, remote sensing, Swin Transformer
ASJC Scopus subject areas
- Physics and Astronomy(all)
- Instrumentation
- Environmental Science(all)
- Environmental Science (miscellaneous)
- Earth and Planetary Sciences(all)
- Earth and Planetary Sciences (miscellaneous)
Cite this
In: ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. 10, No. 1, 05.12.2023, p. 981-990.
Research output: Contribution to journal › Conference article › Research › peer review
TY - JOUR
T1 - Transformer Models For Multi-Temporal Land Cover Classification Using Remote Sensing Images
AU - Voelsen, M.
AU - Lauble, S.
AU - Rottensteiner, F.
AU - Heipke, C.
N1 - Funding Information: We thank the German Land Survey Office of Lower Saxony (Landesamt für Geoinformation und Landesvermessung Niedersachsen - LGLN) for providing the data of the geospatial database and for their support of this project. We thank NVIDIA Corporation for providing GPU resources to this project.
PY - 2023/12/5
Y1 - 2023/12/5
N2 - The pixel-wise classification of land cover, i.e. the task of identifying the physical material of the Earth's surface in an image, is one of the basic applications of satellite image time series (SITS) processing. With the availability of large amounts of SITS it is possible to use supervised deep learning techniques such as Transformer models to analyse the Earth's surface at global scale and with high spatial and temporal resolution. While most approaches for land cover classification focus on the generation of a mono-temporal output map, we extend established deep learning models to multi-temporal input and output: using images acquired at different epochs, we generate one output map for each input timestep. This has the advantage that the temporal change of land cover can be monitored. In addition, features conflicting over time are not averaged. We extend the Swin Transformer for SITS and introduce a new spatio-temporal transformer block (ST-TB) that extracts spatial and temporal features. We combine the ST-TB with the Swin Transformer block (STB), which is used in parallel for the individual input timesteps to extract spatial features. Furthermore, we investigate the usage of a temporal position encoding and different patch sizes. The patch size is used to merge neighbouring pixels in the input embedding. Using SITS from Sentinel-2, the classification of land cover is improved by +1.8% in the mean F1-Score when using the ST-TB in the first stage of the Swin Transformer compared to a Swin Transformer without the ST-TB layer and by +1.6% compared to fully convolutional approaches. This demonstrates the advantage of the introduced ST-TB layer for the classification of SITS.
KW - FCN
KW - land cover classification
KW - multi-temporal images
KW - remote sensing
KW - Swin Transformer
UR - http://www.scopus.com/inward/record.url?scp=85179017131&partnerID=8YFLogxK
U2 - 10.5194/isprs-annals-X-1-W1-2023-981-2023
DO - 10.5194/isprs-annals-X-1-W1-2023-981-2023
M3 - Conference article
AN - SCOPUS:85179017131
VL - 10
SP - 981
EP - 990
JO - ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
JF - ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences
SN - 2194-9042
IS - 1
T2 - ISPRS Geospatial Week 2023
Y2 - 2 September 2023 through 7 September 2023
ER -