Details
Original language | English |
---|---|
Title of host publication | 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) |
Pages | 9777-9783 |
Number of pages | 7 |
ISBN (electronic) | 979-8-3503-7770-5 |
Publication status | Published - 14 Oct 2024 |
Event | 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024 - Abu Dhabi, United Arab Emirates. Duration: 14 Oct 2024 → 18 Oct 2024 |
Publication series
Name | IEEE International Conference on Intelligent Robots and Systems |
---|---|
ISSN (Print) | 2153-0858 |
ISSN (electronic) | 2153-0866 |
Abstract
Understanding scene changes is crucial for embodied AI applications such as visual room rearrangement, where the agent must revert changes by restoring objects to their original locations or states. Understanding the visual changes between two scenes, pre- and post-rearrangement, encompasses two tasks: scene change detection (locating changes) and image difference captioning (describing changes). Previous methods, focused on sequential 2D images, have addressed these tasks separately, but we argue that they should be combined. Therefore, we propose a new Scene Change Understanding (SCU) task for simultaneous change detection and description. Moreover, we go beyond generating language descriptions of changes and aim to generate rearrangement instructions that the robotic agent can follow to revert them. To solve this task, we propose a novel method, EmbSCU, which compares instance-level masks of changed objects (for 53 frequently seen indoor object classes) before and after changes and generates rearrangement language instructions for the agent. EmbSCU is built on our Segment Any Object Model (SAOMv2), a fine-tuned version of the Segment Anything Model (SAM) adapted to obtain instance-level object masks for both foreground and background objects in indoor embodied environments. EmbSCU is evaluated on our own dataset of sequential 2D image pairs before and after changes, collected from the AI2-THOR simulator. The proposed framework achieves promising results in both change detection and change description. Moreover, EmbSCU shows positive generalization results on real-world scenes even though no real-life data is used during training. The dataset and the code are available online.
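The core idea in the abstract, comparing instance-level object masks before and after a change and turning the differences into instructions, can be illustrated with a toy sketch. This is not the EmbSCU implementation: the `Instance` type, the `rect` helper, the IoU threshold of 0.5, the one-instance-per-class assumption, and the instruction wording are all hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One segmented object: a class label plus the set of pixels it covers."""
    label: str
    pixels: frozenset  # of (row, col) tuples


def iou(a: frozenset, b: frozenset) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rect(label, top, left, height, width):
    """Hypothetical helper: build a rectangular instance mask."""
    pixels = frozenset(
        (r, c)
        for r in range(top, top + height)
        for c in range(left, left + width)
    )
    return Instance(label, pixels)


def revert_instructions(before, after, threshold=0.5):
    """Compare instances by label and report objects that moved or vanished.

    Assumes at most one instance per label, which real scenes do not guarantee.
    """
    after_by_label = {inst.label: inst for inst in after}
    instructions = []
    for inst in before:
        match = after_by_label.get(inst.label)
        if match is None:
            # Object is no longer visible in the "after" image.
            instructions.append(
                f"Find the {inst.label} and return it to its original place."
            )
        elif iou(inst.pixels, match.pixels) < threshold:
            # Mask overlap is low, so treat the object as moved.
            instructions.append(
                f"Move the {inst.label} back to its original place."
            )
    return instructions


before = [rect("mug", 0, 0, 4, 4), rect("book", 10, 10, 4, 4)]
after = [rect("mug", 0, 0, 4, 4), rect("book", 10, 20, 4, 4)]
print(revert_instructions(before, after))
```

In this toy scene the mug mask is unchanged and the book mask has shifted sideways with no overlap, so only the book yields an instruction. A real system would also need to handle state changes (for example, an opened drawer), multiple instances of the same class, and camera motion, none of which this sketch attempts.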
ASJC Scopus subject areas
- Engineering (all)
  - Control and Systems Engineering
- Computer Science (all)
  - Software
  - Computer Vision and Pattern Recognition
  - Computer Science Applications
Cite this
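Indoor Scene Change Understanding (SCU). / Khan, Mariia; Qiu, Yue; Cong, Yuren; Rosenhahn, Bodo; Suter, David; Abu-Khalaf, Jumana. 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). 2024. p. 9777-9783 (IEEE International Conference on Intelligent Robots and Systems).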
Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer review
TY - GEN
T1 - Indoor Scene Change Understanding (SCU)
T2 - 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2024
AU - Khan, Mariia
AU - Qiu, Yue
AU - Cong, Yuren
AU - Rosenhahn, Bodo
AU - Suter, David
AU - Abu-Khalaf, Jumana
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024/10/14
Y1 - 2024/10/14
UR - http://www.scopus.com/inward/record.url?scp=85216500891&partnerID=8YFLogxK
U2 - 10.1109/IROS58592.2024.10801354
DO - 10.1109/IROS58592.2024.10801354
M3 - Conference contribution
AN - SCOPUS:85216500891
SN - 979-8-3503-7771-2
T3 - IEEE International Conference on Intelligent Robots and Systems
SP - 9777
EP - 9783
BT - 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Y2 - 14 October 2024 through 18 October 2024
ER -