M3T: Multi-class Multi-instance Multi-view Object Tracking for Embodied AI Tasks

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer-review

Authors

  • Mariia Khan
  • Jumana Abu-Khalaf
  • David Suter
  • Bodo Rosenhahn

External Research Organisations

  • Edith Cowan University

Details

Original language: English
Title of host publication: Image and Vision Computing
Subtitle of host publication: 37th International Conference, IVCNZ 2022, Auckland, New Zealand, November 24–25, 2022, Revised Selected Papers
Editors: Wei Qi Yan, Minh Nguyen, Martin Stommel
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 246-261
Number of pages: 16
ISBN (electronic): 978-3-031-25825-1
ISBN (print): 978-3-031-25824-4
Publication status: Published - 2023
Event: 37th International Conference on Image and Vision Computing New Zealand, IVCNZ 2022 - Auckland, New Zealand
Duration: 24 Nov 2022 – 25 Nov 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13836 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Abstract

In this paper, we propose an extended multiple object tracking (MOT) task definition for the embodied AI visual exploration research task: multi-class, multi-instance and multi-view object tracking (M3T). The aim of the proposed M3T task is to identify the unique number of objects in the environment, observed along the agent's path, whether visible from a far or close view, from different angles, or only partially. Classic MOT algorithms are not applicable to the M3T task, as they typically target moving, single-class, multiple object instances in one video and track objects visible from only one angle or camera viewpoint. Thus, we present the M3T-Round algorithm, designed for a simple scenario in which an agent captures 12 image frames while rotating 360° from its initial position in a scene. We first detect each object in all image frames and then track objects (without any training), using a cosine similarity metric to associate object tracks. The detector part of our M3T-Round algorithm outperforms the baseline YOLOv4 algorithm [1] in detection accuracy, with a 5.26 point improvement in AP75. The tracker part of our M3T-Round algorithm shows a 4.6 point improvement in HOTA over the GMOTv2 algorithm [2], a recent, high-performance tracking method. Moreover, we have collected a new, challenging tracking dataset from the AI2-Thor simulator [3] for training and evaluation of the proposed M3T-Round algorithm.
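The training-free association step described in the abstract — matching detections across frames by the cosine similarity of their appearance features — can be sketched roughly as follows. This is an illustrative sketch only: the feature extractor, the similarity threshold, and the greedy matching strategy are assumptions, not details taken from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two appearance-feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def associate(tracks, detections, threshold=0.5):
    """Greedily assign each detection's feature vector to the most
    similar existing track; detections below the threshold start new
    tracks. `tracks` maps track id -> last appearance feature, and is
    updated in place. Returns {detection index: track id}."""
    next_id = max(tracks) + 1 if tracks else 0
    assignments = {}
    for i, feat in enumerate(detections):
        best_id, best_sim = None, threshold
        for tid, tfeat in tracks.items():
            sim = cosine_similarity(feat, tfeat)
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:           # no track is similar enough
            best_id = next_id
            next_id += 1
        tracks[best_id] = feat        # update the track's appearance
        assignments[i] = best_id
    return assignments
```

Run over the 12 frames of one 360° rotation, the number of distinct track ids at the end would estimate the unique object count; a full system would also use class labels and box geometry in the association.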

Keywords

    Embodied AI, Multiple Object Tracking, Scene Understanding

Cite this

Khan, M., Abu-Khalaf, J., Suter, D., & Rosenhahn, B. (2023). M3T: Multi-class Multi-instance Multi-view Object Tracking for Embodied AI Tasks. In W. Q. Yan, M. Nguyen, & M. Stommel (Eds.), Image and Vision Computing: 37th International Conference, IVCNZ 2022, Auckland, New Zealand, November 24–25, 2022, Revised Selected Papers (pp. 246-261). Lecture Notes in Computer Science, Vol. 13836 LNCS. Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-25825-1_18 (Epub 4 Feb 2023)

Scopus record: http://www.scopus.com/inward/record.url?scp=85147999282&partnerID=8YFLogxK
