M3T: Multi-class Multi-instance Multi-view Object Tracking for Embodied AI Tasks

Research output: Chapter in book/report/conference proceeding › Conference contribution › Research › peer-review

Authors

  • Mariia Khan
  • Jumana Abu-Khalaf
  • David Suter
  • Bodo Rosenhahn

External Research Organisations

  • Edith Cowan University

Details

Original language: English
Title of host publication: Image and Vision Computing
Subtitle of host publication: 37th International Conference, IVCNZ 2022, Auckland, New Zealand, November 24–25, 2022, Revised Selected Papers
Editors: Wei Qi Yan, Minh Nguyen, Martin Stommel
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 246-261
Number of pages: 16
ISBN (electronic): 978-3-031-25825-1
ISBN (print): 978-3-031-25824-4
Publication status: Published - 2023
Event: 37th International Conference on Image and Vision Computing New Zealand, IVCNZ 2022 - Auckland, New Zealand
Duration: 24 Nov 2022 – 25 Nov 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13836 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Abstract

In this paper, we propose an extended multiple object tracking (MOT) task definition for the embodied AI visual exploration research task: multi-class, multi-instance and multi-view object tracking (M3T). The aim of the proposed M3T task is to identify the unique number of objects in the environment, observed along the agent's path, whether visible from a far or close view, from different angles, or only partially. Classic MOT algorithms are not applicable to the M3T task, as they typically target moving, single-class, multiple object instances in one video and track objects visible from only one angle or camera viewpoint. Thus, we present the M3T-Round algorithm, designed for a simple scenario in which an agent captures 12 image frames while rotating 360° from its initial position in a scene. We first detect each object in all image frames and then track objects (without any training), using a cosine similarity metric to associate object tracks. The detector part of our M3T-Round algorithm outperforms the baseline YOLOv4 algorithm [1] in detection accuracy, with a 5.26 point improvement in AP75. The tracker part of our M3T-Round algorithm shows a 4.6 point improvement in HOTA over the GMOTv2 algorithm [2], a recent, high-performance tracking method. Moreover, we have collected a new, challenging tracking dataset from the AI2-Thor simulator [3] for training and evaluation of the proposed M3T-Round algorithm.
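The training-free association step described in the abstract — matching detections across frames by the cosine similarity of their appearance features — can be sketched roughly as follows. This is an illustrative sketch only: the feature extractor, the similarity threshold, and the greedy matching strategy are assumptions, not details taken from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two appearance-feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def associate(tracks, detections, threshold=0.5):
    """Greedily assign each detection's feature vector to the most
    similar existing track; detections below the threshold start new
    tracks. `tracks` maps track id -> last appearance feature, and is
    updated in place. Returns {detection index: track id}."""
    next_id = max(tracks) + 1 if tracks else 0
    assignments = {}
    for i, feat in enumerate(detections):
        best_id, best_sim = None, threshold
        for tid, tfeat in tracks.items():
            sim = cosine_similarity(feat, tfeat)
            if sim > best_sim:
                best_id, best_sim = tid, sim
        if best_id is None:           # no track is similar enough
            best_id = next_id
            next_id += 1
        tracks[best_id] = feat        # update the track's appearance
        assignments[i] = best_id
    return assignments
```

Run over the 12 frames of one 360° rotation, the number of distinct track ids at the end would estimate the unique object count; a full system would also use class labels and box geometry in the association.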

Keywords

    Embodied AI, Multiple Object Tracking, Scene Understanding

Cite this

Khan, M., Abu-Khalaf, J., Suter, D., & Rosenhahn, B. (2023). M3T: Multi-class Multi-instance Multi-view Object Tracking for Embodied AI Tasks. In W. Q. Yan, M. Nguyen, & M. Stommel (Eds.), Image and Vision Computing: 37th International Conference, IVCNZ 2022, Auckland, New Zealand, November 24–25, 2022, Revised Selected Papers (pp. 246-261). Lecture Notes in Computer Science, Vol. 13836 LNCS. Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-25825-1_18 (Epub 4 Feb 2023)

Scopus record: http://www.scopus.com/inward/record.url?scp=85147999282&partnerID=8YFLogxK
