LaMMOn: language model combined graph neural network for multi-target multi-camera tracking in online scenarios

Research output: Contribution to journal › Article › Research › peer review

Authors

  • Tuan T. Nguyen
  • Hoang H. Nguyen
  • Mina Sartipi
  • Marco Fisichella

Research Organisations

External Research Organisations

  • University of Tennessee, Chattanooga

Details

Original language: English
Pages (from-to): 6811–6837
Number of pages: 27
Journal: Machine learning
Volume: 113
Issue number: 9
Early online date: 15 Jul 2024
Publication status: Published - September 2024

Abstract

Multi-target multi-camera tracking is crucial to intelligent transportation systems, and numerous recent studies have addressed this problem. Nevertheless, applying these approaches in real-world situations is challenging due to the scarcity of publicly available data and the laborious process of manually annotating each new dataset and creating a tailored rule-based matching system for each camera scenario. To address this issue, we present a novel solution termed LaMMOn, an end-to-end transformer and graph neural network-based multi-camera tracking model. LaMMOn consists of three main modules: (1) the Language Model Detection (LMD) module for object detection; (2) the Language and Graph Model Association (LGMA) module for object tracking and trajectory clustering; and (3) the Text-to-embedding (T2E) module, which overcomes the problem of data limitation by synthesizing object embeddings from defined texts. LaMMOn runs online in real-time scenarios and achieves competitive results on many datasets, e.g., CityFlow (HOTA 76.46%), I24 (HOTA 25.7%), and TrackCUIP (HOTA 80.94%), with an acceptable FPS (from 12.20 to 13.37) for online applications.
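The abstract describes the architecture only at a high level; to make the module roles concrete, the following is a minimal, purely illustrative Python sketch of how the three modules could fit together in an online loop. All class, method, and parameter names (TextToEmbedding, LanguageModelDetector, LanguageGraphAssociator, run_online) are assumptions introduced for illustration and do not correspond to the authors' published code or API.

```python
# Conceptual sketch of the LaMMOn pipeline as described in the abstract above.
# Every name here is a hypothetical illustration, not the authors' actual code.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in frame coordinates
    embedding: List[float]                   # embedding used for association
    camera_id: int
    frame_id: int


class TextToEmbedding:
    """T2E: synthesizes object embeddings from defined text descriptions,
    which, per the abstract, eases the lack of annotated training data."""
    def encode(self, description: str) -> List[float]:
        raise NotImplementedError  # e.g., a pretrained text encoder


class LanguageModelDetector:
    """LMD: transformer-based detection producing boxes plus embeddings."""
    def detect(self, frame, camera_id: int, frame_id: int) -> List[Detection]:
        raise NotImplementedError


class LanguageGraphAssociator:
    """LGMA: graph neural network that links detections into tracks within a
    camera and clusters trajectories across cameras into global identities."""
    def associate(self, detections: List[Detection]) -> Dict[int, List[Detection]]:
        raise NotImplementedError  # returns {global_track_id: [detections]}


def run_online(frames_by_step, lmd: LanguageModelDetector, lgma: LanguageGraphAssociator):
    """Process one synchronized set of camera frames per time step, which is
    what allows the tracker to be used in online (streaming) scenarios."""
    tracks: Dict[int, List[Detection]] = {}
    for frame_id, frames in enumerate(frames_by_step):
        detections: List[Detection] = []
        for camera_id, frame in enumerate(frames):
            detections.extend(lmd.detect(frame, camera_id, frame_id))
        tracks = lgma.associate(detections)
    return tracks
```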

Keywords

    Language model, MTMCT, Multi-camera tracking, Object tracking

ASJC Scopus subject areas

Cite this

LaMMOn: language model combined graph neural network for multi-target multi-camera tracking in online scenarios. / Nguyen, Tuan T.; Nguyen, Hoang H.; Sartipi, Mina et al.
In: Machine learning, Vol. 113, No. 9, 09.2024, p. 6811–6837.

Research output: Contribution to journal › Article › Research › peer review

Nguyen TT, Nguyen HH, Sartipi M, Fisichella M. LaMMOn: language model combined graph neural network for multi-target multi-camera tracking in online scenarios. Machine learning. 2024 Sept;113(9):6811–6837. Epub 2024 Jul 15. doi: 10.1007/s10994-024-06592-1
Nguyen, Tuan T.; Nguyen, Hoang H.; Sartipi, Mina et al. / LaMMOn: language model combined graph neural network for multi-target multi-camera tracking in online scenarios. In: Machine learning. 2024; Vol. 113, No. 9. pp. 6811–6837.
BibTeX
@article{3add23712e2f4aa4966de3fb3766b661,
title = "LaMMOn: language model combined graph neural network for multi-target multi-camera tracking in online scenarios",
abstract = "Multi-target multi-camera tracking is crucial to intelligent transportation systems. Numerous recent studies have been undertaken to address this issue. Nevertheless, using the approaches in real-world situations is challenging due to the scarcity of publicly available data and the laborious process of manually annotating the new dataset and creating a tailored rule-based matching system for each camera scenario. To address this issue, we present a novel solution termed LaMMOn, an end-to-end transformer and graph neural network-based multi-camera tracking model. LaMMOn consists of three main modules: (1) Language Model Detection (LMD) for object detection; (2) Language and Graph Model Association module (LGMA) for object tracking and trajectory clustering; (3) Text-to-embedding module (T2E) that overcome the problem of data limitation by synthesizing the object embedding from defined texts. LaMMOn can be run online in real-time scenarios and achieve a competitive result on many datasets, e.g., CityFlow (HOTA 76.46%), I24 (HOTA 25.7%), and TrackCUIP (HOTA 80.94%) with an acceptable FPS (from 12.20 to 13.37) for an online application.",
keywords = "Language model, Mtmct, Multi-camera tracking, Object tracking",
author = "Nguyen, {Tuan T.} and Nguyen, {Hoang H.} and Mina Sartipi and Marco Fisichella",
note = "Publisher Copyright: {\textcopyright} The Author(s) 2024.",
year = "2024",
month = sep,
doi = "10.1007/s10994-024-06592-1",
language = "English",
volume = "113",
pages = "6811–6837",
journal = "Machine learning",
issn = "0885-6125",
publisher = "Springer Netherlands",
number = "9",

}

RIS

TY - JOUR

T1 - LaMMOn

T2 - language model combined graph neural network for multi-target multi-camera tracking in online scenarios

AU - Nguyen, Tuan T.

AU - Nguyen, Hoang H.

AU - Sartipi, Mina

AU - Fisichella, Marco

N1 - Publisher Copyright: © The Author(s) 2024.

PY - 2024/9

Y1 - 2024/9

N2 - Multi-target multi-camera tracking is crucial to intelligent transportation systems, and numerous recent studies have addressed this problem. Nevertheless, applying these approaches in real-world situations is challenging due to the scarcity of publicly available data and the laborious process of manually annotating each new dataset and creating a tailored rule-based matching system for each camera scenario. To address this issue, we present a novel solution termed LaMMOn, an end-to-end transformer and graph neural network-based multi-camera tracking model. LaMMOn consists of three main modules: (1) the Language Model Detection (LMD) module for object detection; (2) the Language and Graph Model Association (LGMA) module for object tracking and trajectory clustering; and (3) the Text-to-embedding (T2E) module, which overcomes the problem of data limitation by synthesizing object embeddings from defined texts. LaMMOn runs online in real-time scenarios and achieves competitive results on many datasets, e.g., CityFlow (HOTA 76.46%), I24 (HOTA 25.7%), and TrackCUIP (HOTA 80.94%), with an acceptable FPS (from 12.20 to 13.37) for online applications.

AB - Multi-target multi-camera tracking is crucial to intelligent transportation systems, and numerous recent studies have addressed this problem. Nevertheless, applying these approaches in real-world situations is challenging due to the scarcity of publicly available data and the laborious process of manually annotating each new dataset and creating a tailored rule-based matching system for each camera scenario. To address this issue, we present a novel solution termed LaMMOn, an end-to-end transformer and graph neural network-based multi-camera tracking model. LaMMOn consists of three main modules: (1) the Language Model Detection (LMD) module for object detection; (2) the Language and Graph Model Association (LGMA) module for object tracking and trajectory clustering; and (3) the Text-to-embedding (T2E) module, which overcomes the problem of data limitation by synthesizing object embeddings from defined texts. LaMMOn runs online in real-time scenarios and achieves competitive results on many datasets, e.g., CityFlow (HOTA 76.46%), I24 (HOTA 25.7%), and TrackCUIP (HOTA 80.94%), with an acceptable FPS (from 12.20 to 13.37) for online applications.

KW - Language model

KW - MTMCT

KW - Multi-camera tracking

KW - Object tracking

UR - http://www.scopus.com/inward/record.url?scp=85198649632&partnerID=8YFLogxK

U2 - 10.1007/s10994-024-06592-1

DO - 10.1007/s10994-024-06592-1

M3 - Article

AN - SCOPUS:85198649632

VL - 113

SP - 6811

EP - 6837

JO - Machine learning

JF - Machine learning

SN - 0885-6125

IS - 9

ER -
