Multi-target multi-camera tracking (MTMCT) has become indispensable in intelligent transportation systems, yet real-world deployment remains difficult because publicly available data are scarce and manual annotation is laborious. MTMCT tracks vehicles across multiple camera views: objects are detected in each view, linked over time by multi-object tracking, and finally clustered into trajectories that give a comprehensive picture of vehicle movement. Beyond limited datasets and the high expense of manual labeling, each new camera scenario has typically required its own hand-crafted matching rules.
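Concretely, the pipeline can be read as three stages chained together. The sketch below is a minimal skeleton of that flow; the function names and interfaces are illustrative placeholders, not taken from any particular system.

```python
def run_mtmct(camera_streams, detect, track, cluster):
    """Hypothetical MTMCT pipeline: per-camera detection and MOT, then cross-camera clustering."""
    tracklets = []
    for cam_id, frames in camera_streams.items():
        detections = [detect(frame) for frame in frames]  # stage 1: object detection per frame
        tracklets += track(cam_id, detections)            # stage 2: single-camera MOT -> tracklets
    return cluster(tracklets)                             # stage 3: cluster tracklets into global vehicle IDs
```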
Researchers from the University of Tennessee at Chattanooga and the L3S Research Center at Leibniz University Hannover have created LaMMOn, a comprehensive multi-camera tracking model built on transformers and graph neural networks. LaMMOn combines three components: a Language Model Detection (LMD) module for object detection, a Language and Graph Model Association (LGMA) module for tracking and trajectory clustering, and a Text-to-embedding (T2E) module that synthesizes object embeddings from text to overcome data limitations. By leveraging embeddings synthesized from text, LaMMOn removes the need for scenario-specific matching rules and costly manual labeling.
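Based on this description alone, the three modules might be composed roughly as follows. The class interfaces and method names here are assumptions for illustration, not the authors' implementation.

```python
import torch.nn as nn

class LaMMOnSketch(nn.Module):
    """Illustrative composition of LaMMOn's modules; interfaces are assumed, not from the paper."""
    def __init__(self, lmd: nn.Module, lgma: nn.Module, t2e: nn.Module):
        super().__init__()
        self.lmd = lmd    # Language Model Detection: frames -> boxes + object embeddings
        self.lgma = lgma  # Language and Graph Model Association: embeddings -> tracks/clusters
        self.t2e = t2e    # Text-to-embedding: text descriptions -> synthetic embeddings

    def forward(self, frames):
        boxes, embeddings = self.lmd(frames)   # detect objects in all camera frames
        tracks = self.lgma(boxes, embeddings)  # associate objects across frames and cameras
        return tracks

    def synthesize_training_embeddings(self, descriptions):
        # T2E produces labeled embeddings from text, reducing the need for manual annotation
        return self.t2e(descriptions)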
Traditionally, multi-object tracking (MOT) is performed on a single camera, generating tracklets by associating detected objects across video frames. MTMCT goes beyond this by merging object movements across several cameras, and is often treated as a clustering extension of single-camera MOT results. Methods such as spatial-temporal filtering and traffic-law constraints have achieved accurate tracking in this setting. LaMMOn, however, differs by unifying the detection and association tasks into a single process. To advance tracking performance and handle complex data structures, transformer models and graph neural networks (GNNs) have been adopted for multi-camera tracking.
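To make the "clustering extension of MOT" view concrete, the following sketch groups single-camera tracklets by agglomerative clustering over appearance embeddings, with a simple same-camera time-overlap gate standing in for spatial-temporal filtering. The features, gating rule, and threshold are assumptions, not the paper's method.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_tracklets(embeddings, time_spans, cam_ids, dist_threshold=0.5):
    """Assign global IDs to tracklets; same-camera, time-overlapping pairs are kept apart."""
    n = len(embeddings)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            same_cam = cam_ids[i] == cam_ids[j]
            overlap = time_spans[i][0] < time_spans[j][1] and time_spans[j][0] < time_spans[i][1]
            if same_cam and overlap:
                d = 1e6  # gate: one vehicle cannot occupy two simultaneous tracklets in one camera
            else:
                a, b = embeddings[i], embeddings[j]
                d = 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # cosine distance
            dist[i, j] = dist[j, i] = d
    labels = fcluster(linkage(squareform(dist), method='average'),
                      t=dist_threshold, criterion='distance')
    return labels
```

The gate encodes the same kind of traffic-law-style constraint mentioned above: physically impossible matches are priced out of the clustering rather than filtered afterward.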
The LaMMOn model comprises three key components: the LMD module, which detects objects and generates embeddings; the LGMA module, which handles multi-camera tracking and trajectory clustering; and the T2E module, which forms object embeddings from textual descriptions. By creating synthetic embeddings from text, the T2E module, built on SentencePiece, addresses data limitations and decreases labeling costs.
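The T2E idea can be illustrated as tokenizing a textual object description with SentencePiece and pooling learned token embeddings into a single object embedding. The model file path, embedding dimension, and mean pooling below are assumptions; only the use of SentencePiece comes from the paper.

```python
import torch
import torch.nn as nn
import sentencepiece as spm

class TextToEmbedding(nn.Module):
    """Minimal T2E sketch: SentencePiece tokens -> learned embeddings -> mean pool."""
    def __init__(self, spm_model_path: str, embed_dim: int = 256):
        super().__init__()
        self.sp = spm.SentencePieceProcessor(model_file=spm_model_path)  # pretrained tokenizer
        self.token_embed = nn.Embedding(self.sp.get_piece_size(), embed_dim)

    def forward(self, descriptions):
        # e.g. descriptions = ["a white sedan heading north in the left lane"]
        outs = []
        for text in descriptions:
            ids = torch.tensor(self.sp.encode(text, out_type=int))
            outs.append(self.token_embed(ids).mean(dim=0))  # pool tokens into one object embedding
        return torch.stack(outs)
```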
LaMMOn was tested on three MTMCT datasets: CityFlow, I24, and TrackCUIP. The model outperformed other methods on the CityFlow dataset and demonstrated strong efficiency on I24 and TrackCUIP. It achieved notable improvements in IDF1 and HOTA over baseline models while maintaining a practical FPS for online use.
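IDF1 measures how well predicted identities match ground-truth identities over whole trajectories (HOTA is typically computed with the TrackEval toolkit). As a hedged illustration, IDF1 can be computed on toy data with the open-source motmetrics package; this is not the authors' evaluation harness.

```python
import motmetrics as mm
import numpy as np

acc = mm.MOTAccumulator(auto_id=True)
# one toy frame: ground-truth IDs [1, 2] vs. hypothesis IDs ['a', 'b'];
# entries are gt-to-hypothesis distances, np.nan marks an impossible match
acc.update([1, 2], ['a', 'b'], [[0.1, np.nan], [np.nan, 0.2]])
mh = mm.metrics.create()
print(mh.compute(acc, metrics=['idf1', 'idp', 'idr'], name='toy'))
```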
In conclusion, the LaMMOn model offers an all-round multi-camera tracking solution that marries transformers and GNNs and overcomes the limitations of tracking-by-detection pipelines. The model minimizes manual labeling by producing object embeddings from text descriptions, and its LGMA-based trajectory clustering improves tracklet generation and adaptability across varied traffic scenarios. Demonstrating real-time online capability, LaMMOn achieves competitive performance across the CityFlow, I24, and TrackCUIP datasets.