Research on Multi-Object Tracking Technology Based on Transformer: MOTR
DOI:
https://doi.org/10.61173/gw85n885
Keywords:
Multi-object tracking, Transformer, MOTR, MOTRv2, MOTRv3
Abstract
Multi-object tracking (MOT) is a core task in computer vision. It has long been constrained by a paradigm that separates detection from association, which leads to error accumulation and semantic fragmentation between the two stages. The rise of the Transformer architecture has introduced a new paradigm for end-to-end tracking. The MOTR series progressively achieves deeper integration of detection and tracking through innovations such as tracking queries and dynamic supervision balancing. This paper systematically analyzes the technological evolution of MOTR, MOTRv2, and MOTRv3 along three dimensions: architecture design, training strategies, and performance optimization. Based on experimental results on datasets covering multiple scenarios, such as DanceTrack and MOT17, this study quantitatively evaluates the differences among the models in spatio-temporal modeling, computational efficiency, and generalization ability. The results show that MOTRv3 reaches 70.4% HOTA within a purely end-to-end framework through three strategies: Release-Fetch Supervision (RFS), Pseudo Label Distillation (PLD), and Track Group Denoising (TGD). However, its robustness under long-term occlusion and its computational cost still require improvement. Finally, in light of current technical challenges, the paper outlines future directions such as lightweight design, cross-modal fusion, and self-supervised learning.