Research on Multi-Object Tracking Technology Based on Transformer: MOTR

Authors

  • Jingyu Li

DOI:

https://doi.org/10.61173/gw85n885

Keywords:

Multi-object tracking, Transformer, MOTR, MOTRv2, MOTRv3

Abstract

Multi-object tracking (MOT) is a core task in computer vision. It has long been constrained by the paradigm that separates detection from association, which leads to error accumulation and semantic fragmentation. The rise of the Transformer architecture has introduced a new paradigm of end-to-end tracking. The MOTR series progressively achieves deep integration of detection and tracking through innovations such as tracking queries and dynamic supervision balancing. This paper systematically analyzes the technological evolution of MOTR, MOTRv2, and MOTRv3 along three dimensions: architecture design, training strategies, and performance optimization. Based on experimental results from multi-scenario datasets such as DanceTrack and MOT17, it quantitatively evaluates the differences among the models in spatio-temporal modeling, computational efficiency, and generalization ability. The results show that MOTRv3 achieves a performance breakthrough, reaching 70.4% HOTA in a purely end-to-end framework, through three strategies: Release-Fetch Supervision (RFS), Pseudo Label Distillation (PLD), and Track Group Denoising (TGD). However, its robustness under long-term occlusion and its computational cost still require optimization. Finally, in light of current technical challenges, the paper outlines future directions such as lightweight design, cross-modal fusion, and self-supervised learning.

Published

2025-08-26

Section

Articles