CNN-Transformer Hybrid Models for Object Detection: A Comprehensive Review
DOI:
https://doi.org/10.61173/19bc6r78
Keywords:
CNN-Transformer Hybrid Model, Serial Architecture Fusion Approach, Parallel Architecture Fusion Method
Abstract
Object detection is a core computer vision task that was initially dominated by conventional convolutional neural networks (CNNs). The emergence of the Transformer architecture has since improved detection accuracy and generalization, playing a pivotal role in advancing intelligent systems across various domains. More recently, the integration of CNN and Transformer architectures has become a key research direction in object detection. Hybrid architectures are designed to combine the strength of CNNs in local feature extraction with the capacity of Transformers for global context modeling, and this complementarity improves accuracy across a range of object recognition scenarios. This review begins with a concise overview of CNNs and Transformers and critically analyzes their respective advantages and limitations. It then systematically examines state-of-the-art hybrid architectures and their optimization strategies. A comparison and summary are presented in tabular form to support clear performance evaluation. Finally, the paper discusses the prospects of hybrid models in object detection and offers insights to guide further research.