Research on the Evolution and Classification of Autoregressive Text-to-Image Generation Models
DOI:
https://doi.org/10.61173/mn03wc25
Keywords:
Autoregressive Model, Text-Guided Image Generation, Cross-Modal Alignment, Dynamic Token Mechanism
Abstract
This paper systematically reviews the evolution and classification of autoregressive (AR) models for text-guided image generation, focusing on four representative approaches (ARINAR, Token-Shuffle, SimpleAR, and LlamaGen) and summarizing their strengths and limitations. The review finds that AR models, which generate an image element by element, excel in controllability and cross-modal alignment, yet remain constrained by the trade-off between generation efficiency and resolution, inadequate mapping of complex semantics, and high sensitivity during training. To address these issues, three optimization directions are proposed: introducing a dynamic token mechanism and parallel acceleration to improve generation efficiency; strengthening structured semantic modeling to improve the generation of complex semantics; and adopting more robust training strategies to strengthen model generalization. This research provides a theoretical foundation and technical reference for improving and applying AR models, and supports the practical development and real-world deployment of text-to-image generation technologies.