A Survey of Textual Adversarial Attacks and Defenses on Large Language Models (LLMs)
DOI: https://doi.org/10.61173/m8d1p054

Keywords: Large Language Models, Adversarial Attacks, Prompt Injection, Jailbreaking Attacks, Model Security

Abstract
The widespread deployment of large language models (LLMs) such as GPT-4 and LLaMA in dialogue systems, content generation, and related applications has amplified the impact of their security vulnerabilities. This survey reviews recent progress on textual adversarial attacks and defenses for LLMs, focusing on attack vectors and defensive techniques specific to these models. First, it presents a three-dimensional 'Target-Technology-Scenario' attack classification framework that covers the principal attack types, including prompt injection attacks (PIA) and jailbreaking attacks (JBA). Second, it examines defense mechanisms from two perspectives: alignment enhancement during the training phase and security controls during the inference phase. In addition, drawing on experiments with benchmark datasets such as HELM, it discusses the limitations of existing techniques and directions for future work. The aim is to provide guidance and references for subsequent theoretical studies of LLM security and to promote the construction of more robust NLP systems.