A Survey of Textual Adversarial Attacks and Defenses on Large Language Models (LLMs)
DOI: https://doi.org/10.61173/m8d1p054

Keywords: Large Language Models, Adversarial Attacks, Prompt Injection, Jailbreaking Attacks, Model Security

Abstract
The widespread deployment of large language models (LLMs) such as GPT-4 and LLaMA in dialogue systems, content generation, and related applications has amplified the impact of their security vulnerabilities. This survey reviews recent progress on textual adversarial attacks and defenses for LLMs, focusing on attack vectors and defensive techniques specific to these models. First, it presents a three-dimensional 'Target-Technology-Scenario' attack classification framework that covers the principal attack types, including prompt injection attacks (PIA) and jailbreaking attacks (JBA). Second, it examines defense mechanisms from two perspectives: alignment enhancement during the training phase and security controls during the inference phase. In addition, drawing on experiments with benchmark datasets such as HELM, it discusses the limitations of existing techniques and directions for future work. The aim is to provide guidance and references for subsequent theoretical studies of LLM security and to promote the construction of more robust NLP systems.