[1] TAI Jianwei, YANG Shuangning, WANG Jiajia, et al. Survey of Adversarial Attacks and Defenses for Large Language Models[J]. Journal of Computer Research and Development, 2025, 62(3): 563-588.
[2] WANG Xiaochen, ZHANG Kun, ZHANG Peng. Large Model Safety and Practice from Multiple Perspectives[J]. Journal of Computer Research and Development, 2024, 61(5): 1104-1112.
[3] LIANG Siyuan, HE Yingzhe, LIU Aishan, et al. A Review of Jailbreak Attacks and Defenses for Large Language Models[J]. Journal of Cyber Security, 2024, 9(5): 56-86.
[4] MU Honglin, HE Han, ZHOU Yuxin, et al. Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring[EB/OL]. (2025-03-06)[2025-06-12]. https://doi.org/10.48550/arXiv.2410.21083.
[5] CHANG Zhiyuan, LI Mingyang, LIU Yi, et al. Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues[C]// ACL. Findings of the Association for Computational Linguistics: ACL 2024. Stroudsburg: ACL, 2024: 5135-5147.
[6] YI Sibo, LIU Yule, SUN Zhen, et al. Jailbreak Attacks and Defenses against Large Language Models: A Survey[EB/OL]. (2024-08-30)[2025-06-12]. https://doi.org/10.48550/arXiv.2407.04295.
[7] ROBEY A, WONG E, HASSANI H, et al. SmoothLLM: Defending Large Language Models against Jailbreaking Attacks[EB/OL]. (2024-06-11)[2025-06-12]. https://doi.org/10.48550/arXiv.2310.03684.
[8] WANG Yihan, SHI Zhouxing, BAI A, et al. Defending LLMs against Jailbreaking Attacks via Backtranslation[C]// ACL. Findings of the Association for Computational Linguistics: ACL 2024. Stroudsburg: ACL, 2024: 16031-16046.
[9] XIE Yueqi, YI Jingwei, SHAO Jiawei, et al. Defending ChatGPT against Jailbreak Attack via Self-Reminders[J]. Nature Machine Intelligence, 2023, 5(12): 1486-1496.
[10] XIE Yueqi, FANG Minghong, PI Renjie, et al. GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis[C]// ACL. The 62nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2024: 507-518.
[11] XU Zhangchen, JIANG Fengqing, NIU Luyao, et al. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding[C]// ACL. The 62nd Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2024: 5587-5605.
[12] HU Xiaomeng, CHEN P Y, HO T Y. Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes[C]// NeurIPS. The 38th Annual Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2024: 1-32.
[13] BIANCHI F, SUZGUN M, ATTANASIO G, et al. Safety-Tuned Llamas: Lessons from Improving the Safety of Large Language Models that Follow Instructions[C]// ICLR. International Conference on Learning Representations. Washington DC: ICLR, 2024: 1-21.
[14] TOUVRON H, MARTIN L, STONE K, et al. LLaMA 2: Open Foundation and Fine-Tuned Chat Models[EB/OL]. (2023-07-18)[2025-06-12]. https://api.semanticscholar.org/CorpusID:259950998.
[15] DENG Boyi, WANG Wenjie, FENG Fuli, et al. Attack Prompt Generation for Red Teaming and Defending Large Language Models[C]// ACL. Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg: ACL, 2023: 2176-2189.
[16] RAFAILOV R, SHARMA A, MITCHELL E, et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model[J]. Advances in Neural Information Processing Systems, 2023, 36: 53728-53741.
[17] ZHANG Chi, ZHONG Huaping, ZHANG Kuan, et al. Harnessing Diversity for Important Data Selection in Pretraining Large Language Models[EB/OL]. (2024-10-05)[2025-06-12]. https://arxiv.org/html/2409.16986v2.
[18] LIU Pengfei, YUAN Weizhe, FU Jinlan, et al. Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing[J]. ACM Computing Surveys, 2023, 55(9): 1-35.
[19] GEMINI TEAM, GEORGIEV P, LEI V I, et al. Gemini 1.5: Unlocking Multimodal Understanding across Millions of Tokens of Context[EB/OL]. (2024-12-16)[2025-06-12]. https://arxiv.org/abs/2403.05530.
[20] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[C]// ACL. The 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Stroudsburg: ACL, 2019: 4171-4186.
[21] ARDITI A, OBESO O B, SYED A, et al. Refusal in Language Models is Mediated by a Single Direction[C]// NeurIPS. The 38th Annual Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2024: 1-47.
[22] LI Xuan, ZHOU Zhanke, ZHU Jianing, et al. DeepInception: Hypnotize Large Language Model to Be Jailbreaker[C]// NeurIPS. Conference on Neural Information Processing Systems. Cambridge: MIT Press, 2024: 1-65.
[23] DING Peng, KUANG Jun, MA Dan, et al. A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily[C]// ACL. The 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2024: 2136-2153.
[24] YUAN Youliang, JIAO Wenxiang, WANG Wenxuan, et al. GPT-4 is Too Smart to Be Safe: Stealthy Chat with LLMs via Cipher[C]// ICLR. The Twelfth International Conference on Learning Representations. Washington DC: ICLR, 2024: 1-21.
[25] VIKAS S, ALANKRITA, VIPIN C, et al. Adaptive Type-2 Fuzzy Filter with Kernel Density Estimation for Impulse Noise Removal[J]. IEEE Transactions on Fuzzy Systems, 2024, 32(12): 7183-7189.