Netinfo Security ›› 2025, Vol. 25 ›› Issue (4): 619-629. doi: 10.3969/j.issn.1671-1122.2025.04.010

• Special Topic Papers: Intelligent System Security •

Research on Hidden Backdoor Prompt Attack Methods Based on False Demonstrations

GU Huanhuan1,2, LI Qianmu1, LIU Zhen3, WANG Fangyuan1, JIANG Yu4

  1. School of Cyberspace Security, Nanjing University of Science and Technology, Nanjing 210094, China
  2. Nanjing Sinovatio Technology Co., Ltd., Nanjing 211153, China
  3. Guodian Nanjing Automation Co., Ltd., Nanjing 211106, China
  4. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  • Received: 2025-01-21 Online: 2025-04-10 Published: 2025-04-25
  • Corresponding author: GU Huanhuan 690554446@qq.com
  • About the authors:
    GU Huanhuan (born 1989), female, from Jiangsu, senior engineer, Ph.D. candidate, CCF member. Main research interests: network security technology and large model security technology.
    LI Qianmu (born 1979), male, from Anhui, professor, Ph.D., CCF senior member. Main research interests: information security, applications of sensor network technology, and intelligent decision-making.
    LIU Zhen (born 1996), male, from Qinghai, engineer, master's degree, CCF member. Main research interest: industrial information security technology.
    WANG Fangyuan (born 1985), male, from Jiangsu, Ph.D. candidate. Main research interests: tracing of black and gray market activity and online anti-fraud countermeasures.
    JIANG Yu (born 2000), male, from Jiangsu, master's degree. Main research interest: backdoor attacks.
  • Supported by:
    Jiangsu Provincial Special Project for the Transformation of Scientific and Technological Achievements (BA2022011)

Abstract:

This paper proposes HDPAttack, a hidden backdoor prompt attack method based on false demonstrations. The method uses the overall semantics of a natural language prompt as the trigger and inserts carefully constructed false demonstrations into the training data. These demonstrations are produced by semantically rephrasing the prompt, which yields false examples with high semantic consistency and guides the model to learn a specific trigger pattern in its deep representations. Unlike traditional backdoor attack methods, HDPAttack does not rely on rare words, special characters, or anomalous tokens. Instead, it generates false examples by changing how the prompt is phrased without significantly changing the semantics or labels of the input data. This evades detection techniques based on explicit anomalous features and allows the model to activate hidden backdoor behavior on seemingly normal inputs, improving both the stealth and the success rate of the attack. The method shows good potential for stealthy attacks and provides a reference for improving backdoor defense techniques.

Key words: large language model, pre-trained language model, backdoor attack, prompt learning

CLC Number: